Peter (mr.peter.lee@hotmail.com)
Last Update: 31 Dec 2015
Version: 1.0
The file (Analysis.ipynb) is write-protected to prevent accidental changes. To make changes or execute code blocks, please make a copy of the file by double-clicking it:
How to inspect the dataset quickly?
[x] completed [ ] to-do
To get an overview of the data first:
Daily holdings of all funds
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
from pandas import Series, DataFrame, Panel, datetime
from datetime import timedelta
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
import io
import os
import random
import statsmodels.stats.api as sms
from warnings import warn
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5 # Change default plot size
from ipywidgets import FloatProgress # Progress bar - f = FloatProgress(min=0, max=1); display(f); then in a loop: f.value = percentage
from IPython.display import display # Progress bar
# ****************** Program Settings ******************
Data_Folder = "data/new/"
Output_Folder = "data/output/"
Datafile_Main = "fin_mf_trades_1995_2014_fid.dta" # The main dataset - mutual fund holdings
Datafile_stocks = "compustat_finland_1990_2015_all.dta" # The dataset - stock prices / dividends
# Calculate Daily Holdings
Balanced_Panel = True # Default: True
# If True, add extra entries to the holdings data so that there is at least one observation per day:
#   a day with no transactions gets 1 (filler) entry;
#   a day with n transactions (n > 0) keeps its n entries.
Dividends_Reinvested = True # True if dividends are assumed to be fully reinvested; False otherwise.
# Note that Barber's method excludes dividends
Adjust_Price_Before_1999 = True # If True, divide 'price' before 1999 by 5.94573 (the official FIM-per-EUR conversion rate)
Vars_To_Keep = ['fund_id', 'isin', 'seq', 'acc', 'sect', 'owntype', 'legtype', 'ref_code',
'buy_sell', 'volume', 'price', 'date']
# To reduce memory usage, keep only the key variables in the data.
# Note that 'date' is actually 'trade_date'.
# ******************************************************
"data_main" is the mutual fund holdings dataset.
In [3]:
data_main = pd.read_stata(Data_Folder + Datafile_Main)
data_main.columns = [var.lower() for var in data_main.columns]
# There are several datetime variables -> remove the unused ones
data_main = data_main.drop(['date', 'tr_date', 'trans_date'], axis=1)
# Convert trade_date to datetime
data_main['date'] = pd.to_datetime(data_main.trade_date)
data_main = data_main.sort_index(axis=1)
# Generate an index for multiple transactions within the same day to preserve their order
def add_seq(grp):
"""Add the variable 'seq' to indicate the order of transactions within a day"""
grp['seq'] = np.arange(1, len(grp) + 1)
return grp
data_main = data_main.groupby(by=['fund_id', 'isin', 'date'], as_index=False).apply(add_seq)
data_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq', 'buy_sell'])
# data_main = data_main.ix[0:10000]
data_main.head(10)
Out[3]:
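As an aside, the groupby-apply above can be replaced by a vectorized one-liner; cumcount is standard pandas and is typically much faster on data of this size (a sketch, not used in the analysis below):
data_main['seq'] = data_main.groupby(['fund_id', 'isin', 'date']).cumcount() + 1  # same within-day ordering index, no Python-level apply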
In [4]:
def sum_stats(col):
"""Produce a table of summary statistics for a numerical variable"""
count = col.count()
mean = col.mean()
median = col.median()
stddev = col.std()
skewness = col.skew()
kurtosis = col.kurt()
quantile1 = col.quantile(0.25)
quantile2 = col.quantile(0.5)
quantile3 = col.quantile(0.75)
first_obs_index = col.first_valid_index()
last_obs_index = col.last_valid_index()
return pd.Series([count, mean, median, stddev, skewness, kurtosis, quantile1, quantile2, quantile3, first_obs_index, last_obs_index],
index=['count', 'mean', 'median', 'stddev', 'skewness', 'kurtosis', 'quantile1', 'quantile2', 'quantile3', 'first_obs_index', 'last_obs_index']
)
print("Descriptive Statistics for Numerical Variables:")
stats = data_main[['volume', 'price', 'prccd', 'ajexdi', 'cshoc', 'date']].sort_values(by=['date']).drop('date', axis=1).apply(sum_stats).T
# Fix up the dates
stats['first_obs'] = data_main.date[stats.T.ix['first_obs_index']].values
stats['last_obs'] = data_main.date[stats.T.ix['last_obs_index']].values
stats
Out[4]:
In [5]:
def_class = {
'HH': 'Household',
'FI': 'Financial institution (domestic)',
'FR': 'Foreign with own account at euroclear',
'FI': 'Foreign nominee account where most foreign holdings are registered',  # NOTE: duplicate key -- this silently overwrites 'Financial institution (domestic)' above; the intended code is unclear
'NF': 'Non-financial',
'GG': 'General government',
'NP': 'Non-for-profit',
'NA': "Not Available"
}
def_legtype = {
11 : 'NATURAL PERSON',
12 : 'ESTATE OF A DECEASED PERSON',
13 : 'TAXATION CONSORTIUM',
14 : 'PARTNERSHIP',
15 : "BANKRUPT'S ESTATE",
21 : 'LIMITED PARTNERSHIP',
22 : 'LAIVAISÄNNISTÖYHTIÖ (EI OSAKEYHTIÖ)',
30 : 'PUBLIC LIMITED COMPANY',
31 : 'LIMITED COMPANY',
32 : 'MUTUAL INSURANCE COMPANY',
33 : 'SAVINGS BANK',
34 : 'PENSION TRUST OR FUND',
35 : 'HOUSING CORPORATION',
41 : 'COOPERATIVE',
51 : 'FOUNDATION, FUND',
52 : 'NONPROFIT ASSOCIATION',
53 : 'MUTUAL INDEMNITY INSURANCE ASSOCIATION',
54 : 'ECONOMIC ASSOCIATION',
61 : 'PUBLIC AUTHORITY',
62 : 'PUBLIC UTILITY',
63 : 'PUBLIC CORPORATION',
71 : 'STATE CHURCH',
72 : 'OTHER RELIGIOUS COMMUNITY',
90 : 'OTHER JURIDICAL FORM',
'NA': "Not Available"
}
def_ref_code = {
1 : 'MARKETPLACE TRADE',
2 : 'OTHER TRADE',
3 : 'TRANSFER TO BOOKENTRIES',
4 : 'TRANSFER TO OWN ACCOUNT',
5 : 'CONVERSION',
6 : 'BEQUEST',
7 : 'INHERITANCE',
8 : 'WILL',
9 : 'DIVISION OF THE PROPERTY OF THE SPOU',
10 : 'FORCED SALE',
11 : 'TRANSFER INT. CERTIFICATE OF OWN',
12 : 'TRANSFER OF INTERNAT. CERTIFICATE',
13 : 'TRANSFER TO NOM.REG BE ACCOUNT',
14 : 'TRANSFER FROM NOM.REG BE ACCOUNT',
15 : 'TRADEBYTRADE SETTLEMENT',
16 : 'DIRECT SELLING',
17 : 'TRANSFER OF COLLATERAL',
18 : 'RETURN OF COLLATERAL',
19 : 'CANCELLATION OF RELATIONSHIP BASED O',
20 : 'ISSUE ACCOUNT TRANSACTION',
50 : 'RIGHTS ISSUE',
51 : 'BONUS ISSUE',
52 : 'SUBSCRIPTION',
53 : 'FORFEITURE',
54 : 'REDEMPTION',
55 : 'SPLIT',
56 : 'MERGER',
57 : 'COMBINATION',
58 : 'CANCELLATION OF A SECURITY',
59 : 'OTHER SUBSCRIPTION',
60 : 'SUBSCRIPTION RIGHT',
61 : 'TRANSFER OF RECORD TIME',
62 : 'SECURITY TYPE CONVERSION',
63 : 'CANCELLING OF SECURITY TYPE LOAN',
64 : 'REVERSE SPLIT',
65 : 'CANCELLATION OF SUBSCRIPTION',
71 : 'CONVERSION OF CONVERTIBLE BOND',
72 : 'SUBSCRIPTION WITH WARRANT',
73 : 'EXECUTION OF WARRANT',
75 : 'DEMERGER',
76 : 'STOCK LOAN',
77 : 'RETURN OF STOCK LOAN',
78 : 'RETURN OF WARRANTS',
96 : 'ACCOUNT TRANSFER',
97 : 'ARCHIVING',
98 : 'CORRECTION',
99 : 'OTHER',
0 : 'Not Available',
'NA': "Not Available"
}
In [6]:
def tab(col, desc=None):
"""Create a simple tabluation table for a categorical variable."""
obs = len(col)
# convert .value_counts() to a DataFrame
df = pd.DataFrame(col.value_counts()).reset_index()
df.columns = ['code', 'obs']
if desc is not None:
        df['desc'] = df.code.apply(lambda x: desc.get(x, "Unknown code"))
else:
df['desc'] = np.nan
df.sort_index(axis=1, inplace=True)
total_freq = np.sum(df.obs)
df['percent'] = df['obs'] / total_freq * 100
    # report non-missing observations as the last row
    df.loc[len(df)] = [np.nan, 'non-missing obs (percent is relative to total obs)', col.count(), col.count() / obs * 100]
return df
In [7]:
var = 'class'
dct = def_class
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[7]:
In [8]:
var = 'legtype'
dct = def_legtype
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[8]:
In [9]:
var = 'ref_code'
dct = def_ref_code
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[9]:
In [10]:
def_buy_sell = {10: "buy", 20: "sell",
                11: "unclear (treated as a buy below)",
                21: "unclear (treated as a sell below)"}
var = 'buy_sell'
dct = def_buy_sell
print("Summary statistics of variable: {}".format(var))
warn("Code 11, Code 21's definitions are unclear.")
tab(data_main[var], dct)
Out[10]:
In [11]:
var = 'owntype'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[11]:
In [12]:
var = 'sect'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[12]:
In [13]:
var = 'currency'
dct = None
print("Summary statistics of variable: {}".format(var))
print('20% of transactions are denominated in FIM.')
tab(data_main[var], dct)
Out[13]:
In [14]:
var = 'byear'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[14]:
In [15]:
var = 'gender'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[15]:
In [16]:
var = 'pcode'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[16]:
In [17]:
var = 'country'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[17]:
In [18]:
var = 'language'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[18]:
In [19]:
var = 'nom_ind'
dct = None
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)
Out[19]:
In [20]:
grouped = data_main.groupby('fund_id', as_index=False, axis=0)
list_of_numerical_vars = ['volume', 'price', 'prccd', 'ajexdi', 'cshoc', 'date']
list_of_categorical_vars = ['class', 'legtype', 'ref_code', 'buy_sell', 'sect', 'gender', 'pcode']
dict_of_categorical_vars = {'class': def_class,
'legtype': def_legtype,
'ref_code': def_ref_code,
'buy_sell': def_buy_sell,
'sect': None,
'gender': None,
'pcode': None}
In [21]:
for name, group in grouped:
print("********** Fund: {} **********".format(name))
print("Numerical variables:")
print(group[list_of_numerical_vars].describe().T)
print("\n")
print("Categorical variables:")
for var in list_of_categorical_vars:
print("- Variable: {}".format(var))
print(tab(group[var], dict_of_categorical_vars[var]).to_string(index=False))
print("\n")
print("\n")
print("\n")
In [22]:
# Resampling from daily to annual frequency
# Down-sampling rule: some variables are averaged over the year; others are summed.
vars_compute_average = ['price', 'prccd', 'ajexdi', 'cshoc']
vars_compute_sum = ['buy_value', 'sell_value', 'buy_volume', 'sell_volume', 'buy', 'sell']
In [23]:
ts_main = data_main  # NB: a reference, not a copy -- the columns generated below are also added to data_main (reused in later cells)
# Generate variables
ts_main['buy'] = (ts_main.buy_sell == 10) | (ts_main.buy_sell == 11)
ts_main['sell'] = (ts_main.buy_sell == 20) | (ts_main.buy_sell == 21)
ts_main['buy_volume'] = np.nan
ts_main.ix[ts_main['buy']==True, 'buy_volume'] = ts_main['volume']
ts_main['sell_volume'] = np.nan
ts_main.ix[ts_main['sell']==True, 'sell_volume'] = ts_main['volume']
ts_main['buy_value'] = np.nan
ts_main.ix[ts_main['buy']==True, 'buy_value'] = ts_main['volume'] * ts_main['price']
ts_main['sell_value'] = np.nan
ts_main.ix[ts_main['sell']==True, 'sell_value'] = ts_main['volume'] * ts_main['price']
ts_main.index = ts_main.date
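Equivalently, the masked assignments above can be written with np.where, which is a little more compact (a sketch; the notebook keeps the masked-assignment form):
ts_main['buy_volume'] = np.where(ts_main['buy'], ts_main['volume'], np.nan)
ts_main['sell_volume'] = np.where(ts_main['sell'], ts_main['volume'], np.nan)
ts_main['buy_value'] = np.where(ts_main['buy'], ts_main['volume'] * ts_main['price'], np.nan)
ts_main['sell_value'] = np.where(ts_main['sell'], ts_main['volume'] * ts_main['price'], np.nan)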
In [24]:
ts_temp_mean = ts_main[vars_compute_average].resample('A', how='mean')
ts_temp_sum = ts_main[vars_compute_sum].resample('A', how='sum')
ts_annual = pd.concat([ts_temp_mean, ts_temp_sum], axis=1)
ts_annual.columns = ['avg_price', 'avg_prccd', 'avg_ajexdi', 'avg_cshoc', 'sum_buy_value', 'sum_sell_value', 'sum_buy_vol', 'sum_sell_vol', 'sum_buys', 'sum_sells']
print('Variable definitions:')
print('sum_buys: number of buys during the year')
print('sum_buy_vol: total buy volume during the year')
print('sum_buy_value (in millions): total turnover of purchase value during the year')
ts_annual['sum_buy_value'] = ts_annual['sum_buy_value']/1000000
ts_annual['sum_sell_value'] = ts_annual['sum_sell_value']/1000000
ts_annual
Out[24]:
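Note: the how=/fill_method= keywords to .resample() used throughout this notebook were written for the pandas of late 2015 and were deprecated in later releases; under pandas >= 0.18 the equivalent spelling would be, for example:
ts_temp_mean = ts_main[vars_compute_average].resample('A').mean()
ts_temp_sum = ts_main[vars_compute_sum].resample('A').sum()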
In [25]:
ts_buys = data_main[['fund_id', 'buy', 'sell', 'buy_value', 'sell_value', 'date']].copy()
ts_buys.index = ts_buys.date
ts_buys.drop('date', axis=1, inplace=True)
In [26]:
# Group by fund_id
ts_buys_grouped = ts_buys.groupby('fund_id', as_index=False, axis=0)
i = 1
for name, group in ts_buys_grouped:
# Resample from daily
ts_temp = group.drop('fund_id', axis=1).resample('M', how='sum')
# Plot
plt.figure(i)
plt.plot(ts_temp.index, ts_temp.buy, 'b', label='number of buys')
plt.plot(ts_temp.index, ts_temp.sell, 'r', label='number of sells')
plt.legend(loc='upper left')
plt.xlabel('Time')
plt.ylabel('Freq')
plt.grid(True)
plt.title('Number of Buys/Sells Per Month for Fund {}'.format(name))
i += 1
plt.show()
In [27]:
# Group by fund_id
i = 1
for name, group in ts_buys_grouped:
# Resample from daily
ts_temp = group.drop('fund_id', axis=1).resample('M', how='sum')
# Plot
plt.figure(i)
plt.plot(ts_temp.index, ts_temp.buy_value, 'b', label='total value of buys')
plt.plot(ts_temp.index, ts_temp.sell_value, 'r', label='total value of sells')
plt.legend(loc='upper left')
plt.xlabel('Time')
plt.ylabel('Value in Reported Currency')
plt.grid(True)
plt.title('Value of Buys/Sells Per Month for Fund {}'.format(name))
i += 1
plt.show()
In [ ]:
"""
Module: Calculate First-In-First-Out (FIFO) profits from account transaction data
Author: Peter Lee (mr.peter.lee@hotmail.com)
Last update: 2016-Jan-15
This module calculates the investment profit/loss of each sell in the account transaction history using the First-In-First-Out (FIFO) method.
A number of checks are in place to improve data quality:
1.) Cash dividends are assumed to be fully reinvested by default.
2.)
"""
##
import numpy as np
import pandas as pd
import os.path
def check_missing_in_data_stocks(data_stocks):
"""
    Remove missing entries from the End-Of-Day stocks data.
Notes:
The variables in this DataFrame are:
['isin', 'date', 'ajexdi', 'div', 'prccd']
"""
missing_isin = len(data_stocks[data_stocks['isin'] == ""])
print("Number of missing isin entries in the stocks price data: {}".format(missing_isin))
print("These missing entries are dropped.")
data_stocks.drop(data_stocks.index[data_stocks['isin'] == ""], inplace=True)
print("\n")
    # Check zero dividend payments
    print("Number of zero-dividend payments: {}".format(np.sum(data_stocks['div'] == 0)))
    print("Zero-dividend payments are recoded as missing.")
data_stocks.ix[data_stocks['div'] == 0, 'div'] = np.nan
print("\n")
# Check missing prccd
print("Number of missing prccd entries: {}".format(np.sum(np.isnan(data_stocks['prccd']))))
print("They are dropped")
data_stocks.drop(data_stocks.index[np.isnan(data_stocks['prccd'])], inplace=True)
print("\n")
# Check missing ajexdi
print("Number of missing ajexdi entries: {}".format(np.sum(np.isnan(data_stocks['ajexdi']))))
print("They are dropped")
data_stocks.drop(data_stocks.index[np.isnan(data_stocks['ajexdi'])], inplace=True)
return data_stocks
def add_div_index(group):
"""
    Add a dividend index column to the group (the End-of-Day rows for one isin).
"""
# Replace NANs in "div" as ZERO
group.ix[group['div'].isnull(), 'div'] = 0
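    # divIndex compounds daily dividend yields:
    #   divIndex_t = divIndex_{t-1} * (1 + div_t / prccd_{t-1})
    # e.g. a 1.00 dividend on a stock that closed at 10.00 the day before multiplies the index by 1.1.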
group['divIndex'] = (group['div'] / group['prccd'].shift(1) + 1).cumprod()
# Set the value in the first row
group.ix[group.index[0], 'divIndex'] = 1
return group
def compute_dividend_index(data_stocks):
"""
    Calculate the dividend index to show the cumulative effect of dividend payments. The index begins at 1.
"""
data_stocks['divIndex'] = np.nan
data_stocks.sort_values(by=['isin', 'date'], inplace=True)
return data_stocks.groupby(by='isin', as_index=False).apply(add_div_index)
def add_mod_ajexdi(group):
"""
    Add a modified ajexdi column to the group.
"""
# Calculate the change factor
group['change_factor'] = group['ajexdi'] / group['ajexdi'].shift(1) /\
group['divIndex'] * group['divIndex'].shift(1)
group.ix[group.index[0], 'change_factor'] = group.ix[group.index[0], 'ajexdi']
group['modAjexdi'] = group['change_factor'].cumprod()
group.drop('change_factor', axis=1, inplace=True)
return group
def calculate_mod_ajexdi(data_stocks):
"""
Calculate the modified ajexdi from divIndex per isin.
"""
return data_stocks.groupby(by='isin', as_index=False).apply(add_mod_ajexdi)
def data_main_add_seq(grp):
"""Add the variable 'seq' to indicate the order of transactions within a day"""
# Generate an index for multiple transactions within the same day to preserve their order
grp['seq'] = np.arange(1, len(grp) + 1)
return grp
def data_main_fill_na(grp):
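    """Impute missing values within a (fund_id, isin) group: forward fill, then backfill any leading gaps."""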
lst_vars = ['modAjexdi', 'prccd', 'divIndex']
for x in lst_vars:
grp[x] = grp[x].fillna(method='ffill')
grp[x] = grp[x].fillna(method='bfill')
return grp
def calculate_fifo_profit(data_main):
"""
Correct any "over-sale" transactions.
Calculate FIFO profits.
"""
return data_main.groupby(by=['fund_id', 'isin'], as_index=False).apply(data_main_add_profit)
def data_main_add_profit(group):
"""
    Add a new column, 'profit' (the FIFO profit), while correcting any "over-sale" transactions.
"""
    # Scaffolding for manual testing (kept for reference):
    # gp = data_main.groupby(by=['fund_id', 'isin'], as_index=False)
    # temp = gp.get_group(('FI0008804422', 'FI0009000707'))
    # group = temp.ix[temp.index[:15], ['fund_id', 'isin', 'date', 'buy_sell', 'volume', 'price', 'modAjexdi']]
# Identify all buys
buys = group.ix[(group['buy_sell'] == 10) | (group['buy_sell'] == 11), ['date', 'price', 'volume', 'modAjexdi']]
# position (stock volume)
group['position'] = np.nan # Stock volumes in position
group['trans'] = np.nan # Transaction volume after correcting over-sale
group['profit'] = 0 # Fifo-profit
# Calculate positions
previous_ajexdi = None
previous_position = None
position = 0
ajexdi = 0
volume = 0
order = None
# ith_sell = 1
for i, row in group.iterrows():
ajexdi = row['modAjexdi']
volume = row['volume']
# Identify order type
if (row['buy_sell'] == 10) | (row['buy_sell'] == 11):
order = 1
elif (row['buy_sell'] == 20) | (row['buy_sell'] == 21):
order = -1
else:
order = 0
# 1st row
if previous_position is None:
# if there is a sell or no trans
group.ix[i, 'trans'] = 0
group.ix[i, 'position'] = 0
previous_position = 0
previous_ajexdi = row['modAjexdi']
# if there is a buy
if (order == 1):
group.ix[i, 'trans'] = volume
group.ix[i, 'position'] = volume
previous_position = row['volume']
continue
# >1st row
# Convert position before buy/sell
position = previous_position / ajexdi * previous_ajexdi
# Check if there is a over-sale
if (order == -1) & (volume > position):
volume = position
# Update position after buy/sell
position += volume * order
group.ix[i, 'trans'] = (volume if order >= 0 else volume * order)
group.ix[i, 'position'] = position
# Update previous_row
previous_ajexdi = row['modAjexdi']
previous_position = position
# Calculate profit if there is a sell
if (order == -1):
# print('Sell {} at index {}'.format(ith_sell, row.name))
# ith_sell += 1
sell_revenue = volume * row['price']
sell_index = row.name
buy_index = 0
purchase_cost = 0
# The correct sell volume is "volume"
# k = 1
while volume >= 1 and buy_index < sell_index:
# print('k is', k)
# print(buys)
# k += 1
for j, buy in buys.iterrows():
buy_index = buy.name
buy_volume_at_sell = buy['volume'] * buy['modAjexdi'] / row['modAjexdi']
# print('j is', j, 'Cvt_buy_vol', buy_volume_at_sell)
if volume > buy_volume_at_sell:
volume -= buy_volume_at_sell
buys.ix[j, 'volume'] = 0
purchase_cost += buy['volume'] * buy['price']
else:
offset_vol = (volume * row['modAjexdi'] / buy['modAjexdi'])
buys.ix[j, 'volume'] = buy['volume'] - offset_vol
purchase_cost += offset_vol * buy['price']
volume = 0
group.ix[i, 'profit'] = sell_revenue - purchase_cost
return group
def expand_data(data_main):
"""
    Expand the holdings dataset so that there is one End-of-Day entry for the entire holding period.
"""
data_main = data_main.groupby(by=['fund_id', 'isin'], as_index=False).apply(add_eod_rows_to_holding_period)
data_main.reset_index(drop=True, inplace=True)
return data_main
def add_eod_rows_to_holding_period(group):
# Create time series
group.index = group['date']
group.drop('date', axis=1, inplace=True)
# Resample to End-of-Day data
transform = {
'agg_volume': group['volume'].resample('D', how='sum', fill_method='ffill', label='right', closed='right'),
'num_acts': group['buy_sell'].resample('D', how='count', fill_method='ffill', label='right', closed='right'),
'avg_price': group['price'].resample('D', how='mean', fill_method='ffill', label='right', closed='right'),
'avg_ajexdi': group['modAjexdi'].resample('D', how='mean', fill_method='ffill', label='right', closed='right'),
'position': group['position'].resample('D', how='last', fill_method='ffill', label='right', closed='right'),
'trans': group['trans'].resample('D', how='sum', fill_method='ffill', label='right', closed='right'),
'profit': group['profit'].resample('D', how='sum', fill_method='ffill', label='right', closed='right')
}
# Up-sample
ts_daily = pd.DataFrame(transform)
# Deal with string variables specifically
lst_fill = ['fund_id', 'isin']
for x in lst_fill:
ts_daily[x] = group.ix[0, x]
ts_daily.reset_index(inplace=True)
ts_daily.rename(columns={'index': 'date'}, inplace=True)
return ts_daily
##
def main(
data_folder = "data/new/",
output_folder = "data/output/"
filename_transactions = "fin_mf_trades_1995_2014_fid.dta",
filename_stocks = "compustat_finland_1990_2015_all.dta",
Balanced_Panel = True,
debug = True,
debug_file = "test1.xlsx",
Account_For_Dividend = True,
Adjust_Price_Before_1999 = True):
if debug:
test_folder = "data/tests/"
        # Test 1
        test_file = debug_file
test = pd.read_excel(os.path.join(test_folder, test_file))
test = calculate_fifo_profit(test)
test.to_excel(os.path.join(test_folder, test_file.split('.')[0] + "_output.xlsx"))
test = expand_data(test)
test.to_excel(os.path.join(test_folder, test_file.split('.')[0] + "_output2.xlsx"))
# Loading data_stocks
# ===================
data_stocks = pd.read_stata(os.path.join(data_folder, filename_stocks))
data_stocks['date'] = pd.to_datetime(data_stocks['date'], format="%Y%m%d")
data_stocks = data_stocks[['isin', 'date', 'ajexdi', 'div', 'prccd']]
# Check missing data in the data_stocks
data_stocks = check_missing_in_data_stocks(data_stocks)
# Calculate cumulative dividend index
data_stocks = compute_dividend_index(data_stocks)
# Modify ajexdi to include dividend payment (the original ajexdi is assumed to be net of dividends)
if Account_For_Dividend:
data_stocks = calculate_mod_ajexdi(data_stocks)
else:
        data_stocks['modAjexdi'] = data_stocks['ajexdi']
##
# Load data_transactions
# ======================
data_main = pd.read_stata(os.path.join(data_folder, filename_transactions))
data_main.columns = [var.lower() for var in data_main.columns]
# There are several datetime variables -> remove the unused ones
data_main = data_main.drop(['date', 'tr_date', 'trans_date'], axis=1)
# Convert trade_date to datetime
data_main['date'] = pd.to_datetime(data_main.trade_date)
    data_main = data_main.sort_index(axis=1)
data_main = data_main.groupby(by=['fund_id', 'isin', 'date'], as_index=False).apply(data_main_add_seq)
    data_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq', 'buy_sell'])
if Adjust_Price_Before_1999:
data_main.ix[data_main['date'] < "1999-Jan-01", 'price'] = data_main['price'] / 5.94573
# Merge data_stocks with data_transactions, and impute missing prices.
data_main.drop('prccd', inplace=True, axis=1)
data_main = data_main.merge(data_stocks, how='left', on=['isin', 'date'])
data_main = data_main[['fund_id', 'isin', 'date', 'seq', 'acc', 'sect', 'owntype', 'legtype', 'ref_code', 'buy_sell', 'volume', 'price', 'prccd', 'divIndex', 'modAjexdi', 'div']]
##
# Inspect missing data
# check if there is any missing values in ajexdi and divIndex
lst_vars = ['modAjexdi', 'divIndex', 'price', 'buy_sell', 'volume']
print("Number of obs in data: {}".format(len(data_main)))
for x in lst_vars:
print("missing values in {}: {}".format(x, np.sum(np.isnan(data_main[x]))))
# Use forward fill to impute ajexdi; prccd, divIndex;
data_main = data_main.groupby(by=['fund_id', 'isin'], as_index=False).apply(data_main_fill_na)
print("Need to get rid of missing ajexdi or divIndex.")
data_main = data_main.ix[data_main['modAjexdi'].notnull()].ix[data_main['divIndex'].notnull()]
# Check for over-sales and compute holdings
    # Some funds may "over-sell" a stock, which would produce a negative position. This is prohibited: any sell that would cause a negative balance is reduced so that the balance ends at exactly zero.
data_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq']).reset_index(drop=True)
##
data_main = calculate_fifo_profit(data_main)
if Balanced_Panel:
# Expand the data_main - to fill in daily holdings in between transaction records.
data_main = expand_data(data_main)
data_main.to_excel(output_folder + "data_main.xlsx")
data_stocks.to_excel(output_folder + "data_stocks.xlsx")
In [ ]:
# Execute the FIFO profit calculations and obtain the output datasets
main(data_folder = Data_Folder,
output_folder = Output_Folder,
filename_transactions = Datafile_Main,
filename_stocks = Datafile_stocks,
Balanced_Panel = Balanced_Panel,
Account_For_Dividend = Dividends_Reinvested,
Adjust_Price_Before_1999 = Adjust_Price_Before_1999,
debug = False
)