In this file, we prepare the first of a few CRSP files we will be using in the TAQ/CRSP merge. The main goals are to: 1) select the subset of securities in the original dsenames.csv file that we want to examine; 2) separate some ticker symbols into two parts, a symbol root and a symbol suffix; and 3) extract the time period during which a security was traded.
In order to work with the CRSP data, we will be using the pandas library, a rather intuitive Python tool for working with panel data.
In [2]:
import pandas as pd
import numpy as np
dsenames = pd.read_csv('../crsp/dsenames.csv')
The first subsetting to be done is by time period. For Justin's purposes, we are just looking at securities that traded from 2010 onwards.
In [3]:
dsenames.head(10)
Out[3]:
In [4]:
dsenames = dsenames[dsenames['NAMEENDT']>=20100101]
We are also interested in looking only at securities that are being actively traded during the period in question.
In [5]:
# SECSTAT == "R" indicates that the security is "Regular" (i.e. the security is past the "When-Issued" stage
# and the company is not going through bankruptcy proceedings)
dsenames = dsenames[dsenames['SECSTAT']=="R"]
# TRDSTAT == "A" indicates that the security is actively trading
dsenames = dsenames[dsenames['TRDSTAT']=="A"]
Here, two new variables are created to keep track of the length of the ticker symbol. The tsymbol variable holds the trading ticker symbol on the security's primary exchange, while the ticker variable, for some stocks, carries a bit more information (e.g. the exchange on which a security is traded). These lengths are vital for identifying securities whose trading symbols carry a suffix.
In [6]:
dsenames['SYMBOL_LEN'] = dsenames['TSYMBOL'].str.len()
dsenames['TICKER_LEN'] = dsenames['TICKER'].str.len()
# There are 239 cases where the length of tsymbol does not match up with the length of ticker
# For most of these cases, it is because tsymbol includes a suffix of some sort, while ticker does not
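As a quick illustration of why these lengths matter, here is a toy example with made-up tickers (not drawn from the CRSP file):

```python
import pandas as pd

# Hypothetical symbols: 'ACMEA' carries a class-A suffix that TICKER lacks
toy = pd.DataFrame({
    "TICKER":  ["IBM", "ACME"],
    "TSYMBOL": ["IBM", "ACMEA"],
})
toy["SYMBOL_LEN"] = toy["TSYMBOL"].str.len()
toy["TICKER_LEN"] = toy["TICKER"].str.len()
# A length mismatch flags a likely suffix
mismatch = toy[toy["SYMBOL_LEN"] != toy["TICKER_LEN"]]
print(mismatch["TSYMBOL"].tolist())  # ['ACMEA']
```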
These next two variables, last and last2, store the last (or last two) letters in a company's ticker symbol.
In [7]:
dsenames['LAST'] = dsenames['TSYMBOL'].str[-1]
# This grabs the last two elements of tsymbol
dsenames['LAST2'] = dsenames['TSYMBOL'].str[-2:]
Justin excludes the first two records for being apparently bogus, and the third for disagreeing substantially with the TAQ data.
In [8]:
bad_permnos = [14209, 15141, 91845]
# Here, we filter out any securities whose records don't make sense
dsenames = dsenames[~dsenames['PERMNO'].isin(bad_permnos)]
The following two blocks allow us to remove two types of stock that we want to exclude from analysis: beneficial and when-issued securities.
In [9]:
# The 'S' suffix indicates "Shares of Beneficial Interest", which do not trade like regular securities
# The 'V' and 'WI' suffixes indicate "When-Issued" shares, which have been authorized to trade, but have not actually begun trading
nasdaq = dsenames.PRIMEXCH == "Q"
beneficial = (dsenames['LAST'] == "S") & (dsenames['SYMBOL_LEN'] == 5)
whenissued_nasdaq = (dsenames['LAST'] == "V") & (dsenames['SYMBOL_LEN'] == 5) & nasdaq
whenissued_nonnasdaq = (dsenames['LAST2'] == "WI") & (dsenames['SYMBOL_LEN'] > 3) & ~nasdaq
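To see how the three masks behave, here is a small sketch on hypothetical rows (the tickers are made up, not CRSP records):

```python
import pandas as pd

# Made-up rows: one beneficial, one NASDAQ when-issued,
# one non-NASDAQ when-issued, and one ordinary symbol
toy = pd.DataFrame({
    "TSYMBOL":  ["ABCDS", "ABCDV", "ABCWI", "IBM"],
    "PRIMEXCH": ["Q", "Q", "N", "N"],
})
toy["SYMBOL_LEN"] = toy["TSYMBOL"].str.len()
toy["LAST"] = toy["TSYMBOL"].str[-1]
toy["LAST2"] = toy["TSYMBOL"].str[-2:]
nasdaq = toy.PRIMEXCH == "Q"
beneficial = (toy["LAST"] == "S") & (toy["SYMBOL_LEN"] == 5)
wi_nasdaq = (toy["LAST"] == "V") & (toy["SYMBOL_LEN"] == 5) & nasdaq
wi_other = (toy["LAST2"] == "WI") & (toy["SYMBOL_LEN"] > 3) & ~nasdaq
# Combine with | : a record only needs to match one mask to be dropped
kept = toy[~(beneficial | wi_nasdaq | wi_other)]
print(kept["TSYMBOL"].tolist())  # ['IBM']
```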
In [10]:
# Here, we take out the beneficial and when-issued shares from both NASDAQ and non-NASDAQ securities
## Note that the masks are combined with | (or): no single record can satisfy all three at once
dsenames = dsenames[~(beneficial | whenissued_nasdaq | whenissued_nonnasdaq)]
Next, I create a few new variables that will be included in the clean dsenames file. The sym_root and sym_suffix variables will hold the two parts of the security's ticker symbol. If the security has a suffix that indicates something special about it (for example, a company may have class A and class B shares, which are distinguished in the ticker symbol), the suffix will be stored in sym_suffix. If a security has no suffix, sym_suffix will be an empty string.
In [11]:
# Flag = 0 is our base case (i.e. the ticker symbol has no suffix)
dsenames['FLAG'] = 0
# When the ticker has no suffix, the root is just the ticker symbol, and the suffix is an empty string
dsenames['SYM_ROOT'] = dsenames['TSYMBOL']
dsenames['SYM_SUFFIX'] = ""
Here, we create indices for the column names in the data that we will be using to identify securities with suffixes. Using indices rather than the column names allows us to utilize the iloc function in pandas to subset the dataframe.
In [12]:
# We're about to do major surgery, make a copy!
orig_dsenames = dsenames.copy()
In [13]:
permno_index = list(dsenames.columns).index('PERMNO')
primexch_index = list(dsenames.columns).index('PRIMEXCH')
last_index = list(dsenames.columns).index('LAST')
last2_index = list(dsenames.columns).index('LAST2')
shrcls_index = list(dsenames.columns).index('SHRCLS')
len_index = list(dsenames.columns).index('SYMBOL_LEN')
len_tick_index = list(dsenames.columns).index('TICKER_LEN')
flag_index = list(dsenames.columns).index('FLAG')
sym_root_index = list(dsenames.columns).index('SYM_ROOT')
sym_suffix_index = list(dsenames.columns).index('SYM_SUFFIX')
tsymbol_index = list(dsenames.columns).index('TSYMBOL')
In [14]:
dsenames.reset_index(drop = True).head()
Out[14]:
In [15]:
# Now begins the subsetting based on cases of symbol suffixes
# Since we already have our nasdaq variable from earlier, we can create a few more
# booleans to help us conduct the subsetting in a vectorized way
In [16]:
# This first boolean vector will be True for a ticker symbol if its last character is the same as its share class
class_equal_last = dsenames.SHRCLS == dsenames.LAST
In [17]:
# This boolean will be True for all NASDAQ securities with a ticker symbol longer than 4 characters
## 4 is the maximum number of characters for a ticker symbol on the NASDAQ
nasdaq_long = nasdaq & (dsenames.SYMBOL_LEN > 4)
In [18]:
# The first flag will denote securities with ticker symbols that have share class suffixes
## e.g. a company on the NASDAQ that has Class A and Class B shares
flag1 = nasdaq_long & class_equal_last
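As a sketch of how flag1 fires (using hypothetical tickers and share classes, not CRSP records):

```python
import pandas as pd

# 'ACMEA' is a 5-character NASDAQ symbol whose last letter matches
# its share class; 'ACMEY' has no share class; 'IBM' is not on NASDAQ
toy = pd.DataFrame({
    "TSYMBOL":  ["ACMEA", "ACMEY", "IBM"],
    "SHRCLS":   ["A",     "",      ""],
    "PRIMEXCH": ["Q",     "Q",     "N"],
})
toy["SYMBOL_LEN"] = toy["TSYMBOL"].str.len()
toy["LAST"] = toy["TSYMBOL"].str[-1]
nasdaq = toy.PRIMEXCH == "Q"
nasdaq_long = nasdaq & (toy.SYMBOL_LEN > 4)
class_equal_last = toy.SHRCLS == toy.LAST
flag1 = nasdaq_long & class_equal_last
print(toy.loc[flag1, "TSYMBOL"].tolist())  # ['ACMEA']
```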
In [19]:
# The second flag will denote two different special cases
## Suffixes Y and F denote shares in foreign companies
## Suffixes J and K denote voting and non-voting shares, respectively
flag2 = ~flag1 & nasdaq_long & dsenames.LAST.isin(["Y", "J", "F", "K"])
In [20]:
# The third flag will denote (NASDAQ) stocks that are currently going through a reverse stock split
## These securities usually keep this ticker symbol for about three weeks after the reverse stock split takes place
flag3 = ~flag1 & nasdaq_long & (dsenames.LAST == "D")
In [21]:
# The fourth flag will denote non-NASDAQ stocks that have share class suffixes
## For non-NASDAQ exchanges (e.g. the New York Stock Exchange),
## ticker symbols without suffixes are 3 characters or less
flag4 = ~nasdaq & (dsenames.SYMBOL_LEN > 3) & class_equal_last
In [22]:
# There is a fifth set of suffixed ticker symbols that do not fit into the above categories,
## but they do have a unifying manual adjustment
## We denote this set as funny
funny_permnos = [85254,29938,29946,93093,92118,83275,82924,82932,77158,46950,90655]
funny = dsenames.PERMNO.isin(funny_permnos) & (dsenames.SYMBOL_LEN - dsenames.TICKER_LEN == 1) & \
dsenames.LAST.isin(["A","B","C","S"])
In [23]:
# Here, we assign the flags to each special case, as described above
dsenames.loc[flag4, "FLAG"] = 4
dsenames.loc[flag3, "FLAG"] = 3
dsenames.loc[flag2, "FLAG"] = 2
dsenames.loc[flag1, "FLAG"] = 1
In [24]:
# Here, we group together the suffixed symbols to make the final root-suffix separation cleaner
## sym5_with_suffix is the set of special cases with more than 4 characters in the symbol
sym5_with_suffix = flag1 | flag2 | flag3
symbol_with_suffix = flag4 | funny | sym5_with_suffix
In [25]:
# Finally, the big enchilada: the separation of each ticker symbol into its root and its suffix
## Since we are only dealing with suffixes of length 1, the root will consist of all but the last character,
## and the suffix will be the ticker symbol's last character
dsenames.loc[symbol_with_suffix, "SYM_ROOT"] = dsenames.loc[symbol_with_suffix, "TSYMBOL"].str[0:-1]
dsenames.loc[symbol_with_suffix, "SYM_SUFFIX"] = dsenames.loc[symbol_with_suffix, "TSYMBOL"].str[-1]
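The same split can be sketched in miniature (made-up symbols; with_suffix plays the role of symbol_with_suffix):

```python
import pandas as pd

toy = pd.DataFrame({"TSYMBOL": ["ACMEA", "IBM"]})
with_suffix = pd.Series([True, False])  # pretend only 'ACMEA' is suffixed
# Base case: root is the whole symbol, suffix is empty
toy["SYM_ROOT"] = toy["TSYMBOL"]
toy["SYM_SUFFIX"] = ""
# For suffixed symbols, peel off the last character
toy.loc[with_suffix, "SYM_ROOT"] = toy.loc[with_suffix, "TSYMBOL"].str[:-1]
toy.loc[with_suffix, "SYM_SUFFIX"] = toy.loc[with_suffix, "TSYMBOL"].str[-1]
print(toy[["SYM_ROOT", "SYM_SUFFIX"]].values.tolist())  # [['ACME', 'A'], ['IBM', '']]
```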
In [26]:
# There were a few wonky observations, so we do some additional manual adjustments
dsenames.loc[dsenames.PERMNO == 14461, "SYM_ROOT"] = dsenames.loc[dsenames.PERMNO == 14461, "TSYMBOL"].str[0:-1]
dsenames.loc[dsenames.PERMNO == 14461, "SYM_SUFFIX"] = dsenames.loc[dsenames.PERMNO == 14461, "TSYMBOL"].str[-1]
dsenames.loc[dsenames.PERMNO == 13914, "SYM_ROOT"] = dsenames.loc[dsenames.PERMNO == 13914, "TSYMBOL"]
dsenames.loc[dsenames.PERMNO == 13914, "SYM_SUFFIX"] = ""
dsenames.loc[dsenames.PERMNO == 92895, "SYM_ROOT"] = "SAPX"
dsenames.loc[dsenames.PERMNO == 92895, "SYM_SUFFIX"] = ""
In [27]:
# Finally, we want to ensure that, when the same information is recorded, the date range listed for the record reflects
# the entire range over which the security was actively trading
## For instance, if a security stopped trading for a six-month period, it has two entries in this file
## We want both of those entries to include the beginning date of the security's trading before the six-month break
## and the end date of the security's trading after the six-month break
# To do this, we first want to reset the index in the dsenames dataframe
dsenames = dsenames.reset_index(drop = True)
In [28]:
# When we say that we want to adjust the dates 'when the same information is recorded,' we make that adjustment based
# on the following seven variables in the data frame: PERMNO, the two components of the ticker symbol, the name of
# the company, the CUSIP number (current and historical), and the primary exchange on which the security trades
# We first create a new data frame sorted on these 7 columns, which only includes said 7 columns
levels_sort = ["PERMNO", 'SYM_ROOT', 'SYM_SUFFIX', 'COMNAM', 'CUSIP', 'NCUSIP', 'PRIMEXCH']
dsenames_sort = dsenames.sort_values(by = levels_sort).loc[:, levels_sort]
dsenames = dsenames.sort_values(by = levels_sort)
In [29]:
# We create two new variables, begdate and enddate, to capture the full range of dates for which each security trades
## The default case, when a security only matches with itself based on the 7 sort levels, is that the beginning date
## is the same as the beginning effective name date, and the end date is the same as the end effective name date
dsenames['BEGDATE'] = dsenames.NAMEDT
dsenames['ENDDATE'] = dsenames.NAMEENDT
In [41]:
# We create a new dataframe that only includes the sort variables
dsenames_sort_squish = dsenames_sort.loc[:, levels_sort]
In [42]:
# Here, we create two copies of the dataframe:
## One without the first record, and one without the last
dsenames_nofirst = dsenames_sort_squish.iloc[1:].reset_index(drop = True)
dsenames_nolast = dsenames_sort_squish.iloc[:-1].reset_index(drop = True)
In [43]:
# We then create a boolean matrix based on whether the entries of each matrix match
compare_matrix = (dsenames_nofirst == dsenames_nolast)
In [44]:
# If the i-th record matches the next record for all 7 variables, then the i-th row of the compare matrix will
# be all true. We extract the index for subsetting purposes
same_as_below = compare_matrix.all(axis = 1)
same_as_below_index = same_as_below.index[same_as_below]
# In order to collapse the end dates, we will also need an index to indicate if a record is the same as the one above
## This is simply captured by adding 1 to the first index we found
same_as_above_index = same_as_below_index + 1
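The shift-and-compare trick can be seen on a toy key column (one variable here stands in for the 7 sort levels):

```python
import pandas as pd

# Rows 0-1 share a key, as do rows 3-4
keys = pd.DataFrame({"PERMNO": [1, 1, 2, 3, 3]})
nofirst = keys.iloc[1:].reset_index(drop=True)
nolast = keys.iloc[:-1].reset_index(drop=True)
# Position i is True when row i matches row i+1 of the original frame
same_as_below = (nofirst == nolast).all(axis=1)
same_as_below_index = same_as_below.index[same_as_below]
same_as_above_index = same_as_below_index + 1
print(list(same_as_below_index), list(same_as_above_index))  # [0, 3] [1, 4]
```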
In [45]:
# Finally, we loop through same_as_above_index to bring the earliest BEGDATE for a record down
# to all of its matches
## Doing this matching iteratively mitigates the issue of a particular security having more than 2 records match
## based on the 7 variables
begdate_col = dsenames.columns.get_loc('BEGDATE')
for i in same_as_above_index:
    # iloc is positional, matching the positions derived from the shifted comparison
    dsenames.iloc[i, begdate_col] = dsenames.iloc[i-1, begdate_col]
In [46]:
# Similar logic is used to bring the latest ENDDATE up - we just do it backwards
enddate_col = dsenames.columns.get_loc('ENDDATE')
for i in same_as_below_index[::-1]:
    dsenames.iloc[i, enddate_col] = dsenames.iloc[i+1, enddate_col]
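Both passes can be sketched on a toy frame where all three rows are assumed to share the same 7 key variables:

```python
import pandas as pd

dates = pd.DataFrame({
    "BEGDATE": [20100101, 20110101, 20120101],
    "ENDDATE": [20101231, 20111231, 20121231],
})
# Rows 0-1 and 1-2 match, so:
same_as_below_index = pd.Index([0, 1])
same_as_above_index = same_as_below_index + 1
# Forward pass: earliest BEGDATE flows down through the matches
for i in same_as_above_index:
    dates.loc[i, "BEGDATE"] = dates.loc[i - 1, "BEGDATE"]
# Backward pass: latest ENDDATE flows up through the matches
for i in same_as_below_index[::-1]:
    dates.loc[i, "ENDDATE"] = dates.loc[i + 1, "ENDDATE"]
print(dates["BEGDATE"].tolist())  # [20100101, 20100101, 20100101]
print(dates["ENDDATE"].tolist())  # [20121231, 20121231, 20121231]
```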
In [47]:
# Finally, we output a final dataframe that includes only the columns we sorted on and our new date variables
# Since matching records now carry the same information, we drop the duplicates
final_columns = levels_sort + ['BEGDATE', 'ENDDATE']
clean_dsenames = dsenames.drop_duplicates(subset = final_columns).loc[:, final_columns]
In [49]:
clean_dsenames.head(50)
Out[49]:
In [ ]: