In this file, we prepare the first of a few CRSP files we will be using in the TAQ/CRSP merge. The main goals are to: 1) select the subset of securities in the original dsenames.csv file that we want to examine; 2) separate some ticker symbols into two parts, a symbol root and a symbol suffix, and 3) extract the time period during which a security was traded.

In order to work with the CRSP data, we will be using the pandas library, a rather intuitive Python tool for working with panel data.


In [2]:
import pandas as pd
import numpy as np

dsenames = pd.read_csv('../crsp/dsenames.csv')

The first subsetting to be done is by time period. For Justin's purposes, we are just looking at securities that traded from 2010 onwards.


In [3]:
dsenames.head(10)


Out[3]:
PERMNO NAMEDT NAMEENDT SHRCD EXCHCD SICCD NCUSIP TICKER COMNAM SHRCLS ... NAICS PRIMEXCH TRDSTAT SECSTAT PERMCO COMPNO ISSUNO HEXCD HSICCD CUSIP
0 10000 19860107 19861203 10 3 3990 68391610 OMFGA OPTIMUM MANUFACTURING INC A ... NaN Q A R 7952 60007905 10396 3 3990 68391610
1 10000 19861204 19870309 10 3 3990 68391610 OMFGA OPTIMUM MANUFACTURING INC A ... NaN Q A R 7952 60007905 10396 3 3990 68391610
2 10000 19870310 19870611 10 3 3990 68391610 OMFGA OPTIMUM MANUFACTURING INC A ... NaN Q A R 7952 60007905 10396 3 3990 68391610
3 10001 19860109 19931121 11 3 4920 39040610 GFGC GREAT FALLS GAS CO NaN ... NaN Q A R 7953 60007906 10398 2 4925 36720410
4 10001 19931122 20040609 11 3 4920 29274A10 EWST ENERGY WEST INC NaN ... NaN Q A R 7953 60007906 10398 2 4925 36720410
5 10001 20040610 20041018 11 3 4920 29274A10 EWST ENERGY WEST INC NaN ... 221210 Q A R 7953 60007906 10398 2 4925 36720410
6 10001 20041019 20041226 11 3 4920 29274A10 EWST ENERGY WEST INC NaN ... 221210 Q A R 7953 60007906 10398 2 4925 36720410
7 10001 20041227 20080204 11 3 4920 29274A10 EWST ENERGY WEST INC NaN ... 221210 Q A R 7953 60007906 10398 2 4925 36720410
8 10001 20080205 20080304 11 3 4920 29274A20 EWST ENERGY WEST INC NaN ... 221210 Q A R 7953 60007906 10398 2 4925 36720410
9 10001 20080305 20090803 11 3 4920 29274A20 EWST ENERGY WEST INC NaN ... 221210 Q A R 7953 60007906 10398 2 4925 36720410

10 rows × 21 columns


In [4]:
dsenames = dsenames[dsenames['NAMEENDT']>=20100101]

We are also interested in looking only at securities that are being actively traded during the period in question.


In [5]:
# SECSTAT == "R" indicates that the security is "Regular" (i.e. the security is past the "When-Issued" stage
# and the company is not going through bankruptcy proceedings)
dsenames = dsenames[dsenames['SECSTAT']=="R"]
# TRDSTAT == "A" indicates that the security is actively trading
dsenames = dsenames[dsenames['TRDSTAT']=="A"]

Here, two new variables are created to keep track of the length in the ticker symbol. The tsymbol variable tracks the trading ticker symbol on the security's primary exchange, while the ticker variable, for some stocks, holds a bit more information (e.g. the exchange on which a security is traded). These variables are vital for identifying securities that have a suffix to their trading symbols.


In [6]:
dsenames['SYMBOL_LEN'] = dsenames['TSYMBOL'].str.len()
dsenames['TICKER_LEN'] = dsenames['TICKER'].str.len()
# There are 239 cases where the length of tsymbol does not match up with the length of ticker
# For most of these cases, it is because tsymbol includes a suffix of some sort, while ticker does not

These next two variables, last and last2, store the last (or last two) letters in a company's ticker symbol.


In [7]:
dsenames['LAST'] = dsenames['TSYMBOL'].str[-1]
# This grabs the last two elements of tsymbol
dsenames['LAST2'] = dsenames['TSYMBOL'].str[-2:]

Justin excludes the first two records for being apparently bogus, and the third for disagree substantially with the TAQ data.


In [8]:
bad_permnos = [14209, 15141, 91845]
# Here, we index out any securities whose records don't make sense
dsenames = dsenames[~dsenames['PERMNO'].isin(bad_permnos)]

The following two blocks allow us to remove two types of stock that we want to exclude from analysis: beneficial and when-issued securities.


In [9]:
# The 'S' suffix indicates the "Shares of Beneficial Interest, which do not trade like regular securities
# The 'V' and 'WI' suffixes indicate "When-Issued" shares, which have been authorized to trade, but have not actually begun trading
nasdaq = dsenames.PRIMEXCH == "Q"
beneficial = (dsenames['LAST'] == "S") & (dsenames['SYMBOL_LEN'] == 5)
whenissued_nasdaq = (dsenames['LAST'] == "V") & (dsenames['SYMBOL_LEN'] == 5) & nasdaq
whenissued_nonnasdaq = (dsenames['LAST2'] == "WI") & (dsenames['SYMBOL_LEN'] > 3) & ~nasdaq

In [10]:
# Here, we take out the BoWI shares from bot NASDAQ and non-NASDAQ securities
dsenames = dsenames[~(beneficial & whenissued_nasdaq & whenissued_nonnasdaq)]

Next, I create a few new variables that will be included in the clean dsenames file. The sym_root and sym_suffix variables will hold the two parts of the security's ticker symbol. If the security has a special suffix that indicates something special about the security (for example, a security may have two classes of share, A and B, which are stored in the ticker symbol variable), it will be stored in sym_suffix. If a security does not have a suffix, sym_suffix will be an empty string.


In [11]:
# Flag = 0 is our base case (i.e. the ticker symbol has no suffix)
dsenames['FLAG'] = 0
# When the ticker has no suffix, the root is just the ticker symbol, and the suffix is an empty string
dsenames['SYM_ROOT'] = dsenames['TSYMBOL']
dsenames['SYM_SUFFIX'] = ""

Here, we create indices for the column names in the data that we will be using to identify securities with suffixes. Using indices rather than the column names allows us to utilize the iloc function in pandas to subset the dataframe.


In [12]:
# We're about to do major surgery, make a copy!
orig_dsenames = dsenames.copy()

In [13]:
permno_index = list(dsenames.columns).index('PERMNO')
primexch_index = list(dsenames.columns).index('PRIMEXCH')
last_index = list(dsenames.columns).index('LAST')
last2_index = list(dsenames.columns).index('LAST2')
shrcls_index = list(dsenames.columns).index('SHRCLS')
len_index = list(dsenames.columns).index('SYMBOL_LEN')
len_tick_index = list(dsenames.columns).index('TICKER_LEN')
flag_index = list(dsenames.columns).index('FLAG')
sym_root_index = list(dsenames.columns).index('SYM_ROOT')
sym_suffix_index = list(dsenames.columns).index('SYM_SUFFIX')
tsymbol_index = list(dsenames.columns).index('TSYMBOL')

In [14]:
dsenames.reset_index(drop = True).head()


Out[14]:
PERMNO NAMEDT NAMEENDT SHRCD EXCHCD SICCD NCUSIP TICKER COMNAM SHRCLS ... HEXCD HSICCD CUSIP SYMBOL_LEN TICKER_LEN LAST LAST2 FLAG SYM_ROOT SYM_SUFFIX
0 10001 20091218 20100708 11 2 4925 29269V10 EGAS ENERGY INC NaN ... 2 4925 36720410 4 4 S AS 0 EGAS
1 10001 20100709 20141231 11 2 4925 36720410 EGAS GAS NATURAL INC NaN ... 2 4925 36720410 4 4 S AS 0 EGAS
2 10002 20060630 20130215 11 3 6020 05978R10 BTFG BANCTRUST FINANCIAL GROUP INC NaN ... 3 6020 05978R10 4 4 G FG 0 BTFG
3 10025 20090130 20141231 11 3 3081 00103110 AEPI A E P INDUSTRIES INC NaN ... 3 3081 00103110 4 4 I PI 0 AEPI
4 10026 20040610 20141231 11 3 2052 46603210 JJSF J & J SNACK FOODS CORP NaN ... 3 2052 46603210 4 4 F SF 0 JJSF

5 rows × 28 columns


In [15]:
# Now begins the subsetting based on cases of symbol suffixes

# Since we already have our nasdaq variable from earlier, we can create a few more
# booleans to help us conduct the subsetting in a vectorized way

In [16]:
# This first boolean vector will be True for a ticker symbol if its last character is the same as its share class
class_equal_last = dsenames.SHRCLS == dsenames.LAST

In [17]:
# This boolean will be True for all NASDAQ securities with a ticker symbol longer than 4 characters
## 4 is the maximum number of characters for a ticker symbol on the NASDAQ
nasdaq_long = nasdaq & (dsenames.SYMBOL_LEN > 4)

In [18]:
# The first flag will denote securities with ticker symbols that have share class suffixes
## e.g. a company on the NASDAQ that has Class A and Class B shares

flag1 = nasdaq_long & class_equal_last

In [19]:
# The second flag will denote two different special cases
## Suffixes Y and F denote shares in foreign companies
## Suffixes J and K denote voting and non-voting shares, respectively
flag2 = ~flag1 & nasdaq_long & dsenames.LAST.isin(["Y", "J", "F", "K"])

In [20]:
# The third flag will denote (NASDAQ) stocks that are currently going through a reverse stock split
## These securities usually keep this ticker symbol for about three weeks after the reverse stock split takes place
flag3 = ~flag1 & nasdaq_long & (dsenames.LAST == "D")

In [21]:
# The fourth flag will denote non-NASDAQ stocks that have share class suffixes
## For non-NASDAQ exchanges (e.g. the New York Stock Exchange), 
## ticker symbols without suffixes are 3 characters or less
flag4 = ~nasdaq & (dsenames.SYMBOL_LEN > 3) & class_equal_last

In [22]:
# There is a fifth set of suffixed ticker symbols that do not fit into the above categories,
## but they do have a unifying manual adjustment
## We denote this set as funny

funny_permnos = [85254,29938,29946,93093,92118,83275,82924,82932,77158,46950,90655]

funny = dsenames.PERMNO.isin(funny_permnos) & (dsenames.SYMBOL_LEN - dsenames.TICKER_LEN == 1) & \
            dsenames.LAST.isin(["A","B","C","S"])

In [23]:
# Here, we assign the flags to each special case, as described above

dsenames.loc[flag4, "FLAG"] = 4
dsenames.loc[flag3, "FLAG"] = 3
dsenames.loc[flag2, "FLAG"] = 2
dsenames.loc[flag1, "FLAG"] = 1

In [24]:
# Here, we group together the symboled suffixes to make the final root-suffix separation cleaner
## sym5_with_suffix is the set of special cases with more than 4 characters in the symbol
sym5_with_suffix = flag1 | flag2 | flag3
symbol_with_suffix = flag4 | funny | sym5_with_suffix

In [25]:
# Finally, the big enchilada, the separation of each ticker symbol into its root and its symbol
## Since we are only dealing with suffixes of length 1, the root will consist of all but the last character,
## and the root will be the ticker symbol's last character

dsenames.loc[symbol_with_suffix, "SYM_ROOT"] = dsenames.loc[symbol_with_suffix, "TSYMBOL"].str[0:-1]
dsenames.loc[symbol_with_suffix, "SYM_SUFFIX"] = dsenames.loc[symbol_with_suffix, "TSYMBOL"].str[-1]

In [26]:
# There were a few wonky observations, so we do some additional manual adjustments

dsenames.loc[dsenames.PERMNO == 14461, "SYM_ROOT"] = dsenames.loc[dsenames.PERMNO == 14461, "TSYMBOL"].str[0:-1]
dsenames.loc[dsenames.PERMNO == 14461, "SYM_SUFFIX"] = dsenames.loc[dsenames.PERMNO == 14461, "TSYMBOL"].str[-1]
dsenames.loc[dsenames.PERMNO == 13914, "SYM_ROOT"] = dsenames.loc[dsenames.PERMNO == 13914, "TSYMBOL"]
dsenames.loc[dsenames.PERMNO == 13914, "SYM_SUFFIX"] = ""
dsenames.loc[dsenames.PERMNO == 92895, "SYM_ROOT"] = "SAPX"
dsenames.loc[dsenames.PERMNO == 92895, "SYM_SUFFIX"] = ""

In [27]:
# Finally, we want to ensure that, when the same information is recorded, the date range listed for the record reflects
# the entire range over which the security was actively trading
## For instance, if a security stopped trading for a six month period, it has two entries in this file
## We want both of those entries to include beginning date for the security's trading before the six month break
## and the end date for the security's trading after the six month break

# To do this, we first want to reset the index in the dsenames dataframe
dsenames = dsenames.reset_index(drop = True)

In [28]:
# When we say that we want to adjust the dates 'when the same information is recorded,' we make that adjustment based
# on the following seven variables in the data frame: Permno, the two components of the ticker symbol, the name of 
#the company the CUSIP number (current and historical), and the primary exchange on which the security trades

# We first create a new data frame sorted on these 7 columns, which only includes said 7 columns

levels_sort = ["PERMNO", 'SYM_ROOT', 'SYM_SUFFIX', 'COMNAM', 'CUSIP', 'NCUSIP', 'PRIMEXCH']
dsenames_sort = dsenames.sort_values(by = levels_sort).loc[:, levels_sort]
dsenames = dsenames.sort_values(by = levels_sort)

In [29]:
# We create two new variables, begdate and enddate, to capture the full range of dates for which each security trades
## The default case, when a security only matches with itself based on the 7 sort levels, is that the beginning date 
## is the same as the beginning effective name date, and the end date is the same as the end effective name date

dsenames['BEGDATE'] = dsenames.NAMEDT
dsenames['ENDDATE'] = dsenames.NAMEENDT

In [41]:
# We create a new dataframe that only includes the sort variables
dsenames_sort_squish = dsenames_sort.loc[:, levels_sort]

In [42]:
# Here, we create two copies of the dataframe:
## One without the first record, and one without the last

dsenames_nofirst = dsenames_sort_squish.iloc[1:].reset_index(drop = True)
dsenames_nolast = dsenames_sort_squish.iloc[:-1].reset_index(drop = True)

In [43]:
# We then create a boolean matrix based on whether the entries of each matrix match
compare_matrix = (dsenames_nofirst == dsenames_nolast)

In [44]:
# If the i-th record matches the next record for all 7 variables, then the i-th row of the compare matrix will
# be all true. We extract the index for subsetting purposes
same_as_below = compare_matrix.all(axis = 1)
same_as_below_index = same_as_below.index[same_as_below]

# In order to collapse the end dates, we will also need an index to indicate if a record is the same as the one above
## This is simply caputured by adding 1 to the first index we found
same_as_above_index = same_as_below_index + 1

In [45]:
# Finally, we loop through the first Int64Index we found to bring the earliest BEGDATE for a record down
# to all of its matches
## Doing this matching iteratively mitigates the issue of a particular security having more than 2 records match
## based on the 7 variables
for i in same_as_above_index:
    dsenames['BEGDATE'].iat[i] = dsenames['BEGDATE'].iat[i-1]

In [46]:
# Similar logic is used to bring the latest ENDDATE up - we just do it backwards
for i in same_as_below_index[::-1]:
    dsenames['ENDDATE'].iat[i] = dsenames['ENDDATE'].iat[i+1]

In [47]:
# Finally, we output a final dataframe that includes only the columns we sorted on and our new date variables
# Since the same information is recorded for these files now, we drop the duplicates
final_columns = levels_sort + ['BEGDATE', 'ENDDATE']
clean_dsenames = dsenames.drop_duplicates(subset = final_columns).loc[:, final_columns]

In [49]:
clean_dsenames.head(50)


Out[49]:
PERMNO SYM_ROOT SYM_SUFFIX COMNAM CUSIP NCUSIP PRIMEXCH BEGDATE ENDDATE
0 10001 EGAS ENERGY INC 36720410 29269V10 A 20091218 20100708
1 10001 EGAS GAS NATURAL INC 36720410 36720410 A 20100709 20141231
2 10002 BTFG BANCTRUST FINANCIAL GROUP INC 05978R10 05978R10 Q 20060630 20130215
3 10025 AEPI A E P INDUSTRIES INC 00103110 00103110 Q 20090130 20141231
4 10026 JJSF J & J SNACK FOODS CORP 46603210 46603210 Q 20040610 20141231
5 10028 DGSE D G S E COMPANIES INC 23323G10 23323G10 A 20090817 20141231
7 10032 PLXS PLEXUS CORP 72913210 72913210 Q 20040610 20141231
8 10044 RMCF ROCKY MOUNTAIN CHOCOLATE FAC INC 77467840 77467840 Q 20040610 20141231
11 10051 HGR HANGER INC 41043F20 41043F20 N 20120614 20141231
9 10051 HGR HANGER ORTHOPEDIC GROUP INC 41043F20 41043F20 N 20040610 20120613
12 10065 ADX ADAMS EXPRESS CO 00621210 00621210 N 20040610 20141231
13 10078 JAVA SUN MICROSYSTEMS INC 86681020 86681020 Q 20071211 20100126
14 10092 KDE 4 KIDS ENTERTAINMENT INC 35086510 35086510 N 20060630 20100528
16 10100 ALRN AMERICAN LEARNING CORP NY 02713510 02713510 Q 20100323 20130322
15 10100 AMCE AMERICAN CLAIMS EVALUATION INC 02713510 02514410 Q 20060630 20100322
18 10104 ORCL ORACLE CORP 68389X10 68389X10 N 20130715 20141231
17 10104 ORCL ORACLE CORP 68389X10 68389X10 Q 20040610 20130714
19 10107 MSFT MICROSOFT CORP 59491810 59491810 Q 20040610 20141231
20 10113 AADR ADVISORSHARES TRUST 00768Y20 00768Y20 R 20100721 20141231
21 10116 ACUR ACURA PHARMACEUTICALS INC 00509L70 00509L70 Q 20080204 20141231
22 10137 AYE ALLEGHENY ENERGY INC 01736110 01736110 N 20040610 20110225
23 10138 TROW T ROWE PRICE GROUP INC 74144T10 74144T10 Q 20060630 20141231
24 10143 OSIP O S I PHARMACEUTICALS INC 67104010 67104010 Q 20040610 20100608
25 10145 HON HONEYWELL INTERNATIONAL INC 43851610 43851610 N 20040610 20141231
27 10147 EMC E M C CORP MA 26864810 26864810 N 20040610 20141231
29 10151 AGLS ADVISORSHARES TRUST 00768Y40 00768Y40 R 20111201 20141231
28 10151 GRV ADVISORSHARES TRUST 00768Y40 00768Y40 R 20100709 20111130
30 10158 AMRC AMERESCO INC 02361E10 02361E10 N 20100722 20141231
32 10163 XRIT X RITE INC 98385710 98385710 Q 20060630 20120514
34 10180 AKRX AKORN INC 00972810 00972810 Q 20070207 20141231
35 10182 TECD TECH DATA CORP 87823710 87823710 Q 20060824 20141231
36 10200 RGEN REPLIGEN CORP 75991610 75991610 Q 20040610 20141231
37 10201 AT ATLANTIC POWER CORP 04878Q86 04878Q86 N 20100723 20141231
38 10205 LSBX L S B CORP 50215P10 50215P10 Q 20040610 20101130
39 10207 FUND ROYCE FOCUS TRUST INC 78080N10 78080N10 Q 20040610 20141231
40 10220 BWC BABCOCK & WILCOX CO NEW 05615F10 05615F10 N 20100802 20141231
42 10223 CIS CAMELOT INFORMATION SYSTEMS INC 13322V10 13322V10 N 20100721 20140327
44 10225 BEAM BEAM INC 07373010 07373010 N 20111004 20140430
43 10225 FO FORTUNE BRANDS INC 07373010 34963110 N 20040610 20111003
47 10232 ESBK ELMIRA SAVINGS BANK FSB NY 28966010 28966010 Q 20060630 20120604
48 10232 ESBK ELMIRA SAVINGS BANK NY 28966010 28966010 Q 20120605 20141231
49 10239 BWIN B BALDWIN & LYONS INC 05775520 05775520 Q 20040610 20141231
50 10252 INDB INDEPENDENT BANK CORP MA 45383610 45383610 Q 20040610 20141231
51 10253 DMC DOCUMENT SECURITY SYSTEMS INC 25614T10 25614T10 A 20060630 20110710
53 10253 DSS DOCUMENT SECURITY SYSTEMS INC 25614T10 25614T10 A 20110711 20141231
56 10257 DLHC D L H HOLDINGS CORP 23335Q10 23335Q10 Q 20120627 20141231
55 10257 TSTF TEAMSTAFF INC 23335Q10 87815U30 Q 20080519 20120626
57 10258 CLDX CELLDEX THERAPEUTICS INC 15117B10 15117B10 Q 20081001 20141231
58 10259 SIGM SIGMA DESIGNS INC 82656510 82656510 Q 20060630 20141231
59 10272 TKLC TEKELEC 87910110 87910110 Q 20060714 20120127

In [ ]: