In [1]:
import pandas as pd
import matplotlib as mpl
import datetime as dt
%matplotlib inline


/Users/johntemplon/.virtualenvs/data-analysis/lib/python2.7/site-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))

Load Data:

The data is split across two CSVs. Load both, clean up the header names, and concatenate them into a single data frame:


In [2]:
f1 = "../data/FTC_2011-43014.csv"
f2 = "../data/FTC_5114-71014.csv"

In [3]:
column_names=["Reference Number", "Created Date", "Consumer Address, City", "Consumer Address, State Code", "Company Name", 
              "Company Phone Area Code", "Company Phone, Number", "Company Phone, Extension", 
              "Complaint Info Initial Contact Method", "Complaint Info Product Service Code"]

In [4]:
df_1 = pd.io.parsers.read_csv(f1, names=column_names, index_col=0, skiprows=1)
df_2 = pd.io.parsers.read_csv(f2, names=column_names, index_col=0, skiprows=1)

In [5]:
df = pd.concat([df_1, df_2])

In [6]:
df.head()


Out[6]:
Created Date Consumer Address, City Consumer Address, State Code Company Name Company Phone Area Code Company Phone, Number Company Phone, Extension Complaint Info Initial Contact Method Complaint Info Product Service Code
Reference Number
28741989 01/04/2011 Waianae HI IRS NaN NaN NaN Phone 2143|9043
28803726 01/10/2011 Marietta GA Georgia State IRS 770 9448020 NaN Phone 9043
28827558 01/12/2011 Fort Worth TX IRS 872 4796242 NaN Phone 2142|9043
28842181 01/13/2011 Orangeburg SC IRS 800 8297650 NaN Phone 9043
28890200 01/19/2011 Morton MN IRS|Readers Digest 876| 4355471| | Phone 4190|9043

Clean Up Created Dates and State Names:

Convert the created dates to datetime objects and clean up the state names. A few of the columns sometimes contain | to delimit multiple options selected on a complaint; those delimiters need to be stripped. The state-cleaning function assumes that the first non-empty state listed is the correct one.

Converting to a datetime allows us to see the huge uptick in IRS scam complaints to the FTC over time.


In [7]:
def clean_state_names(state_name):
    """Return the first non-empty state code, dropping any | delimiters."""
    state_name = str(state_name)
    if "|" in state_name:
        pieces = state_name.split("|")
        if pieces[0] != "":
            state = pieces[0]
        elif pieces[1] != "":
            state = pieces[1]
        else:
            state = None
    else:
        state = state_name
    return state
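The take-the-first-non-empty-piece logic can be checked in isolation. A minimal sketch (the sample values below are illustrative, not taken from the data; unlike the notebook's version, this checks every piece rather than just the first two):

```python
def first_nonempty_state(value):
    # Split on the | delimiter and return the first non-empty piece, if any.
    for piece in str(value).split("|"):
        if piece:
            return piece
    return None

print(first_nonempty_state("HI|"))  # HI
print(first_nonempty_state("|TX"))  # TX
print(first_nonempty_state("GA"))   # GA
```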

In [8]:
df["clean_state"] = df["Consumer Address, State Code"].apply(lambda x: clean_state_names(x))

In [9]:
df["created_date"] = df["Created Date"].apply(lambda x: dt.datetime.strptime(str(x), "%m/%d/%Y") if not pd.isnull(x) else None)

In [10]:
df.dropna(subset=["created_date"]).set_index("created_date")["Company Name"].resample("M", how="count").plot()


Out[10]:
<matplotlib.axes.AxesSubplot at 0x1106c83d0>
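The how= argument to resample was removed in later pandas releases; an equivalent monthly count can be sketched with a groupby on monthly periods (the dates and values below are placeholders, not the actual complaint data):

```python
import pandas as pd

# A tiny stand-in series indexed by complaint date.
idx = pd.to_datetime(["2011-01-04", "2011-01-10", "2011-02-01"])
s = pd.Series(1, index=idx)

# Version-stable equivalent of resample("M", how="count"):
# bucket the index into monthly periods, then count per bucket.
monthly = s.groupby(s.index.to_period("M")).count()
print(monthly.tolist())  # [2, 1]
```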

Where Are The Complaints Coming From?

The majority of the scam complaints arrived by phone.

Phone complaints follow a similar pattern to the complaints overall.


In [12]:
df.dropna(subset=["created_date"])['Complaint Info Initial Contact Method'].value_counts()


Out[12]:
9043                       7930
Phone Call: Landline       5094
Phone                      1591
Phone Call: Mobile/Cell    1282
Internet/E-mail             207
Mail                         54
Mobile: Text/Email/IM        38
I Initiated Contact          32
Internet Web Site            24
Unknown                      24
9043|9001                    11
Internet (Other)             11
In Person                    10
9001|9043                     8
9043|0350                     7
0350|9043                     5
9000|9043                     5
0323|9043                     4
Fax                           3
6100|9043                     3
Wireless                      3
9043|9046                     2
9043|2142                     2
9043|0323                     1
TV/Radio                      1
9050|9043                     1
9043|6001                     1
9043|9000                     1
9043|6101                     1
Print                         1
dtype: int64

In [13]:
df["phone"] = df['Complaint Info Initial Contact Method'].apply(lambda x: 1 if ("Phone" in str(x)) else 0)

In [14]:
df.dropna(subset=["created_date"]).set_index("created_date")["phone"].resample("M", how="count").plot()


Out[14]:
<matplotlib.axes.AxesSubplot at 0x110850b50>

Finding the Change Over Time:

The trend in the graph is obviously a huge spike. July appears lower only because the data runs through July 10, 2014. Let's restrict to months with a full set of data and see just how big the change has been over time.


In [15]:
df["month"] = df["created_date"].apply(lambda x: x.month if not pd.isnull(x) else None)
df["year"] = df["created_date"].apply(lambda x: x.year if not pd.isnull(x) else None)

In [16]:
complaints_by_month = pd.DataFrame(df.groupby(["year", "month"])["created_date"].count())

In [17]:
first_six_months_2011 = complaints_by_month.ix[2011].ix[1:6]["created_date"].sum()
first_six_months_2012 = complaints_by_month.ix[2012].ix[1:6]["created_date"].sum()
first_six_months_2013 = complaints_by_month.ix[2013].ix[1:6]["created_date"].sum()
first_six_months_2014 = complaints_by_month.ix[2014].ix[1:6]["created_date"].sum()

In [18]:
print "Pct. Change 2012 vs. 2011: {pct:,.2f}%".format(pct=(first_six_months_2012-first_six_months_2011)/float(first_six_months_2011)*100)
print "Pct. Change 2013 vs. 2012: {pct:,.2f}%".format(pct=(first_six_months_2013-first_six_months_2012)/float(first_six_months_2012)*100)
print "Pct. Change 2014 vs. 2013: {pct:,.2f}%".format(pct=(first_six_months_2014-first_six_months_2013)/float(first_six_months_2013)*100)


Pct. Change 2012 vs. 2011: 5.63%
Pct. Change 2013 vs. 2012: 133.33%
Pct. Change 2014 vs. 2013: 3,529.14%
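The figures above are the standard percent-change formula applied to the first-half totals; as a minimal sketch (the counts here are placeholders, not the actual complaint totals):

```python
def pct_change(new, old):
    # Percent change from old to new.
    return (new - old) / float(old) * 100.0

print("{0:,.2f}%".format(pct_change(120, 100)))  # 20.00%
```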

Where Are These Phone Calls Coming From?


In [19]:
df.dropna(subset=["created_date"])['Company Phone Area Code'].value_counts()


Out[19]:
202    6113
415    1549
585     818
888     674
206     416
530     401
347     368
800     332
866     320
716     305
877     150
844     145
571     134
213     124
917     107
...
202|602|405    1
210|           1
312|           1
|||            1
316            1
251||          1
702|           1
253|202        1
607|202        1
614|516        1
610|732        1
951|325        1
206|206        1
385            1
404|702        1
Length: 503, dtype: int64

In [20]:
top_phone_numbers = pd.DataFrame(df.groupby(['Company Phone Area Code', 'Company Phone, Number'])\
                                 .count()\
                                 .sort("created_date", ascending=False))

In [25]:
top_phone_numbers[:25]


Out[25]:
Created Date Consumer Address, City Consumer Address, State Code Company Name Company Phone, Extension Complaint Info Initial Contact Method Complaint Info Product Service Code clean_state created_date phone month year
Company Phone Area Code Company Phone, Number
585 6331814 165 139 145 165 102 161 58 165 165 165 165 165
6331792 148 108 126 148 55 138 91 148 148 148 148 148
415 2514089 131 108 117 131 62 125 60 131 131 131 131 131
2511037 126 107 115 126 118 126 5 126 126 126 126 126
800 8291040 103 92 94 103 49 101 51 103 103 103 103 103
415 2510522 96 77 84 96 0 92 96 96 96 96 96 96
202 2413122 94 83 86 94 66 94 24 94 94 94 94 94
5069044 89 75 80 89 3 86 86 89 89 89 89 89
2412158 84 64 75 84 78 84 0 84 84 84 84 84
2411435 80 68 76 80 66 80 8 80 80 80 80 80
415 4848569 78 71 72 78 1 75 77 78 78 78 78 78
202 2411375 78 68 72 78 75 78 0 78 78 78 78 78
347 5597963 78 68 68 78 0 75 78 78 78 78 78 78
202 5069319 75 66 68 75 57 75 15 75 75 75 75 75
347 5590378 71 67 69 71 43 69 28 71 71 71 71 71
202 2411818 68 56 56 68 4 60 66 68 68 68 68 68
5068708 67 57 64 67 62 67 0 67 67 67 67 67
2807395 66 57 60 66 0 62 66 66 66 66 66 66
530 2385813 64 55 57 64 2 60 64 64 64 64 64 64
206 4144646 60 50 49 60 39 59 18 60 60 60 60 60
415 2519738 60 55 58 60 54 60 0 60 60 60 60 60
202 5068602 58 51 51 58 56 58 0 58 58 58 58 58
585 4446953 57 45 48 57 51 57 4 57 57 57 57 57
202 5069461 57 45 46 57 55 57 0 57 57 57 57 57
716 2063218 57 53 55 57 7 56 50 57 57 57 57 57

The majority of the phone calls come from the 202 area code, which covers Washington, DC. Digging further into just those phone numbers:


In [26]:
df_202 = df[df["Company Phone Area Code"] == "202"]

In [28]:
df_202.dropna(subset=["created_date"]).set_index("created_date")["phone"].resample("M", how="sum").plot()


Out[28]:
<matplotlib.axes.AxesSubplot at 0x110b2ded0>

In [29]:
complaints_from_202_by_month = pd.DataFrame(df_202.groupby(["year", "month"])\
                                            ["created_date"]\
                                            .count())

In [30]:
complaints_from_202_by_month.head()


Out[30]:
created_date
year month
2011 3 1
4 1
5 2
6 2
7 1

In [31]:
first_six_months_2014_from_202 = complaints_from_202_by_month.ix[2014].ix[1:6]["created_date"].sum()

In [35]:
print "The percentage of phone calls from 202 area during the first six months of 2014 was {0:,.2f}%".format(float(first_six_months_2014_from_202/float(first_six_months_2014)*100))
print "There was just {0} reported scam call from 202 area code in June".format(complaints_from_202_by_month.ix[2014].ix[6]["created_date"])


The percentage of phone calls from 202 area during the first six months of 2014 was 38.87%
There was just 1 reported scam call from 202 area code in June

Where did they go? Looking into the 202 numbers revealed that many of them were magicJack numbers. The company has been aggressively shutting those scam numbers down.


In [36]:
df.dropna(subset=["created_date"])['clean_state'].value_counts()


Out[36]:
CA     2471
nan    1901
NY     1189
FL      965
TX      942
NJ      663
PA      555
VA      545
IL      532
MA      503
GA      437
OH      414
AZ      413
WA      408
MD      360
MI      348
NC      339
IN      287
CO      280
OR      228
NV      223
MN      218
CT      207
WI      203
TN      191
MO      172
SC      145
AL      135
NM      130
KY      129
LA      114
HI      112
DE      102
UT       87
ID       77
WV       74
IA       72
AR       71
MS       70
KS       69
OK       66
DC       65
NH       63
ME       55
NE       53
RI       37
MT       30
WY       18
AK       18
VT       13
SD       12
ND        8
ON        6
PR        3
427       2
430       1
Length: 56, dtype: int64

In [37]:
df.dropna(subset=["created_date"])['Consumer Address, City'].value_counts()


Out[37]:
Las Vegas        130
Houston          112
New York         106
Brooklyn         103
San Jose         100
Los Angeles       95
Phoenix           86
San Diego         83
Chicago           75
San Francisco     71
Charlotte         66
Washington        66
Austin            65
Albuquerque       58
Denver            55
...
Plano|              1
Kettering           1
Radcliff            1
Bogart              1
West Valley City    1
Coon Rapids         1
CHICAGO             1
Sunland             1
Smyrna|             1
Nogales             1
Lauderhill          1
Micco               1
Flagstaff           1
Oakridge            1
Savage              1
Length: 5422, dtype: int64

Normalizing the 2014 Complaints by State Using 2013 Population:


In [38]:
df_2014 = df[df["created_date"] >= dt.datetime(year=2014, month=1, day=1)]
states_2014 = df_2014.dropna(subset=["created_date"])\
              .groupby(['clean_state'])

In [39]:
states_df = pd.DataFrame(states_2014.size().order(ascending=False))

In [40]:
states_df["state_abbrev"] = states_df.index

In [41]:
state_abbrev_to_name = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming',
}

In [42]:
import csv
state_pop_dict = {}
with open('../data/state_pop_2013.csv', 'rU') as f:
    no_heads = f.readlines()[1:]  # drop the header row
    reader = csv.reader(no_heads)
    for row in reader:
        # Strip stray leading \xca bytes (Mac-Roman non-breaking spaces)
        # from the state names and parse the comma-formatted populations.
        state_pop_dict[row[0].lstrip("\xca")] = int(row[1].replace(",", ""))

In [43]:
def find_state_pop(abbrev):
    try:
        full_name = state_abbrev_to_name[abbrev]
    except KeyError:
        full_name = None
    if full_name is not None:
        population = state_pop_dict[full_name]
    else:
        population = None
    return population

In [44]:
states_df["state_pop"] = states_df["state_abbrev"].apply(lambda x: find_state_pop(x))

In [45]:
states_df["complaints_per_100000"] = states_df.apply(lambda x: round(x[0] / float(x["state_pop"]) * 100000, 1), axis=1)

In [46]:
states_df.sort("complaints_per_100000", ascending=False)


Out[46]:
0 state_abbrev state_pop complaints_per_100000
clean_state
DE 83 DE 925749 9.0
DC 50 DC 646449 7.7
NV 197 NV 2790136 7.1
HI 97 HI 1404054 6.9
MA 424 MA 6692824 6.3
NJ 555 NJ 8899339 6.2
VA 469 VA 8260405 5.7
CA 2150 CA 38332521 5.6
AZ 353 AZ 6626624 5.3
NY 999 NY 19651127 5.1
NM 107 NM 2085287 5.1
WA 350 WA 6971406 5.0
OR 197 OR 3930065 5.0
CT 171 CT 3596080 4.8
MD 287 MD 5928814 4.8
CO 241 CO 5268367 4.6
NH 55 NH 1323459 4.2
FL 763 FL 19552860 3.9
ID 61 ID 1612136 3.8
GA 366 GA 9992167 3.7
IN 234 IN 6570902 3.6
IL 446 IL 12882135 3.5
PA 436 PA 12773801 3.4
ME 43 ME 1328302 3.2
MN 176 MN 5420380 3.2
RI 30 RI 1051511 2.9
TX 770 TX 26448193 2.9
MI 290 MI 9895622 2.9
WV 53 WV 1854304 2.9
NC 280 NC 9848060 2.8
OH 320 OH 11570808 2.8
WI 157 WI 5742713 2.7
MT 26 MT 1015165 2.6
MO 149 MO 6044171 2.5
WY 14 WY 582658 2.4
SC 106 SC 4774839 2.2
UT 63 UT 2900872 2.2
KY 97 KY 4395295 2.2
TN 143 TN 6495978 2.2
NE 40 NE 1868516 2.1
AL 97 AL 4833722 2.0
LA 88 LA 4625470 1.9
KS 51 KS 2893957 1.8
AK 13 AK 735132 1.8
IA 54 IA 3090416 1.7
MS 48 MS 2991207 1.6
AR 48 AR 2959373 1.6
VT 10 VT 626630 1.6
SD 11 SD 844877 1.3
OK 45 OK 3850568 1.2
ND 7 ND 723393 1.0
PR 2 PR 3615086 0.1
nan 1720 nan NaN NaN
ON 4 ON NaN NaN
427 1 427 NaN NaN

In [ ]: