In [6]:
%matplotlib inline
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

In [7]:
df = pd.read_csv("New_York_City_Leading_Causes_of_Death.csv")

In [62]:
df


Out[62]:
Year Ethnicity Sex Cause of Death Count Percent
0 2010 NON-HISPANIC BLACK MALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 297 5
1 2010 NON-HISPANIC BLACK MALE INFLUENZA AND PNEUMONIA 201 3
2 2010 NON-HISPANIC BLACK MALE INTENTIONAL SELF-HARM (SUICIDE) 64 1
3 2010 NON-HISPANIC BLACK MALE MALIGNANT NEOPLASMS 1540 23
4 2010 NON-HISPANIC BLACK MALE MENTAL DISORDERS DUE TO USE OF ALCOHOL 50 1
5 2010 NON-HISPANIC BLACK MALE NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 70 1
6 2010 NON-HISPANIC BLACK MALE PEPTIC ULCER 13 0
7 2010 NON-HISPANIC BLACK MALE PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 111 2
8 2010 NON-HISPANIC BLACK MALE SEPTICEMIA 36 1
9 2010 NON-HISPANIC BLACK MALE SHORT GESTATION/LBW 35 1
10 2010 NON-HISPANIC BLACK MALE VIRAL HEPATITIS 49 1
11 2010 NON-HISPANIC WHITE FEMALE ACCIDENTS EXCEPT DRUG POISONING 164 1
12 2010 NON-HISPANIC WHITE FEMALE ALZHEIMERS DISEASE 247 2
13 2010 NON-HISPANIC WHITE FEMALE AORTIC ANEURYSM AND DISSECTION 22 0
14 2010 NON-HISPANIC WHITE FEMALE ATHEROSCLEROSIS 74 1
15 2010 NON-HISPANIC WHITE FEMALE BENIGN AND UNCERTAIN NEOPLASMS 69 1
16 2010 NON-HISPANIC WHITE FEMALE CEREBROVASCULAR DISEASE 382 3
17 2010 NON-HISPANIC WHITE FEMALE CHOLELITHIASIS AND DISORDERS OF GALLBLADDER 19 0
18 2010 NON-HISPANIC WHITE FEMALE CHRONIC LIVER DISEASE AND CIRRHOSIS 67 0
19 2010 NON-HISPANIC WHITE FEMALE CHRONIC LOWER RESPIRATORY DISEASES 502 4
20 2010 NON-HISPANIC WHITE FEMALE CONGENITAL MALFORMATIONS,DEFORMATIONS 32 0
21 2010 NON-HISPANIC WHITE FEMALE DIABETES MELLITUS 258 2
22 2010 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 5351 40
23 2010 NON-HISPANIC WHITE FEMALE ESSENTIAL HYPERTENSION AND RENAL DISEASES 219 2
24 2010 NON-HISPANIC WHITE FEMALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 24 0
25 2010 NON-HISPANIC WHITE FEMALE INFLUENZA AND PNEUMONIA 707 5
26 2010 NON-HISPANIC WHITE FEMALE INTENTIONAL SELF-HARM (SUICIDE) 61 0
27 2010 NON-HISPANIC WHITE FEMALE MALIGNANT NEOPLASMS 3438 25
28 2010 NON-HISPANIC WHITE FEMALE MENTAL DISORDERS DUE TO USE OF ALCOHOL 16 0
29 2010 NON-HISPANIC WHITE FEMALE NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 90 1
... ... ... ... ... ... ...
3810 2008 NON-HISPANIC BLACK MALE ESSENTIAL HYPERTENSION AND RENAL DISEASES 134 2
3811 2008 NON-HISPANIC BLACK MALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 356 5
3812 2008 NON-HISPANIC BLACK MALE INFLUENZA AND PNEUMONIA 222 3
3813 2008 NON-HISPANIC BLACK MALE INTENTIONAL SELF-HARM (SUICIDE) 45 1
3814 2008 NON-HISPANIC BLACK MALE MALIGNANT NEOPLASMS 1464 22
3815 2008 NON-HISPANIC BLACK MALE MENTAL DISORDERS DUE TO USE OF ALCOHOL 37 1
3816 2008 NON-HISPANIC BLACK MALE NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 54 1
3817 2008 NON-HISPANIC BLACK MALE PEPTIC ULCER 17 0
3818 2008 NON-HISPANIC BLACK MALE PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 132 2
3819 2008 NON-HISPANIC BLACK MALE RESPIRATORY DISTRESS OF NEWBORN 18 0
3820 2008 NON-HISPANIC BLACK MALE SEPTICEMIA 49 1
3821 2008 NON-HISPANIC BLACK MALE SHORT GESTATION/LBW 22 0
3822 2008 NON-HISPANIC BLACK MALE VIRAL HEPATITIS 62 1
3823 2008 NON-HISPANIC WHITE FEMALE ACCIDENTS EXCEPT DRUG POISONING 193 1
3824 2008 NON-HISPANIC WHITE FEMALE ALZHEIMERS DISEASE 151 1
3825 2008 NON-HISPANIC WHITE FEMALE AORTIC ANEURYSM AND DISSECTION 26 0
3826 2008 NON-HISPANIC WHITE FEMALE ATHEROSCLEROSIS 54 0
3827 2008 NON-HISPANIC WHITE FEMALE BENIGN AND UNCERTAIN NEOPLASMS 65 0
3828 2008 NON-HISPANIC WHITE FEMALE CEREBROVASCULAR DISEASE 379 3
3829 2008 NON-HISPANIC WHITE FEMALE CHOLELITHIASIS AND DISORDERS OF GALLBLADDER 20 0
3830 2008 NON-HISPANIC WHITE FEMALE CHRONIC LIVER DISEASE AND CIRRHOSIS 61 0
3831 2008 NON-HISPANIC WHITE FEMALE CHRONIC LOWER RESPIRATORY DISEASES 518 4
3832 2008 NON-HISPANIC WHITE FEMALE CONGENITAL MALFORMATIONS,DEFORMATIONS 42 0
3833 2008 NON-HISPANIC WHITE FEMALE DIABETES MELLITUS 210 1
3834 2008 NON-HISPANIC WHITE FEMALE DISEASES OF HEART 6836 48
3835 2008 NON-HISPANIC WHITE FEMALE ESSENTIAL HYPERTENSION AND RENAL DISEASES 151 1
3836 2008 NON-HISPANIC WHITE FEMALE HUMAN IMMUNODEFICIENCY VIRUS DISEASE 25 0
3837 2008 NON-HISPANIC WHITE FEMALE INFLUENZA AND PNEUMONIA 621 4
3838 2008 NON-HISPANIC WHITE FEMALE INTENTIONAL SELF-HARM (SUICIDE) 57 0
3839 2008 NON-HISPANIC WHITE FEMALE MALIGNANT NEOPLASMS 3366 24

3840 rows × 6 columns


In [7]:
df.columns


Out[7]:
Index(['Year', 'Ethnicity', 'Sex', 'Cause of Death', 'Count', 'Percent'], dtype='object')

0. How is this data distributed?

Extreme values are common in this dataframe, so we need to be careful when using measures of central tendency such as the mean.


In [131]:
df['Count'].plot.box()


Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x120999240>
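To put a number on that warning, here is a minimal sketch that compares the mean with the median and flags outliers using the 1.5 x IQR rule (the same rule a default box plot uses for its fliers). It runs on a small sample of Count values copied from the first rows of the table above, not on the full file, so the figures are only illustrative.

```python
import pandas as pd

# Count values copied from the first rows of Out[62]
# (2010, NON-HISPANIC BLACK, MALE); a sample, not the full dataset.
counts = pd.Series([297, 201, 64, 1540, 50, 70, 13, 111, 36, 35, 49], name="Count")

mean_count = counts.mean()
median_count = counts.median()

# Tukey's rule: anything further than 1.5 * IQR from the quartiles is flagged.
q1, q3 = counts.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = counts[(counts < q1 - 1.5 * iqr) | (counts > q3 + 1.5 * iqr)]

print(f"mean={mean_count:.1f}, median={median_count:.1f}")
print("flagged as outliers:", outliers.tolist())
```

In this sample a single large value pulls the mean well above the median, which is why the median is the safer summary here.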

1. In which year did the most New Yorkers die?

2007 was the year with the most reported deaths. However, no single year was far from the mean.


In [72]:
df.groupby('Year')['Count'].sum().sort_values(ascending=False)


Out[72]:
Year
2007    196508
2008    195500
2009    190772
2010    186092
2011    184276
Name: Count, dtype: int64
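Every yearly total above is a multiple of 4. That may be a coincidence, but it could also hint that rows are repeated in the CSV, so a duplicate check seems worthwhile before trusting the sums. The sketch below runs on a tiny made-up frame (the repeated row is an assumption for illustration, not a finding about the real file); on the notebook's data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row frame with one repeated row, standing in for the real file.
sample = pd.DataFrame(
    {
        "Year": [2010, 2010, 2010],
        "Ethnicity": ["NON-HISPANIC BLACK"] * 3,
        "Sex": ["MALE"] * 3,
        "Cause of Death": ["PEPTIC ULCER", "PEPTIC ULCER", "SEPTICEMIA"],
        "Count": [13, 13, 36],
        "Percent": [0, 0, 1],
    }
)

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("total after dropping duplicates:", deduplicated["Count"].sum())
```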

In [86]:
df.groupby('Year')['Count'].sum().mean()


Out[86]:
190629.60000000001
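To check the claim that no year is far from the mean, the sketch below expresses each yearly total as a percentage deviation from the five-year mean. The totals are copied from Out[72]; the variable names are my own.

```python
import pandas as pd

# Yearly totals copied from Out[72] above.
yearly = pd.Series(
    {2007: 196508, 2008: 195500, 2009: 190772, 2010: 186092, 2011: 184276},
    name="Count",
)

# Signed distance of each year from the five-year mean, in percent.
pct_from_mean = (yearly - yearly.mean()) / yearly.mean() * 100
print(pct_from_mean.round(2))
```

With these totals every year stays within a few percent of the mean.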

In [94]:
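fig, ax = plt.subplots(figsize=(9, 6))
yearly_totals = df.groupby('Year')['Count'].sum()
yearly_totals.plot.barh(ax=ax)
mean = yearly_totals.mean()
# Vertical reference line at the mean of the yearly totals
ax.axvline(mean, c='blue', linestyle="-", linewidth=0.5)
# Label passed positionally: newer matplotlib releases renamed the `s` keyword to `text`
ax.annotate(f"Mean deaths registered: {mean:,.1f}", xy=(120000, 0), color='blue')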


Out[94]:
<matplotlib.text.Annotation at 0x11cb0d048>

2. Who is more likely to die, a male New Yorker or a female New Yorker?

Surprisingly, more deaths were recorded for women than for men in NYC. These are raw counts, so they do not account for the size of each population.


In [70]:
df.groupby('Sex')['Count'].sum()


Out[70]:
Sex
FEMALE    484024
MALE      469124
Name: Count, dtype: int64
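The size of the gap is easier to judge as a share of all recorded deaths. A minimal sketch using the two totals from Out[70]:

```python
import pandas as pd

# Totals copied from Out[70] above.
by_sex = pd.Series({"FEMALE": 484024, "MALE": 469124}, name="Count")

# Each sex as a fraction of all recorded deaths.
share = by_sex / by_sex.sum()
print((share * 100).round(2))
```

The split comes out close to even, with women slightly above half.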

3. Is the sex difference in the data the same for all available years?

The difference is not large, but in every year more deaths were recorded for women than for men in NYC.


In [119]:
df.groupby(['Year', 'Sex'])['Count'].sum()


Out[119]:
Year  Sex   
2007  FEMALE    100528
      MALE       95980
2008  FEMALE    100040
      MALE       95460
2009  FEMALE     96716
      MALE       94056
2010  FEMALE     93392
      MALE       92700
2011  FEMALE     93348
      MALE       90928
Name: Count, dtype: int64
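Reshaping the grouped result so that each sex is a column makes the yearly gap explicit. The sketch below rebuilds the series from the values in Out[119]; the `gap` and `gap_pct` column names are my own.

```python
import pandas as pd

# Values copied from Out[119] above, in (Year, Sex) order.
index = pd.MultiIndex.from_product(
    [[2007, 2008, 2009, 2010, 2011], ["FEMALE", "MALE"]], names=["Year", "Sex"]
)
by_year_sex = pd.Series(
    [100528, 95980, 100040, 95460, 96716, 94056, 93392, 92700, 93348, 90928],
    index=index,
    name="Count",
)

# One row per year, one column per sex.
wide = by_year_sex.unstack("Sex")
wide["gap"] = wide["FEMALE"] - wide["MALE"]
wide["gap_pct"] = (wide["gap"] / wide["MALE"] * 100).round(2)
print(wide)
```

The gap is positive in every year but varies in size, and it is smallest in 2010.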

In [127]:
fig, ax = plt.subplots(figsize=(9, 7))
df.groupby(['Year', 'Sex'])['Count'].sum().plot(ax=ax, color=['darkred', 'blue'], kind='bar', title="Deaths by sex over time")
ax.set_ylabel('Total deaths')
ax.set_xlabel('Year, Sex')  # was a second set_ylabel call, which overwrote the y-axis label
ax.set_ylim((0,110000))


Out[127]:
(0, 110000)

4. Which are the most common causes of death for men and women? Is there a difference?

The major causes of death, heart disease and cancer, behave similarly for both sexes: the same top 3 causes lead for men and for women.


In [157]:
df.groupby(['Cause of Death', 'Sex'])['Count'].sum().sort_values(ascending=False).head(6)


Out[157]:
Cause of Death           Sex   
DISEASES OF HEART        FEMALE    208248
                         MALE      177052
MALIGNANT NEOPLASMS      FEMALE    133096
                         MALE      129292
INFLUENZA AND PNEUMONIA  FEMALE     24820
                         MALE       21696
Name: Count, dtype: int64

5. What about other diseases? Are there any sex differences?

Cerebrovascular disease, diabetes, and chronic lower respiratory diseases are more common in women than in men. On the other hand, HIV and deaths by accident (not related to drugs) are more common in men.


In [160]:
df.groupby(['Cause of Death', 'Sex'])['Count'].sum().sort_values(ascending=False)


Out[160]:
Cause of Death                                    Sex   
DISEASES OF HEART                                 FEMALE    208248
                                                  MALE      177052
MALIGNANT NEOPLASMS                               FEMALE    133096
                                                  MALE      129292
INFLUENZA AND PNEUMONIA                           FEMALE     24820
                                                  MALE       21696
CEREBROVASCULAR DISEASE                           FEMALE     18080
DIABETES MELLITUS                                 FEMALE     17592
CHRONIC LOWER RESPIRATORY DISEASES                FEMALE     17260
DIABETES MELLITUS                                 MALE       15428
CHRONIC LOWER RESPIRATORY DISEASES                MALE       14584
CEREBROVASCULAR DISEASE                           MALE       13012
ACCIDENTS EXCEPT DRUG POISONING                   MALE       12856
HUMAN IMMUNODEFICIENCY VIRUS DISEASE              MALE       12224
PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING  MALE       10492
ESSENTIAL HYPERTENSION AND RENAL DISEASES         FEMALE     10384
ASSAULT (HOMICIDE)                                MALE        8684
ESSENTIAL HYPERTENSION AND RENAL DISEASES         MALE        7924
INTENTIONAL SELF-HARM (SUICIDE)                   MALE        7100
CHRONIC LIVER DISEASE AND CIRRHOSIS               MALE        6960
ACCIDENTS EXCEPT DRUG POISONING                   FEMALE      6904
ALZHEIMERS DISEASE                                FEMALE      6724
HUMAN IMMUNODEFICIENCY VIRUS DISEASE              FEMALE      6304
VIRAL HEPATITIS                                   MALE        4948
NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS       FEMALE      4320
                                                  MALE        4084
PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING  FEMALE      4032
SEPTICEMIA                                        FEMALE      3692
MENTAL DISORDERS DUE TO USE OF ALCOHOL            MALE        3380
CHRONIC LIVER DISEASE AND CIRRHOSIS               FEMALE      3100
SEPTICEMIA                                        MALE        3032
CONGENITAL MALFORMATIONS,DEFORMATIONS             MALE        2684
ALZHEIMERS DISEASE                                MALE        2660
BENIGN AND UNCERTAIN NEOPLASMS                    MALE        2460
                                                  FEMALE      2444
INTENTIONAL SELF-HARM (SUICIDE)                   FEMALE      2408
VIRAL HEPATITIS                                   FEMALE      2336
CONGENITAL MALFORMATIONS,DEFORMATIONS             FEMALE      2168
ATHEROSCLEROSIS                                   FEMALE      2132
AORTIC ANEURYSM AND DISSECTION                    MALE        2044
PARKINSONS DISEASE                                MALE        1864
ASSAULT (HOMICIDE)                                FEMALE      1584
ATHEROSCLEROSIS                                   MALE        1520
SHORT GESTATION/LBW                               MALE        1132
PARKINSONS DISEASE                                FEMALE      1128
AORTIC ANEURYSM AND DISSECTION                    FEMALE      1112
SHORT GESTATION/LBW                               FEMALE       992
PEPTIC ULCER                                      MALE         804
                                                  FEMALE       768
PREGNANCY, CHILDBIRTH AND THE PUERPERIUM          FEMALE       628
MENTAL DISORDERS DUE TO USE OF ALCOHOL            FEMALE       592
CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD      MALE         540
ANEMIAS                                           MALE         416
CHOLELITHIASIS AND DISORDERS OF GALLBLADDER       FEMALE       412
ANEMIAS                                           FEMALE       404
CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD      FEMALE       328
RESPIRATORY DISTRESS OF NEWBORN                   MALE         100
CHOLELITHIASIS AND DISORDERS OF GALLBLADDER       MALE          92
TUBERCULOSIS                                      MALE          60
PNEUMONITIS DUE TO SOLIDS AND LIQUIDS             FEMALE        32
Name: Count, dtype: int64
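The long sorted list above makes it hard to compare the two sexes cause by cause. One option, sketched here on a handful of totals copied from Out[160], is to pivot the data so each cause has a FEMALE and a MALE column and then compute a female-to-male ratio. The `female_to_male` column name is my own.

```python
import pandas as pd

# A few totals copied from Out[160] above; not the full table.
rows = [
    ("CEREBROVASCULAR DISEASE", "FEMALE", 18080),
    ("CEREBROVASCULAR DISEASE", "MALE", 13012),
    ("DIABETES MELLITUS", "FEMALE", 17592),
    ("DIABETES MELLITUS", "MALE", 15428),
    ("ACCIDENTS EXCEPT DRUG POISONING", "FEMALE", 6904),
    ("ACCIDENTS EXCEPT DRUG POISONING", "MALE", 12856),
    ("HUMAN IMMUNODEFICIENCY VIRUS DISEASE", "FEMALE", 6304),
    ("HUMAN IMMUNODEFICIENCY VIRUS DISEASE", "MALE", 12224),
]
sample = pd.DataFrame(rows, columns=["Cause of Death", "Sex", "Count"])

# One row per cause, one column per sex.
by_cause = sample.pivot(index="Cause of Death", columns="Sex", values="Count")

# Ratio above 1: more female deaths recorded; below 1: more male deaths.
by_cause["female_to_male"] = by_cause["FEMALE"] / by_cause["MALE"]
print(by_cause.sort_values("female_to_male", ascending=False))
```

A ratio keeps small and large causes on the same scale, which the raw counts do not.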

6. What is the leading cause of death?

NYC has a real heart problem


In [128]:
# Diseases of the heart rank first
df.groupby('Cause of Death')['Count'].sum().sort_values(ascending=False).head(1)


Out[128]:
Cause of Death
DISEASES OF HEART    385300
Name: Count, dtype: int64

7. What other diseases are frequent among New Yorkers?

Top 10 causes of death in NYC:


In [111]:
df.groupby('Cause of Death')['Count'].sum().sort_values(ascending=False).head(10)


Out[111]:
Cause of Death
DISEASES OF HEART                                   385300
MALIGNANT NEOPLASMS                                 262388
INFLUENZA AND PNEUMONIA                              46516
DIABETES MELLITUS                                    33020
CHRONIC LOWER RESPIRATORY DISEASES                   31844
CEREBROVASCULAR DISEASE                              31092
ACCIDENTS EXCEPT DRUG POISONING                      19760
HUMAN IMMUNODEFICIENCY VIRUS DISEASE                 18528
ESSENTIAL HYPERTENSION AND RENAL DISEASES            18308
PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING     14524
Name: Count, dtype: int64

Heart disease and cancer are by far the most common causes of death!


In [104]:
fig, ax = plt.subplots(figsize=(9, 7))
df.groupby('Cause of Death')['Count'].sum().sort_values(ascending=True).plot.barh()
ax.set_xlim((0,400000))


Out[104]:
(0, 400000)
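To put "by far" into numbers, the sketch below turns the top 10 totals from Out[111] into shares of all recorded deaths. The denominator is the sum of the yearly totals in Out[72]; the variable names are my own.

```python
import pandas as pd

# Top 10 totals copied from Out[111] above.
top_causes = pd.Series(
    {
        "DISEASES OF HEART": 385300,
        "MALIGNANT NEOPLASMS": 262388,
        "INFLUENZA AND PNEUMONIA": 46516,
        "DIABETES MELLITUS": 33020,
        "CHRONIC LOWER RESPIRATORY DISEASES": 31844,
        "CEREBROVASCULAR DISEASE": 31092,
        "ACCIDENTS EXCEPT DRUG POISONING": 19760,
        "HUMAN IMMUNODEFICIENCY VIRUS DISEASE": 18528,
        "ESSENTIAL HYPERTENSION AND RENAL DISEASES": 18308,
        "PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING": 14524,
    },
    name="Count",
)

# Sum of the five yearly totals shown in Out[72].
total_deaths = 196508 + 195500 + 190772 + 186092 + 184276

share_pct = top_causes / total_deaths * 100
print(share_pct.round(1))
print("heart + cancer:", round(share_pct.iloc[:2].sum(), 1), "%")
```

The two leading causes together account for roughly two thirds of the recorded deaths.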

8. Which ethnicity is overall more vulnerable to diseases in NYC?

According to the number of cases reported, non-Hispanic white New Yorkers account for the most deaths. These are raw counts rather than mortality rates, since they are not adjusted for population size.


In [76]:
df.groupby('Ethnicity')['Count'].sum().sort_values(ascending=False)


Out[76]:
Ethnicity
NON-HISPANIC WHITE          483900
NON-HISPANIC BLACK          249908
HISPANIC                    164048
ASIAN & PACIFIC ISLANDER     55292
Name: Count, dtype: int64

In [79]:
df.groupby('Ethnicity')['Count'].sum().plot.barh(color=['Black', 'Black', 'Black', 'darkred'])


Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x116f45128>
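Comparing vulnerability between groups needs rates rather than counts, and population figures are not part of this dataset. The sketch below only shows the arithmetic: `deaths_per_100k` is a hypothetical helper, and both the group names and every number in the example are made-up placeholders, not NYC data.

```python
def deaths_per_100k(deaths: int, population: int, years: int = 1) -> float:
    """Average annual deaths per 100,000 residents."""
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return deaths / years / population * 100_000


# Placeholder groups with invented counts and populations (NOT real data).
placeholder_groups = {
    "GROUP A": {"deaths": 400, "population": 10_000},
    "GROUP B": {"deaths": 250, "population": 5_000},
}

rates = {
    name: deaths_per_100k(g["deaths"], g["population"])
    for name, g in placeholder_groups.items()
}
print(rates)
```

In this invented example the group with fewer deaths has the higher rate, which is why the ranking by raw counts should not be read as a ranking by risk.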

9. Is the distribution the same across ethnicities? What about white New Yorkers?

Data on the white population is very similar to the overall outcome.


In [126]:
fig, ax = plt.subplots(figsize=(9, 7))
only_whites = df[df['Ethnicity'] == 'NON-HISPANIC WHITE']
only_whites.groupby('Cause of Death')['Count'].sum().sort_values(ascending=True).plot.barh(color='blue')


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x118e73d68>

10. Is it the same for African Americans?

The distribution is the same, except for an increase in deaths caused by diabetes and HIV.


In [129]:
fig, ax = plt.subplots(figsize=(9, 7))
only_blacks = df[df['Ethnicity'] == 'NON-HISPANIC BLACK']
only_blacks.groupby('Cause of Death')['Count'].sum().sort_values(ascending=True).plot.barh(color='black')


Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x1190863c8>

11. What about Hispanics and Asians?

Similar results for the Hispanic population.


In [132]:
fig, ax = plt.subplots(figsize=(9, 7))
only_hispanic = df[df['Ethnicity'] == 'HISPANIC']
only_hispanic.groupby('Cause of Death')['Count'].sum().sort_values(ascending=True).plot.barh(color='brown')


Out[132]:
<matplotlib.axes._subplots.AxesSubplot at 0x119330630>

Asians and Pacific Islanders are by far the most different population in terms of causes of death. This group has the highest incidence of death by cancer.


In [140]:
only_asians = df[df['Ethnicity'] == 'ASIAN & PACIFIC ISLANDER']

In [138]:
fig, ax = plt.subplots(figsize=(9, 7))
only_asians.groupby('Cause of Death')['Count'].sum().sort_values(ascending=True).plot.barh(color='green')


Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x1198c5a90>
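The four charts above show raw counts, so groups of different sizes are hard to compare directly. One possible alternative is a single table of within-group shares, where each cause is divided by its ethnicity's total. The sketch below uses a few 2010 rows copied from Out[62]; note that in that excerpt the NON-HISPANIC BLACK rows are male and the NON-HISPANIC WHITE rows are female, so the output illustrates the technique only.

```python
import pandas as pd

# 2010 rows copied from Out[62]; a small sample, not the full data.
# In that excerpt the BLACK rows are male and the WHITE rows are female.
sample = pd.DataFrame(
    [
        ("NON-HISPANIC BLACK", "HUMAN IMMUNODEFICIENCY VIRUS DISEASE", 297),
        ("NON-HISPANIC BLACK", "INFLUENZA AND PNEUMONIA", 201),
        ("NON-HISPANIC BLACK", "MALIGNANT NEOPLASMS", 1540),
        ("NON-HISPANIC WHITE", "HUMAN IMMUNODEFICIENCY VIRUS DISEASE", 24),
        ("NON-HISPANIC WHITE", "INFLUENZA AND PNEUMONIA", 707),
        ("NON-HISPANIC WHITE", "MALIGNANT NEOPLASMS", 3438),
    ],
    columns=["Ethnicity", "Cause of Death", "Count"],
)

totals = sample.groupby(["Ethnicity", "Cause of Death"])["Count"].sum()

# Divide each cause by its ethnicity's total so every group sums to 1.
shares = totals / totals.groupby(level="Ethnicity").transform("sum")
print(shares.unstack("Ethnicity").round(3))
```

On the notebook's data the same three lines would be applied to `df` instead of `sample`, giving one column per ethnicity that can be compared side by side.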
