How many females and males in each age group missed their appointments?
Did age, regardless of age group and sex, determine whether patients missed their appointments?
Did women and children prefer to attend their appointments?
Did a scholarship help patients attend their appointments?
In [84]:
# Render plots inline
%matplotlib inline
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for all graphs
sns.set(style="whitegrid")
# Read in the dataset, create dataframe
appointment_data = pd.read_csv('noshow.csv')
In [61]:
# Print the first few records to review data and format
appointment_data.head()
Out[61]:
In [62]:
# Print the Last few records to review data and format
appointment_data.tail()
Out[62]:
Note that the PatientId column is displayed in exponential notation. Note also that No-show is No if the patient attended and Yes if the patient did not attend.
From the data description and the questions to answer, I've determined that some of the dataset columns are not needed for the analysis and will therefore be removed. This keeps the dataframe smaller and the analysis faster.
I'll take a three-step approach to data cleanup.
Step 1 - Remove duplicate entries. Concluded that no duplicate entries exist, based on the tests below.
In [63]:
# Identify and remove duplicate entries
appointment_data_duplicates = appointment_data.duplicated()
print 'Number of duplicate entries is/are {}'.format(appointment_data_duplicates.sum())
In [64]:
# Let's make sure that duplicated() is working by testing it on a single column
duplication_test = appointment_data.duplicated('Age').head()
print 'Number of entries with duplicate age in top entries are {}'.format(duplication_test.sum())
appointment_data.head()
Out[64]:
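As a minimal sketch of what the duplicate check in Step 1 does, the toy frame below (hypothetical values, not rows from noshow.csv) shows that `duplicated()` with no arguments flags only rows that repeat an earlier row in every column, while passing a column name flags repeats in that column alone.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'F'],
    'Age': [62, 56, 62, 8],
})

# Full-row duplicates: only the third row repeats an earlier row exactly
full_row_duplicates = toy.duplicated().sum()

# Duplicates on a single column: the third row repeats Age 62
age_duplicates = toy.duplicated('Age').sum()

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates()

print(full_row_duplicates)
print(age_duplicates)
print(len(deduplicated))
```

A count of zero from the full-row check is what supports the conclusion that no cleanup is needed for duplicates.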
Step 2 - Remove unnecessary columns
Columns (PatientId, ScheduledDay, SMS_received, AppointmentID, AppointmentDay) removed
In [65]:
# Create new dataset without unwanted columns
clean_appointment_data = appointment_data.drop(['PatientId','ScheduledDay','SMS_received','AppointmentID','AppointmentDay'], axis=1)
clean_appointment_data.head()
Out[65]:
Step 3 - Fix any missing data or data format issues. Concluded that there is no missing data.
In [66]:
# Calculate number of missing values
clean_appointment_data.isnull().sum()
Out[66]:
In [67]:
# Taking a look at the datatypes
clean_appointment_data.info()
In [68]:
# Looking at some typical descriptive statistics
clean_appointment_data.describe()
Out[68]:
In [69]:
# The minimum Age of -1.0 looks wrong, so take a closer look
clean_appointment_data[clean_appointment_data['Age'] == -1]
Out[69]:
In [70]:
# Fixing the negative value and creating a new column named Fixed_Age.
clean_appointment_data['Fixed_Age'] = clean_appointment_data['Age'].abs()
In [71]:
# Check that no negative ages remain after the fix (an empty result is expected)
clean_appointment_data[clean_appointment_data['Fixed_Age'] < 0]
Out[71]:
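Taking the absolute value assumes that the -1 was a sign error for a one-year-old patient, which the data alone cannot confirm. A hedged alternative, sketched here on hypothetical toy values, is to drop rows with a negative age instead. Which option is appropriate depends on what is known about the source record.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({'Age': [62, -1, 8, 0]})

# Option used in this notebook: flip the sign of negative ages
fixed_age = toy['Age'].abs()

# Alternative (an assumption, not the notebook's method): discard invalid rows
valid_only = toy[toy['Age'] >= 0]

print(fixed_age.tolist())
print(valid_only['Age'].tolist())
```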
In [72]:
# Create AgeGroups for further Analysis
'''bins = [0, 25, 50, 75, 100, 120]
group_names = ['0-25', '25-50', '50-75', '75-100', '100-120']
clean_appointment_data['age-group'] = pd.cut(clean_appointment_data['Fixed_Age'], bins, labels=group_names)
clean_appointment_data.head()'''
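# Round each age to the nearest ten and map it to an age-group label.
# np.round sends exact halves to the even ten (25 -> 20 but 35 -> 40),
# so the label edges are approximate.
clean_appointment_data['Age_rounded'] = np.round(clean_appointment_data['Fixed_Age'], -1)
categories_dict = {0: '0-5',
                   10: '5-15',
                   20: '15-25',
                   30: '25-35',
                   40: '35-45',
                   50: '45-55',
                   60: '55-65',
                   70: '65-75',
                   80: '75-85',
                   90: '85-95',
                   100: '95-105',
                   120: '105-115'}  # an Age of 115 rounds to 120
clean_appointment_data['age_group'] = clean_appointment_data['Age_rounded'].map(categories_dict)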
In [73]:
clean_appointment_data['age_group']
Out[73]:
Creating age_group and adding it to the dataset will help answer Q1: How many females and males in each age group missed their appointments?
In [74]:
# Simplify the analysis by recoding the No-show column.
# In No-show, No means the patient attended the appointment and Yes means they did not.
# Recode Yes to 0 and No to 1 so that 1 always means the patient showed up
clean_appointment_data['people_showed_up'] = clean_appointment_data['No-show'].replace(['Yes', 'No'], [0, 1])
clean_appointment_data
Out[74]:
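A short sketch of the same recoding on hypothetical toy values is shown below. It uses `map` with a dictionary, which is one possible alternative to `replace`: any value other than 'Yes' or 'No' would become NaN and so be easy to spot. Once the column is 0/1, its mean is the attendance rate.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({'No-show': ['No', 'Yes', 'No', 'No']})

# 'No' means the patient attended, so it maps to 1; 'Yes' maps to 0
toy['people_showed_up'] = toy['No-show'].map({'No': 1, 'Yes': 0})

# The mean of a 0/1 column is the share of appointments attended
attendance_rate = toy['people_showed_up'].mean()

print(toy['people_showed_up'].tolist())
print(attendance_rate)
```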
In [75]:
# Taking a look at the age of people who showed up and those who missed the appointment
youngest_to_showup = clean_appointment_data[clean_appointment_data['people_showed_up'] == True]['Fixed_Age'].min()
youngest_to_miss = clean_appointment_data[clean_appointment_data['people_showed_up'] == False]['Fixed_Age'].min()
oldest_to_showup = clean_appointment_data[clean_appointment_data['people_showed_up'] == True]['Fixed_Age'].max()
oldest_to_miss = clean_appointment_data[clean_appointment_data['people_showed_up'] == False]['Fixed_Age'].max()
print 'Youngest to Show up: {} \nYoungest to Miss: {} \nOldest to Show Up: {} \nOldest to Miss: {}'.format(
youngest_to_showup, youngest_to_miss, oldest_to_showup, oldest_to_miss)
In [76]:
# Return the percentage of patients of the given age group and gender
# who attended their appointment
def people_visited(age_group, gender):
    grouped_by_total = clean_appointment_data.groupby(['age_group', 'Gender']).size()[age_group,gender].astype('float')
    grouped_by_visiting_gender = \
        clean_appointment_data.groupby(['age_group', 'people_showed_up', 'Gender']).size()[age_group,1,gender].astype('float')
    visited_gender_pct = (grouped_by_visiting_gender / grouped_by_total * 100).round(2)
    return visited_gender_pct
In [77]:
# Get the actual numbers grouped by age group, attendance and gender
groupedby_visitors = clean_appointment_data.groupby(['age_group','people_showed_up','Gender']).size()
# Print - Grouped by age group, patients showing up for their appointments and gender
print groupedby_visitors
# Print the attendance percentage for each age group and gender
age_groups = ['0-5', '5-15', '15-25', '25-35', '35-45', '45-55',
              '55-65', '65-75', '75-85', '85-95', '95-105']
for group in age_groups:
    print '{} - Female Appointment Attendance: {}%'.format(group, people_visited(group, 'F'))
    print '{} - Male Appointment Attendance: {}%'.format(group, people_visited(group, 'M'))
# Only the female percentage is printed for the 105-115 group
print '105-115 - Female Appointment Attendance: {}%'.format(people_visited('105-115','F'))
In [78]:
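# Graph - Attendance rate by gender, one panel per age group
# order is set explicitly so the bars match the tick labels assigned below
g = sns.factorplot(x="Gender", y="people_showed_up", col="age_group", data=clean_appointment_data,
                   order=["M", "F"], saturation=4, kind="bar", ci=None, size=12, aspect=.35)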
# Fix up the labels
(g.set_axis_labels('', 'People Visited')
.set_xticklabels(["Men", "Women"], fontsize = 30)
.set_titles("Age Group {col_name}")
.set(ylim=(0, 1))
.despine(left=True, bottom=True))
Out[78]:
The graph above shows, for each age group, the proportion of men and women who attended their appointments at the hospital.
In [79]:
# Graph - Count of appointments by attendance, age group and gender
# col_order is set explicitly so the panels match the titles assigned below
g = sns.factorplot('people_showed_up', col='Gender', col_order=['M', 'F'], hue='age_group', data=clean_appointment_data, kind='count', size=15, aspect=.6)
# Fix up the labels
(g.set_axis_labels('People Who Attended', 'No. of Appointments')
    .set_xticklabels(["False", "True"], fontsize=20)
    .set_titles('{col_name}')
)
titles = ['Men', 'Women']
for ax, title in zip(g.axes.flat, titles):
    ax.set_title(title)
The graph above shows the number of patients who attended their appointments and the number who did not, split by gender.
The bars are colored by age group.
Based on the raw numbers, it would appear that the 65-75 age group is the most health-conscious, because it has the highest percentage of appointment attendance, followed by the 55-65 age group, which is only about 1 percentage point lower.
The age group with the lowest percentage of appointment attendance is 15-25.
Note that 105-115 is not counted as the lowest, because the number of patients in that age group is too small for a meaningful comparison.
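The per-group attendance percentages discussed above can also be computed in a single pass. The following is a minimal sketch on hypothetical toy rows, assuming the same column names as `clean_appointment_data`: because `people_showed_up` is 0 or 1, its mean within each group is the attendance rate.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'age_group': ['0-5', '0-5', '0-5', '5-15', '5-15', '5-15'],
    'Gender': ['F', 'F', 'M', 'F', 'M', 'M'],
    'people_showed_up': [1, 0, 1, 1, 1, 0],
})

# Mean of the 0/1 column per (age_group, Gender), expressed as a percentage
attendance_pct = (
    toy.groupby(['age_group', 'Gender'])['people_showed_up']
    .mean()
    .mul(100)
    .round(2)
    .unstack('Gender')
)

print(attendance_pct)
```

The result is a table with one row per age group and one column per gender. A group with no patients of a given gender shows up as NaN rather than raising an error.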
Did age, regardless of gender, determine whether patients missed their appointments?
In [85]:
# Find the total number of people who showed up and those who missed their appointments
number_showed_up = clean_appointment_data[clean_appointment_data['people_showed_up'] == True]['people_showed_up'].count()
number_missed = clean_appointment_data[clean_appointment_data['people_showed_up'] == False]['people_showed_up'].count()
# Find the mean age of people who showed up and of those who missed their appointments
# (Fixed_Age is used so that the corrected negative value is included consistently)
mean_age_showed_up = clean_appointment_data[clean_appointment_data['people_showed_up'] == True]['Fixed_Age'].mean()
mean_age_missed = clean_appointment_data[clean_appointment_data['people_showed_up'] == False]['Fixed_Age'].mean()
# Displaying a few Totals
print 'Total number of People Who Showed Up {} \n\
Total number of People who missed the appointment {} \n\
Mean age of people who Showed up {} \n\
Mean age of people who missed the appointment {} \n\
Oldest to show up {} \n\
Oldest to miss the appointment {}' \
.format(number_showed_up, number_missed, np.round(mean_age_showed_up),
np.round(mean_age_missed), oldest_to_showup, oldest_to_miss)
# Graph - Age distribution of patients by attendance and gender
g = sns.factorplot(x="people_showed_up", y="Fixed_Age", hue='Gender', data=clean_appointment_data, kind="box", size=7, aspect=.8)
# Fixing the labels
(g.set_axis_labels('Appointment Attendance', 'Age of Patients')
.set_xticklabels(["False", "True"])
)
Out[85]:
Based on the boxplot and the calculated data above, it would appear that:
Did women and children prefer to attend their appointments?
Assumption: 'child' is not classified in the data, so I need to assume a cutoff point. I'll use today's common standard and treat anyone under 18 as a child rather than an adult.
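The cutoff described above can be sketched on hypothetical toy rows as follows. This version uses `np.select`, which is an alternative to assigning each category separately with `.loc`; the 'Unknown' default is an assumption added here for rows that match none of the conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Age': [62, 56, 8, 17],
})

# Conditions are checked in order, so the child test comes first
conditions = [toy['Age'] < 18, toy['Gender'] == 'F', toy['Gender'] == 'M']
choices = ['Child', 'Woman', 'Man']

# np.select picks the first matching condition for each row
toy['Category'] = np.select(conditions, choices, default='Unknown')

print(toy['Category'].tolist())
```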
In [86]:
# Create Category and Categorize people
clean_appointment_data.loc[
((clean_appointment_data['Gender'] == 'F') &
(clean_appointment_data['Age'] >= 18)),
'Category'] = 'Woman'
clean_appointment_data.loc[
((clean_appointment_data['Gender'] == 'M') &
(clean_appointment_data['Age'] >= 18)),
'Category'] = 'Man'
clean_appointment_data.loc[
(clean_appointment_data['Age'] < 18),
'Category'] = 'Child'
# Get the totals grouped by Men, Women and Children
print clean_appointment_data.groupby(['Category', 'people_showed_up']).size()
# Graph - Compare the number of men, women and children who showed up for their appointments
# col_order is set explicitly so the panels match the titles assigned below
g = sns.factorplot('people_showed_up', col='Category', col_order=['Woman', 'Man', 'Child'], data=clean_appointment_data, kind='count', size=7, aspect=0.8)
# Fix up the labels
(g.set_axis_labels('Appointment Attendance', 'No. of Patients')
    .set_xticklabels(['False', 'True'])
)
titles = ['Women', 'Men', 'Children']
for ax, title in zip(g.axes.flat, titles):
    ax.set_title(title)
Based on the calculated data and the Graphs, it would appear that:
Did a scholarship help patients attend their appointments?
In [87]:
# Determine the number of Man, Woman and Children who had scholarship
man_with_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Man') &
(clean_appointment_data['Scholarship'] == 1)]
man_without_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Man') &
(clean_appointment_data['Scholarship'] == 0)]
woman_with_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Woman') &
(clean_appointment_data['Scholarship'] == 1)]
woman_without_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Woman') &
(clean_appointment_data['Scholarship'] == 0)]
children_with_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Child') &
(clean_appointment_data['Scholarship'] == 1)]
children_without_scholarship = clean_appointment_data.loc[
(clean_appointment_data['Category'] == 'Child') &
(clean_appointment_data['Scholarship'] == 0)]
# Graph - Count of men, women and children with and without a scholarship
# The x axis is Scholarship (0 or 1), so the ticks are labeled by scholarship status
# col_order is set explicitly so the panels match the titles assigned below
g = sns.factorplot('Scholarship', col='Category', col_order=['Woman', 'Man', 'Child'], data=clean_appointment_data, kind='count', size=8, aspect=0.3)
# Fix up the labels
(g.set_axis_labels('Scholarship', 'No. of Patients')
    .set_xticklabels(['Without', 'With'])
)
titles = ['Women', 'Men', 'Children']
for ax, title in zip(g.axes.flat, titles):
    ax.set_title(title)
According to the bar graph above:
The conclusion is that the scholarship did not encourage more people to attend their appointments, regardless of their age or gender.
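A count plot of `Scholarship` on its own shows how many patients hold a scholarship, not whether they attended. One way to compare attendance directly is sketched below on hypothetical toy rows, assuming the same column names as `clean_appointment_data`: group by category and scholarship status, then take the mean of the 0/1 attendance column.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'Category': ['Woman', 'Woman', 'Woman', 'Woman', 'Man', 'Man'],
    'Scholarship': [1, 1, 0, 0, 1, 0],
    'people_showed_up': [1, 0, 1, 1, 1, 0],
})

# Attendance percentage for each (Category, Scholarship) pair
rate_by_scholarship = (
    toy.groupby(['Category', 'Scholarship'])['people_showed_up']
    .mean()
    .mul(100)
    .round(2)
)

print(rate_by_scholarship)
```

Comparing the two rates within each category is what would show whether scholarship holders attend more or less often than other patients.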
In [83]:
# Determine the total number of men, women and children
# (Scholarship is 0 or 1, so the `< 2` filter keeps every row in the category)
total_male_with_scholarship = clean_appointment_data.loc[
    (clean_appointment_data['Category'] == 'Man') &
    (clean_appointment_data['Scholarship'] < 2)]
total_female_with_scholarship = clean_appointment_data.loc[
    (clean_appointment_data['Category'] == 'Woman') &
    (clean_appointment_data['Scholarship'] < 2)]
total_child_with_scholarship = clean_appointment_data.loc[
    (clean_appointment_data['Category'] == 'Child') &
    (clean_appointment_data['Scholarship'] < 2)]
total_man_with_scholarship = total_male_with_scholarship.Scholarship.count()
total_woman_with_scholarship = total_female_with_scholarship.Scholarship.count()
total_children_with_scholarship = total_child_with_scholarship.Scholarship.count()
# Count the men, women and children who hold a scholarship.
# Note: these frames are filtered on Scholarship only, not on people_showed_up,
# so the *_attendence variables hold scholarship counts rather than attendance counts.
man_with_scholarship_attendence = man_with_scholarship.Scholarship.count()
woman_with_scholarship_attendence = woman_with_scholarship.Scholarship.sum()
children_with_scholarship_attendence = children_with_scholarship.Scholarship.sum()
# Determine the percentage of men, women and children who hold a scholarship
pct_man_with_scholarship_attendence = ((float(man_with_scholarship_attendence)/total_man_with_scholarship)*100)
pct_man_with_scholarship_attendence = np.round(pct_man_with_scholarship_attendence,2)
pct_woman_with_scholarship_attendence = ((float(woman_with_scholarship_attendence)/total_woman_with_scholarship)*100)
pct_woman_with_scholarship_attendence = np.round(pct_woman_with_scholarship_attendence,2)
pct_children_with_scholarship_attendence = ((float(children_with_scholarship_attendence)/total_children_with_scholarship)*100)
pct_children_with_scholarship_attendence = np.round(pct_children_with_scholarship_attendence,2)
# Determine the average age of men, women and children who hold a scholarship
man_with_scholarship_avg_age = np.round(man_with_scholarship.Age.mean())
woman_with_scholarship_avg_age = np.round(woman_with_scholarship.Age.mean())
children_with_scholarship_avg_age = np.round(children_with_scholarship.Age.mean())
# Display Results
print '1. Total number of Men: {}\n\
2. Total number of Women: {}\n\
3. Total number of Children: {}\n\
4. Men with Scholarship: {}\n\
5. Women with Scholarship: {}\n\
6. Children with Scholarship: {}\n\
7. Men without Scholarship: {}\n\
8. Women without Scholarship: {}\n\
9. Children without Scholarship: {}\n\
10. Percentage of Men with Scholarship: {}%\n\
11. Percentage of Women with Scholarship: {}%\n\
12. Percentage of Children with Scholarship: {}%\n\
13. Average Age of Men with Scholarship: {}\n\
14. Average Age of Women with Scholarship: {}\n\
15. Average Age of Children with Scholarship: {}'\
.format(total_man_with_scholarship, total_woman_with_scholarship, total_children_with_scholarship,
man_with_scholarship_attendence, woman_with_scholarship_attendence, children_with_scholarship_attendence,
total_man_with_scholarship-man_with_scholarship_attendence, total_woman_with_scholarship-woman_with_scholarship_attendence,
total_children_with_scholarship-children_with_scholarship_attendence,
pct_man_with_scholarship_attendence, pct_woman_with_scholarship_attendence, pct_children_with_scholarship_attendence,
man_with_scholarship_avg_age, woman_with_scholarship_avg_age, children_with_scholarship_avg_age)
Based on the analysis above, it would appear that the percentage of women with a scholarship (12.31%) is the highest, compared with 2.64% for men and 11.18% for children. The difference between women and men is large, about 9.67 percentage points.
However, among patients with a scholarship, men have the highest average age at 43 years, whereas the average ages of women and children are 39 years and 9 years respectively.
The results of the data analysis suggest that women are more health-conscious, whereas the health of men and children may be neglected, as they may not be taking their health seriously. Age did not seem to be a major factor. Men neglected their health the most by skipping their appointment dates with the hospital. Therefore, gender appears to be the most important factor in this data analysis.
However, many values that could have improved the analysis were not included in the dataset. For example, the dataset could have included the distance from the patient's neighbourhood to the hospital. That would have made it possible to analyse the data from the neighbourhood point of view as well, i.e. the distance a patient has to travel in order to attend an appointment.
The Age values included a negative value, which caused a problem in analysing the dataset, so the negative value was changed to a positive value with the abs() function.