Insights from medical posts

In this notebook, I try to identify characteristics that distinguish medical posts written by professionals from those written by the general public.

  1. What is the ratio of posts from professionals to posts from the general public?
  2. Which characteristics best separate professional-level posts?
    • Length of text
    • Use of offensive vocabulary
    • Writing level

In [1]:
# Set up paths: the data lives in ../data; keep this directory importable for local modules
import os
import sys

this_path = os.getcwd()
os.chdir("../data")            # read the CSV files from the data directory
sys.path.insert(0, this_path)  # keep the notebook directory on sys.path

In [2]:
# Load datasets
import pandas as pd

df = pd.read_csv("MedHelp-posts.csv", index_col=0)
df.head(2)


Out[2]:
post id | title                                    | text                                       | href                                              | user id      | mother post id
1       | Inappropriate Masterbation Down Syndrome | \n It is common for children and adoles... | http://www.medhelp.org//posts/Autism--Asperger... | user_340688  | 1
2       | Inappropriate Masterbation Down Syndrome | \n A related discussion, self injusry i... | http://www.medhelp.org//posts/Autism--Asperger... | user_1566928 | 1

In [3]:
df_users = pd.read_csv("MedHelp-users.csv", index_col=0)
df_users.head(2)


Out[3]:
user id      | user description
user_340688  | Rachel Thompson, Ph.D., BCBA
user_1566928 | CirclesLady29

In [4]:
# 1. Classify users as professionals (experts) vs. general public.
# Heuristic: expert profiles list credentials after a comma, e.g. "Rachel Thompson, Ph.D., BCBA".

df_users['is expert'] = 0

for user_id in df_users.index:
    user_description = df_users.loc[user_id, 'user description']
    if "," in user_description:
        print(user_description)
        df_users.loc[user_id, 'is expert'] = 1

# Save the labelled users:
df_users.to_csv("MedHelp-users-class.csv")


Rachel  Thompson, Ph.D., BCBA
Myrna  Libby, Ph.D., BCBA
Jason C Bourret, Ph.D., BCBA-D
Tali  Shenfield, PhD
Richard B. Graff, PhD, BCBA-D
Jessica L Thomason Sassi, Ph.D., BCBA-D
William L. Holcomb, Ph.D., BCBA-D
Eileen  Roscoe, PhD
Rebecca  MacDonald, Ph.D., BCBA
William H Ahearn, Ph.D., BCBA
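
As a side note, the same comma heuristic can be written as one vectorized line instead of the loop above; a minimal sketch (an equivalent labelling, not the code that was actually run):

In [ ]:
# Vectorized comma heuristic: credential suffixes such as "Ph.D., BCBA" contain a comma
df_users['is expert'] = df_users['user description'].str.contains(',').astype(int)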

In [5]:
is_expert = df_users['is expert'] == 1
is_expert.value_counts()


Out[5]:
False    495
True      10
Name: is expert, dtype: int64

Only 10 out of 505 users are experts!
This corresponds to roughly 2% of users.
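
This figure can be double-checked directly from the new column; a quick sanity check (a minimal sketch):

In [ ]:
# Fraction of users labelled as experts: 10/505, i.e. about 2%
print("{:.1f}% of {} users are experts".format(100 * df_users['is expert'].mean(), len(df_users)))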


In [6]:
# Select the user ids where 'is expert' == 1 (experts) and the complement (non-experts)
experts_ids = df_users[df_users['is expert'] == 1].index.values
experts_ids

non_experts_ids = df_users[df_users['is expert'] == 0].index.values

In [7]:
# Select all posts whose 'user id' is in experts_ids
df_experts = df.loc[df['user id'].isin(experts_ids)]
print('Total of posts from expert users {}'.format(len(df_experts)))
print('Total of posts {}'.format(len(df)))
print('Ratio {}'.format(len(df_experts)/len(df)))
del df_experts


Total of posts from expert users 727
Total of posts 1813
Ratio 0.40099282956425814

Experts are only about 2% of users, yet they account for roughly 40% of the posts.

Length of text


In [8]:
# Tokenize the post text
import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')

# Store the number of tokens (words) of each post in a new column
df_text = df['text'].str.lower()
df_token = df_text.apply(tokenizer.tokenize)
df['token length'] = df_token.apply(len)


# Exploration: get the list of tokens from the text of the first post
#for text in df_text.values:
#    ttext = tokenizer.tokenize(text.lower())
#    length_text = len(ttext)
#    break
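
To make the token count concrete, here is the same tokenizer applied to a toy sentence (an invented example, not a post from the dataset):

In [ ]:
# RegexpTokenizer(r'\w+') keeps runs of word characters and drops punctuation
sample = "It is common, e.g., for children and adolescents."
tokens = tokenizer.tokenize(sample.lower())
print(tokens)
print("token length:", len(tokens))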

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.mlab as mlab
from matplotlib import gridspec
from scipy.stats import norm
import numpy as np
from scipy.optimize import curve_fit
from lognormal import lognormal, lognormal_stats, truncated_normal
from scipy.stats import truncnorm
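
The fit helpers are imported from a local lognormal.py that is not shown in this notebook. As a reference, here is a hypothetical sketch of what such a module could contain, consistent with how the functions are called below (a two-parameter log-normal PDF, its mean and variance, and a normal PDF truncated at zero); the actual file may differ:

# lognormal.py -- hypothetical sketch, not the module actually used here
import numpy as np
from scipy.stats import truncnorm

def lognormal(x, mu, sigma):
    # probability density of a log-normal distribution
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))

def lognormal_stats(mu, sigma):
    # mean and variance of the log-normal distribution
    mean = np.exp(mu + sigma ** 2 / 2.0)
    var = (np.exp(sigma ** 2) - 1.0) * np.exp(2.0 * mu + sigma ** 2)
    return mean, var

def truncated_normal(x, mu, sigma):
    # normal density truncated at zero (token counts cannot be negative)
    a = (0.0 - mu) / sigma
    return truncnorm.pdf(x, a, np.inf, loc=mu, scale=sigma)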

In [10]:
plt.rcParams['text.usetex'] = True
plt.rcParams['text.latex.unicode'] = True

plt.rcParams.update({'font.size': 24})

nbins=100


fig = plt.figure()
#fig=plt.figure(figsize=(2,1))
#fig.set_size_inches(6.6,3.3)
gs = gridspec.GridSpec(2, 1)
#plt.subplots_adjust(left=0.1,right=1.0,bottom=0.17,top=0.9)


#plt.suptitle('Text length (words count)')
fig.text(0.04,0.5,'Distribution',va='center',rotation='vertical')


Out[10]:
<matplotlib.text.Text at 0x1198b6748>

In [11]:
# X-axis range and ticks
xmax = 200
x = np.arange(0, xmax, 10)   # tick positions
xx = np.arange(1, xmax, 1)   # grid for plotting the fitted curves

# Panel 1: non-expert posts
ax1 = plt.subplot(gs[0])
ax1.set_xlim([0, xmax])
ax1.set_xticks(x)
ax1.tick_params(labelbottom='off')

# Class 0: histogram of token lengths, fitted with a truncated normal
X = df.loc[df['user id'].isin(non_experts_ids)]['token length'].values
n, bins, patches = plt.hist(X, nbins, normed=1, facecolor='cyan', align='mid')

popt, pcov = curve_fit(truncated_normal, bins[:nbins], n)
c0, = plt.plot(xx, truncated_normal(xx, *popt), color='blue', label='non expert')
plt.legend(handles=[c0], bbox_to_anchor=(0.45, 0.95), loc=2, borderaxespad=0.)

print(popt)
mu = X.mean()
var = X.var()
print("Class 0: Mean,variance: ({},{})".format(mu, var))


# Panel 2: expert posts
ax2 = plt.subplot(gs[1])
ax2.set_xlim([0, xmax])
ax2.set_xticks(x)

# Class 1: histogram of token lengths, fitted with a log-normal
X = df.loc[df['user id'].isin(experts_ids)]['token length'].values
n, bins, patches = plt.hist(X, nbins, normed=1, facecolor='orange', align='mid')
popt, pcov = curve_fit(lognormal, bins[:nbins], n)
c1, = plt.plot(xx, lognormal(xx, *popt), color='red', label='expert')
plt.legend(handles=[c1], bbox_to_anchor=(0.45, 0.95), loc=2, borderaxespad=0.)
print("Class 1: Mean,variance:", lognormal_stats(*popt))

plt.show()


[ 96.97554606  52.03452703]
Class 0: Mean,variance: (176.15745856353593,37789.182389121204)
Class 1: Mean,variance: (159.83585715580054, 14762.002895195637)
# This is a useful example of a truncated Gaussian

fig = plt.figure()

from scipy.stats import truncnorm

def get_truncated_normal(mean=0, sd=1, low=0, upp=10):
    return truncnorm((low - mean) / sd, (upp - mean) / sd, loc=mean, scale=sd)

X1 = get_truncated_normal(mean=0, sd=1, low=1, upp=10)
X2 = get_truncated_normal(mean=5.5, sd=1, low=1, upp=10)
X3 = get_truncated_normal(mean=8, sd=1, low=1, upp=10)

import matplotlib.pyplot as plt
fig, ax = plt.subplots(3, sharex=True)
ax[0].hist(X1.rvs(10000), normed=True)
ax[1].hist(X2.rvs(10000), normed=True)
ax[2].hist(X3.rvs(10000), normed=True)
plt.show()

In [14]:
# Cumulative distribution of expert post lengths: what fraction of posts falls below each cutoff?
X = df.loc[df['user id'].isin(experts_ids)]['token length'].values
total = len(X)

for ix in range(10, 500, 10):
    this_sum = 0
    for xx in X:
        if xx < ix:
            this_sum = this_sum + 1
    percentile = this_sum / total * 100
    print("Value {} percentile {}".format(ix, percentile))


Value 10 percentile 0.1375515818431912
Value 20 percentile 0.5502063273727648
Value 30 percentile 2.200825309491059
Value 40 percentile 3.988995873452544
Value 50 percentile 6.052269601100413
Value 60 percentile 10.178817056396149
Value 70 percentile 15.955983493810177
Value 80 percentile 21.8707015130674
Value 90 percentile 27.51031636863824
Value 100 percentile 32.874828060522695
Value 110 percentile 38.101788170563964
Value 120 percentile 44.42916093535076
Value 130 percentile 49.24346629986245
Value 140 percentile 54.19532324621733
Value 150 percentile 57.49656121045392
Value 160 percentile 61.34800550206327
Value 170 percentile 65.06189821182944
Value 180 percentile 67.95048143053644
Value 190 percentile 70.56396148555709
Value 200 percentile 74.55295735900962
Value 210 percentile 76.75378266850069
Value 220 percentile 78.95460797799174
Value 230 percentile 80.19257221458047
Value 240 percentile 81.56808803301237
Value 250 percentile 82.66850068775791
Value 260 percentile 84.181568088033
Value 270 percentile 85.69463548830811
Value 280 percentile 86.38239339752407
Value 290 percentile 87.89546079779917
Value 300 percentile 88.72077028885832
Value 310 percentile 89.82118294360384
Value 320 percentile 90.37138927097662
Value 330 percentile 91.74690508940853
Value 340 percentile 92.70976616231087
Value 350 percentile 92.98486932599724
Value 360 percentile 93.8101788170564
Value 370 percentile 94.08528198074278
Value 380 percentile 94.63548830811554
Value 390 percentile 95.04814305364512
Value 400 percentile 95.3232462173315
Value 410 percentile 95.59834938101788
Value 420 percentile 95.87345254470426
Value 430 percentile 95.87345254470426
Value 440 percentile 96.01100412654745
Value 450 percentile 96.28610729023383
Value 460 percentile 96.56121045392022
Value 470 percentile 96.56121045392022
Value 480 percentile 96.56121045392022
Value 490 percentile 96.69876203576341
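
The same cumulative percentages can be computed without the explicit double loop; a minimal NumPy sketch using the X array defined above:

In [ ]:
# Vectorized equivalent: fraction of expert posts shorter than each cutoff
for cutoff in np.arange(10, 500, 10):
    print("Value {} percentile {}".format(cutoff, (X < cutoff).mean() * 100))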
