Notebook Name: AppendMicrosoftAIData

Author: Sreejith Menon (smenon8@uic.edu)

General Description:

The Microsoft Image Tagging API generates a bag of words that can be used to describe an image.
Think of them as the words (nouns) you would use to describe the image to a person who cannot see it. Each returned word has an associated confidence for the prediction. Tags with low confidence are ignored. For the purpose of experiment 2, the confidence threshold is hardcoded to 0.5.
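
The thresholding described above can be sketched as follows. This is a minimal illustration and not the notebook's actual helper: the function name `high_confidence_tags` and the record layout (a list of entries with `name` and `confidence` keys per image) are assumptions, and the parsed JSON file may be organized differently. Whether the cutoff is inclusive is also an assumption.

```python
def high_confidence_tags(tag_records, threshold=0.5):
    """Return the tag names whose confidence is at least `threshold`.

    Hypothetical helper: assumes each record looks like
    {"name": <str>, "confidence": <float>} and that the cutoff is inclusive.
    """
    return [rec["name"] for rec in tag_records if rec["confidence"] >= threshold]


# Made-up records for a single image
sample = [
    {"name": "grass", "confidence": 0.98},
    {"name": "zebra", "confidence": 0.91},
    {"name": "fence", "confidence": 0.32},
]

kept = high_confidence_tags(sample)
print(kept)
```

Raising the threshold trades coverage for precision: fewer tags survive, but the ones that do are more likely to describe the image correctly.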

This notebook contains code that takes the API data, which has already been parsed into a JSON file, and joins it with the share proportion results from the Amazon Mechanical Turk albums.
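
The join described above can be illustrated on toy data. The sketch below assumes both tables share a `GID` key column, as the notebook's later cells suggest. The values are made up for illustration.

```python
import pandas as pd

# Toy share-proportion table (made-up values)
share_df = pd.DataFrame(
    {"GID": ["101", "102", "103"], "Proportion": [62.5, 40.0, 12.5]}
)

# Toy tag table; image 102 has no high-confidence tags
tags_df = pd.DataFrame(
    {"GID": ["101", "103"], "tags": [["zebra", "grass"], ["tree"]]}
)

# pd.merge performs an inner join by default, so only GIDs present in
# both tables survive
merged = pd.merge(share_df, tags_df, on="GID")
print(merged)
```

With an inner join, images without any tags drop out of the result. Passing `how="left"` would keep them, with a missing value in the `tags` column.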

The idea is to check whether the occurrence of a certain word influences the share rate in any way.
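
One way to examine this is to pool the share and not-share counts of every image that carries a given tag, then compare the resulting per-tag share proportions. The sketch below uses made-up counts and assumes that the total for an image equals its share count plus its not-share count, which may not hold for the real data if some responses are missing.

```python
from collections import defaultdict

# (share_count, not_share_count, tags) per image; made-up values
images = [
    (6, 2, ["zebra", "grass"]),
    (3, 5, ["zebra", "tree"]),
    (1, 7, ["tree"]),
]

# Accumulate counts per tag across all images carrying that tag
totals = defaultdict(lambda: {"share": 0, "not_share": 0})
for share, not_share, tags in images:
    for tag in tags:
        totals[tag]["share"] += share
        totals[tag]["not_share"] += not_share

# Share proportion per tag, as a percentage
proportions = {
    tag: 100 * c["share"] / (c["share"] + c["not_share"])
    for tag, c in totals.items()
}

# Print tags from highest to lowest share proportion
for tag, pct in sorted(proportions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag}: {pct:.2f}%")
```

A tag whose proportion sits well above or below the others is a candidate for influencing sharing, although pooled counts alone do not establish that the tag is the cause.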



In [21]:

    
import csv
import json
import JobsMapResultsFilesToContainerObjs as ImageMap
import DeriveFinalResultSet as drs
import DataStructsHelper as DS
import importlib
import pandas as pd
import htmltag as HT
from collections import OrderedDict
#import matplotlib.pyplot as plt
import plotly.plotly as py
import cufflinks as cf # this is necessary to link pandas to plotly
cf.go_online()
flName = "../data/All_Zebra_Count_Tag_Output_Results.txt"
pd.set_option('display.max_colwidth', None) # None disables column truncation; newer pandas versions no longer accept -1
imgAlbumDict = ImageMap.genImgAlbumDictFromMap(drs.imgJobMap)
master = ImageMap.createResultDict(1,100)
imgShareNotShareList,noResponse = ImageMap.imgShareCountsPerAlbum(imgAlbumDict,master)
importlib.reload(ImageMap)
importlib.reload(DS)









    Out[21]:





<module 'DataStructsHelper' from '/Users/sreejithmenon/Google Drive/Project/AnimalPhotoBias/script/DataStructsHelper.py'>

Block of code for building a rank list of images in descending order of their share rates, appended with Microsoft Image Tagging API results

The output is a rank list of all the images by their share rates, along with the tags for every image. The actual images can also be displayed alongside the rank list.

Known issue: the '<' and '>' characters in the HTML tags in the URL column are often interpreted as is. Future work: add escape logic for these characters in HTML tags. There are also opportunities to convert some of these code blocks into methods.
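
One possible way to handle the known issue above is the `escape` parameter of pandas' `DataFrame.to_html`. By default it converts '<', '>' and '&' into HTML-safe sequences, so an embedded `<img>` tag shows up as literal text. Passing `escape=False` leaves the cell contents untouched. The sketch below uses a made-up one-row frame and is not the notebook's own fix.

```python
import pandas as pd

# Made-up one-row frame with an HTML image tag stored as a string
df = pd.DataFrame(
    {
        "GID": ["101"],
        "URL": ['<img src="zebra.jpeg" width="350">'],
    }
)

# Default behaviour: '<' and '>' are escaped, so a browser shows the tag as text
escaped = df.to_html(index=False)

# escape=False: the tag is written verbatim and a browser renders the image
raw = df.to_html(index=False, escape=False)

print(escaped)
print(raw)
```

Turning escaping off is only appropriate when every cell value is trusted, since any markup in the data is then emitted as is. Long cell values can also be truncated according to the `display.max_colwidth` option, which is why the notebook lifts that limit in its first cell.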



In [23]:

    
header,rnkFlLst = DS.genlstTupFrmCsv("../FinalResults/rankListImages_expt2.csv")
rnkListDf = pd.DataFrame(rnkFlLst,columns=header)
rnkListDf['Proportion'] = rnkListDf['Proportion'].astype('float')
rnkListDf.sort_values(by="Proportion",ascending=False,inplace=True)

# create an overall giant csv
gidFtrs = ImageMap.genMSAIDataHighConfidenceTags("../data/GZC_data_tagged.json",0.5)
        
gidFtrsLst = DS.cnvrtDictToLstTup(gidFtrs)
df = pd.DataFrame(gidFtrsLst,columns=['GID','tags'])

shrPropsTags = pd.merge(rnkListDf,df,left_on='GID',right_on='GID')

shrPropsTags.to_csv("../FinalResults/resultsExpt2RankList_Tags.csv",index=False)
shrPropsTags['URL'] = '<img src = "https://socialmediabias.blob.core.windows.net/wildlifephotos/All_Zebra_Count_Images/' + shrPropsTags['GID'] + '.jpeg" width = "350">'

shrPropsTags.sort_values(by=['Proportion','GID'],ascending=False,inplace=True)
fullFl = HT.html(HT.body(HT.HTML(shrPropsTags.to_html(bold_rows = False,index=False))))

# Write the rendered rank list to disk; the context manager closes the file
with open("../FinalResults/resultsExpt2RankList_Tags.html","w") as outputFile:
    outputFile.write(fullFl)



In [24]:

    
tgsShrNoShrCount = {}
for lst in rnkFlLst:
    tgs = gidFtrs[lst[0]]
    tmpDict = {'share': int(lst[1]), 'not_share': int(lst[2]), 'total' : int(lst[3])}
    for tag in tgs:
        # Running totals for this tag, starting from zeros the first time it is seen
        oldDict = tgsShrNoShrCount.get(tag,{'share' : 0,'not_share' : 0,'total' : 0})
        oldDict['share'] = oldDict.get('share',0) + tmpDict['share']
        oldDict['not_share'] = oldDict.get('not_share',0) + tmpDict['not_share']
        oldDict['total'] = oldDict.get('total',0) + tmpDict['total']

        tgsShrNoShrCount[tag] = oldDict



In [5]:

    
## Append data into data frames and build visualizations
tgsShrCntDf = pd.DataFrame(tgsShrNoShrCount).transpose()
tgsShrCntDf['proportion'] = tgsShrCntDf['share'] * 100 / tgsShrCntDf['total']
tgsShrCntDf.sort_values(by=['proportion','share'],ascending=False,inplace=True)
tgsShrCntDf = tgsShrCntDf[['share','not_share','total','proportion']]
tgsShrCntDf.to_csv("../FinalResults/RankListTags.csv")

fullFl = HT.html(HT.body(HT.HTML(tgsShrCntDf.to_html(bold_rows = False))))

# Write the tag rank list to disk; the context manager closes the file
with open("../FinalResults/RankListTags.html","w") as outputFile:
    outputFile.write(fullFl)



In [20]:

    
iFrameBlock = []
fig = tgsShrCntDf['proportion'].iplot(kind='line',filename="All_Tags",title="Distribution of Tags")
iFrameBlock.append(fig.embed_code)
#plt.savefig("../FinalResults/RankListTags.png",bbox_inches='tight')

Rank list of images by share rates with Microsoft Image Tagging API output

Generate rank list of tags by share rate.