Breaking a variable into levels

The scenario for this tutorial: you have a series of values for one variable, such as the population density of different cities, and you need to classify them into groups according to that variable, e.g. very high, medium high, medium, medium low and very low population density.

In some cases you already have a GeoDataFrame/DataFrame; in other cases you just have a list that contains the numbers. So the following covers two major functions:

tm.leveling_vector, which takes a dataframe and the name of the column to classify; and
bk.get_levels, which takes a list.

Both functions take a breaking method, such as quantile (default), head_tail_break, natural_break, equal_interval (and manual). The parameter is called break_method in tm.leveling_vector and method in bk.get_levels.

They take the number of groups: break_N in tm.leveling_vector, N in bk.get_levels.

They also accept a list of cuts (break_cuts in tm.leveling_vector).

First, import what is needed.



In [53]:

    
import geopandas as gpd # for reading and manipulating shapefiles
import matplotlib.pyplot as plt # for making figure
import seaborn as sns # for making distplot

from colouringmap import theme_mapping as tm # a function named leveling_vector in tm will be used
from colouringmap import breaking_levels as bk # a function named get_levels in bk will be used

# magic line so that matplotlib figures are shown inline in the jupyter cell
%matplotlib inline

Read a demo file and take a look.



In [3]:

    
grid_res = gpd.read_file('data/community_results.shp')
grid_res.head()









    Out[3]:







  
    
      
   com                                           geometry  node  tweets  usercount        xcor       ycor
0   14  POLYGON ((175239.9457184017 3947195.841823581,...     0       1          1  139.939807  35.640542
1   56  POLYGON ((175239.9457767347 3947695.841815081,...     1       0          0  139.939919  35.645048
2    1  POLYGON ((142239.9457464929 3956695.841823446,...    10      35         21  139.576848  35.731640
3   18  POLYGON ((144239.9457266586 3959695.841818351,...   100      40         32  139.599535  35.758373
4    4  POLYGON ((154239.9457194024 3947195.841822605,...  1000    1898        660  139.707733  35.644166

Take a look at the data distribution, using seaborn's distplot.



In [7]:

    
sns.distplot(grid_res['usercount'], kde=False)









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9081987e90>

The plot above shows that the data potentially follow an exponential distribution, so let's make the y scale logarithmic.



In [58]:

    
ax = sns.distplot(grid_res['usercount'], kde=False)
#ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')

Using different break methods:

quantile
head_tail_break
natural_break
equal_interval

The following is the simplest way of converting a column of a gdf to levels.



In [29]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount')  # break_method defaults to 'quantile'

Normally, the level_list is assigned to the gdf as a new column. This is what I did in the other mapping functions.



In [18]:

    
grid_res['user_level'] = level_list



In [24]:

    
grid_res.head()









    Out[24]:







  
    
      
   com                                           geometry  node  tweets  usercount        xcor       ycor  user_level
0   14  POLYGON ((175239.9457184017 3947195.841823581,...     0       1          1  139.939807  35.640542           0
1   56  POLYGON ((175239.9457767347 3947695.841815081,...     1       0          0  139.939919  35.645048           0
2    1  POLYGON ((142239.9457464929 3956695.841823446,...    10      35         21  139.576848  35.731640           2
3   18  POLYGON ((144239.9457266586 3959695.841818351,...   100      40         32  139.599535  35.758373           2
4    4  POLYGON ((154239.9457194024 3947195.841822605,...  1000    1898        660  139.707733  35.644166           4

cuts contains the breaking values, with the min and max at the two ends of the list.



In [30]:

    
cuts









    Out[30]:





[0.0, 5.0, 14.0, 32.0, 103.0, 4506.0]
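
To make the relationship between cuts and level_list concrete, here is a minimal sketch of turning cuts into levels. It is not the library's code: assign_levels is a hypothetical helper, and the right-closed bins (cuts[i], cuts[i+1]] are an assumption, chosen because they reproduce the user_level values of the five rows shown earlier.

```python
from bisect import bisect_left

def assign_levels(values, cuts):
    """Map each value to a 0-based level, assuming right-closed bins."""
    inner = cuts[1:-1]  # drop the min and max at the two ends
    # bisect_left counts the inner cuts that are strictly below the value
    return [bisect_left(inner, v) for v in values]

cuts = [0.0, 5.0, 14.0, 32.0, 103.0, 4506.0]  # quantile cuts from above
usercount = [1, 0, 21, 32, 660]               # first five rows of the gdf
levels = assign_levels(usercount, cuts)
print(levels)  # [0, 0, 2, 2, 4]
```

A value that sits exactly on a cut (32 here) falls into the lower level under this assumption; the library may treat boundaries differently for other methods.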



In [31]:

    
ax = sns.distplot(grid_res['usercount'], kde=False)
#ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')
for c in cuts:
    ax.axvline(x=c)



In [32]:

    
lev = list(set(level_list))
count = [ level_list.count(l) for l in lev ]
print(lev)
print(count)









    



[0, 1, 2, 3, 4]
[568, 585, 531, 550, 554]

quantile gives a similar count for each level.
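
The even counts follow from where quantile cuts are placed: at equally spaced ranks of the sorted data. A rough sketch of that idea, assuming linear interpolation between neighbouring ranks (the library may interpolate differently); quantile_cuts and the sample values are made up for illustration.

```python
def quantile_cuts(values, n_levels):
    """Return n_levels + 1 cuts at equally spaced ranks of the sorted data."""
    data = sorted(values)
    cuts = []
    for i in range(n_levels + 1):
        pos = (len(data) - 1) * i / float(n_levels)  # fractional rank
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        frac = pos - lo
        # interpolate linearly between the two neighbouring values
        cuts.append(data[lo] + (data[hi] - data[lo]) * frac)
    return cuts

sample = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]  # hypothetical skewed data
q_cuts = quantile_cuts(sample, 5)
print(q_cuts)  # [0.0, 1.0, 3.0, 8.0, 21.0, 55.0]
```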

Let's try some other break methods.



In [33]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='head_tail_break')
print(cuts)









    



[0.0, 111.01004304160689, 483.8207547169811, 1173.1554054054054, 2146.409090909091, 4506.0]



In [34]:

    
ax = sns.distplot(grid_res['usercount'], kde=False)
#ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')
for c in cuts:
    ax.axvline(x=c)



In [35]:

    
lev = list(set(level_list))
count = [ level_list.count(l) for l in lev ]
print(lev)
print(count)









    



[0, 1, 2, 3, 4]
[2258, 382, 104, 28, 16]
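
The counting cell is repeated for every method. As an aside, the same tally can be done in one pass with collections.Counter, whereas level_list.count inside a comprehension rescans the list once per level. A small sketch with made-up levels, assuming level_list is a plain Python list of integers:

```python
from collections import Counter

level_list = [0, 0, 2, 1, 0, 4, 2, 0]  # made-up levels, not from the demo file

counts = Counter(level_list)           # one pass over the list
lev = sorted(counts)                   # levels that actually occur, ascending
count = [counts[l] for l in lev]
print(lev)    # [0, 1, 2, 4]
print(count)  # [4, 1, 2, 1]
```

Levels that never occur (3 in this made-up list) are simply absent from lev.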



In [36]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='natural_break')
print(cuts)









    



[0.0, 183.0, 644.0, 1465.0, 2677.0, 4506.0]



In [37]:

    
ax = sns.distplot(grid_res['usercount'], kde=False)
#ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')
for c in cuts:
    ax.axvline(x=c)



In [38]:

    
lev = list(set(level_list))
count = [ level_list.count(l) for l in lev ]
print(lev)
print(count)









    



[0, 1, 2, 3, 4]
[2445, 236, 78, 19, 10]



In [39]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='equal_interval')
print(cuts)









    



[0.0, 901.2, 1802.4, 2703.6000000000004, 3604.8, 4506.0]



In [40]:

    
ax = sns.distplot(grid_res['usercount'], kde=False)
#ax.set_xscale("log", nonposx='clip')
ax.set_yscale("log", nonposy='clip')
for c in cuts:
    ax.axvline(x=c)



In [41]:

    
lev = list(set(level_list))
count = [ level_list.count(l) for l in lev ]
print(lev)
print(count)









    



[0, 1, 2, 3, 4]
[2713, 53, 12, 3, 7]
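
The lopsided counts are expected: equal_interval only looks at the range of the data, not at how the values are spread within it. A minimal sketch of the idea, with a hypothetical helper equal_interval_cuts and made-up sample values (not the library's implementation):

```python
def equal_interval_cuts(values, n_levels):
    """Split the range [min, max] into n_levels bins of equal width."""
    lo, hi = float(min(values)), float(max(values))
    step = (hi - lo) / n_levels
    return [lo + i * step for i in range(n_levels + 1)]

ei_cuts = equal_interval_cuts([0, 3, 7, 20, 100], 4)
print(ei_cuts)  # [0.0, 25.0, 50.0, 75.0, 100.0]
```

Four of the five sample values land in the first bin, which is the same effect seen in the counts above.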

Specifying the number of levels

The number of levels is set with the parameter break_N, which defaults to 5.

After setting break_N to N, the number of cuts becomes N+1, because the list contains both the smallest and the largest values.



In [43]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='head_tail_break', break_N=3)
print(cuts)









    



[0.0, 111.01004304160689, 483.8207547169811, 4506.0]



In [44]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='head_tail_break', break_N=5)
print(cuts)









    



[0.0, 111.01004304160689, 483.8207547169811, 1173.1554054054054, 2146.409090909091, 4506.0]



In [45]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='head_tail_break', break_N=7)
print(cuts)









    



[0.0, 111.01004304160689, 483.8207547169811, 1173.1554054054054, 2146.409090909091, 3247.6875, 3889.375, 4506.0]



In [46]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='head_tail_break', break_N=9)
print(cuts)









    



[0.0, 111.01004304160689, 483.8207547169811, 1173.1554054054054, 2146.409090909091, 3247.6875, 3889.375, 4475.0, 4506.0, 4506.0]

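Note what head_tail_break does as the number of levels increases: the existing cuts stay the same and further cuts are added in the tail. With break_N=9 the maximum, 4506.0, appears twice.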
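
This behaviour matches the usual description of head/tail breaks: cut at the mean, keep only the values above it (the head), and repeat. The sketch below follows that description; it is an assumption about how the method works, not the library's code, and head_tail_cuts and the sample values are made up.

```python
def head_tail_cuts(values, n_levels):
    """Cut at the mean, then at the mean of the values above it, and so on."""
    data = [float(v) for v in values]
    cuts = [min(data)]
    head = data
    for _ in range(n_levels - 1):
        mean = sum(head) / len(head)
        cuts.append(mean)
        next_head = [v for v in head if v > mean]
        if not next_head:  # nothing left above the mean
            break
        head = next_head
    cuts.append(max(data))
    return cuts

sample = [1, 1, 1, 1, 2, 2, 3, 5, 10, 54]  # hypothetical heavy-tailed data
print(head_tail_cuts(sample, 3))  # [1.0, 8.0, 32.0, 54.0]
print(head_tail_cuts(sample, 5))  # [1.0, 8.0, 32.0, 54.0, 54.0]
```

Asking for more levels only appends means further into the tail, so the earlier cuts never move; once the head shrinks to the largest value, its mean equals the maximum and the last cut is duplicated.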

Specifying cuts manually

There are two ways of using break_cuts. Either way returns a list of cuts and a level_list that has the same length and order as the input vector.

using quantile as the method, where the cuts are fractions between 0 and 1;
using manual as the method, where the cuts are user-defined values.

NOTE: the cut list has to include the minimum and maximum values (0 and 1 when using quantile).
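
A quick sanity check before passing manual cuts can catch a list that does not span the data. The helper below is hypothetical (it is not part of colouringmap) and only encodes the requirement stated in the note, plus the assumption that cuts should be in ascending order.

```python
def check_manual_cuts(values, cuts):
    """Raise ValueError unless cuts are ascending and cover min..max of values."""
    if list(cuts) != sorted(cuts):
        raise ValueError("cuts must be in ascending order")
    if cuts[0] > min(values) or cuts[-1] < max(values):
        raise ValueError("cuts must cover the minimum and maximum values")
    return True

values = [0, 21, 660, 4506]  # made-up subset of usercount values
ok = check_manual_cuts(values, [0.0, 120, 490, 1200, 2200, 4506.0])
print(ok)  # True
```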



In [55]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='quantile', break_cuts=[0.,.25,.5,.75,1.])
print(cuts)









    



[0.0, 0.0, 7.0, 21.0, 70.0, 4506.0]



In [56]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='quantile', break_cuts=[0.,0.1,.5,.99,1.])
print(cuts)









    



[0.0, 0.0, 2.0, 21.0, 1581.7200000000048, 4506.0]



In [57]:

    
level_list, cuts = tm.leveling_vector(grid_res, 'usercount', break_method='manual', break_cuts=[0.0, 120, 490, 1200, 2200, 4506.0])
print(cuts)









    



[0.0, 0.0, 120, 490, 1200, 2200, 4506.0]

Breaking a list instead of a column of a dataframe

Let's say you have a list instead of a dataframe/geodataframe.



In [47]:

    
a_list = grid_res['usercount'].tolist()

To get the break levels for it, another function is provided (the one that tm.leveling_vector calls).



In [49]:

    
level_list, cuts = bk.get_levels(a_list, method='head_tail_break', N=5)



In [50]:

    
print(cuts)









    



[0.0, 183.0, 644.0, 1465.0, 2677.0, 4506.0]



In [52]:

    
len(level_list)==len(a_list)









    Out[52]:





True

The resulting level_list is in the same sequence as the input a_list.
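
Because the order is preserved, values and levels can be paired directly with zip. A short sketch that groups values by level; the two lists are the usercount and user_level values of the five rows shown in the tables earlier, typed in by hand rather than computed here.

```python
from collections import defaultdict

a_list = [1, 0, 21, 32, 660]   # usercount of the five rows shown earlier
level_list = [0, 0, 2, 2, 4]   # matching user_level values

by_level = defaultdict(list)
for value, level in zip(a_list, level_list):
    by_level[level].append(value)  # relies on both lists sharing one order

for level in sorted(by_level):
    print("%d: %s" % (level, by_level[level]))
```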


