Pandas 피봇과 그룹 연산

피봇 테이블

피봇 테이블(pivot table)이란 데이터 열(column) 중에서 두 개를 키(key)로 사용하여 데이터를 선택하는 방법을 말한다.

피봇 테이블을 사용하기 위해서는 키가 될 수 있는 두 개의 열(column) 혹은 필드(field)를 선택하여 이 두 열을

행 인덱스 (row index)
열 인덱스 (column index)

로 변경해야 한다.

pivot 메서드를 사용하면 행 인덱스, 열 인덱스, 자료가 될 3가지의 열(column)을 지정할 수 있다.



In [1]:

    
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 2.5, 3.0, 2.5, 3.5]
}
df = pd.DataFrame(data, columns=["state", "year", "pop"])
df



In [2]:

    
df.pivot("state", "year", "pop")

행 인덱스와, 열 인덱스가 될 자료는 키(key)의 역할을 해야 한다. 즉, 이 값으로 데이터가 유일하게(unique) 결정되어야 한다.



In [5]:

    
df.pivot("year", "pop", "state")









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-749f5b6847f3> in <module>()
----> 1 df.pivot("year", "pop", "state")

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in pivot(self, index, columns, values)
   3844         """
   3845         from pandas.core.reshape import pivot
-> 3846         return pivot(self, index=index, columns=columns, values=values)
   3847 
   3848     def stack(self, level=-1, dropna=True):

C:\Anaconda3\lib\site-packages\pandas\core\reshape.py in pivot(self, index, columns, values)
    330         indexed = Series(self[values].values,
    331                          index=MultiIndex.from_arrays([index, self[columns]]))
--> 332         return indexed.unstack(columns)
    333 
    334 

C:\Anaconda3\lib\site-packages\pandas\core\series.py in unstack(self, level, fill_value)
   2041         """
   2042         from pandas.core.reshape import unstack
-> 2043         return unstack(self, level, fill_value)
   2044 
   2045     # ----------------------------------------------------------------------

C:\Anaconda3\lib\site-packages\pandas\core\reshape.py in unstack(obj, level, fill_value)
    405     else:
    406         unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407                                fill_value=fill_value)
    408         return unstacker.get_result()
    409 

C:\Anaconda3\lib\site-packages\pandas\core\reshape.py in __init__(self, values, index, level, value_columns, fill_value)
     99 
    100         self._make_sorted_values_labels()
--> 101         self._make_selectors()
    102 
    103     def _make_sorted_values_labels(self):

C:\Anaconda3\lib\site-packages\pandas\core\reshape.py in _make_selectors(self)
    137 
    138         if mask.sum() < len(self.index):
--> 139             raise ValueError('Index contains duplicate entries, '
    140                              'cannot reshape')
    141 

ValueError: Index contains duplicate entries, cannot reshape



In [4]:

    
df.set_index(["state", "year"])



In [5]:

    
df.set_index(["state", "year"]).unstack()

그룹 연산

그룹 연산은 피봇 테이블과 달리 키에 의해서 결정되는 데이터가 복수개가 있어도 괜찮다. 대신 연산을 통해 복수개의 그룹 데이터에 대한 대표값을 정한다. 이를 split-apply-combine 연산이라고도 한다.

split 단계
- 특정 Key 값에 따라 데이터 그룹을 만든다.

apply 단계
- 각각의 그룹에 대해 원하는 연산을 하여 대표값을 생성한다.
  - count(), mean(), median(), min(), max()
  - sum(), prod(), std(), var(), quantile()
  - first(), last()
combine 단계
- 그룹의 Key 값에 대해 원하는 연산의 결과를 Value로 지정한 dict를 생성한다.



In [6]:

    
np.random.seed(0)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df

문제: key1 값에 따른 data1의 평균은?



In [7]:

    
df.data1.groupby(df.key1).mean()









    Out[7]:





key1
a    1.343923
b    1.609816
Name: data1, dtype: float64



In [8]:

    
gs = df.data1.groupby(df.key1)
gs









    Out[8]:





<pandas.core.groupby.SeriesGroupBy object at 0x000000000A4E0940>



In [9]:

    
print("="*50)
for n, g in gs:
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*50)
    print(g)
    print("-"*50)
    print("[mean]:", g.mean())
    print("="*50)









    



==================================================
[key]: a
[group]: <class 'pandas.core.series.Series'>
--------------------------------------------------
0    1.764052
1    0.400157
4    1.867558
Name: data1, dtype: float64
--------------------------------------------------
[mean]: 1.343922514828285
==================================================
[key]: b
[group]: <class 'pandas.core.series.Series'>
--------------------------------------------------
2    0.978738
3    2.240893
Name: data1, dtype: float64
--------------------------------------------------
[mean]: 1.6098155916535986
==================================================



In [10]:

    
gs.mean()









    Out[10]:





key1
a    1.343923
b    1.609816
Name: data1, dtype: float64

문제: 복합 key (key1, key2) 값에 따른 data1의 평균은?



In [11]:

    
means = df.data1.groupby([df.key1, df.key2]).mean()
means









    Out[11]:





key1  key2
a     one     1.815805
      two     0.400157
b     one     0.978738
      two     2.240893
Name: data1, dtype: float64



In [10]:

    
means = df.data1.groupby([df.key1, df.key2]).mean()
means









    Out[10]:





key1  key2
a     one     1.815805
      two     0.400157
b     one     0.978738
      two     2.240893
Name: data1, dtype: float64

groupby 명령의 인수

groupby 명령에서 Key 인수로 입력할 수 있는 값은 다음과 같다.
- 열 또는 열의 리스트
- 행 인덱스
- 사전/함수: Column의 값을 사전에 매핑(mapping)하거나 함수 처리하여 나온 결괏값을 키로 인식



In [15]:

    
np.random.seed(0)
people = pd.DataFrame(np.random.randn(5,5),
                      columns=['a','b','c','d','e'],
                      index=['Joe','Steve','Wes','Jim','Travis'])
people.ix[2:3, ['b', 'c']] = np.nan
people



In [16]:

    
print("="*80)
for n, g in people.groupby(people.index):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)









    



================================================================================
[key]: Jim
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: Joe
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
================================================================================
[key]: Steve
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Steve -0.977278  0.950088 -0.151357 -0.103219  0.410599
================================================================================
[key]: Travis
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Travis -2.55299  0.653619  0.864436 -0.742165  2.269755
================================================================================
[key]: Wes
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a   b   c         d         e
Wes  0.144044 NaN NaN  0.121675  0.443863
================================================================================



In [17]:

    
mapping = {'Joe': 'J', 'Jim': 'J', 'Steve': 'S', 'Wes': 'S', 'Travis': 'S'}
print("="*80)
for n, g in people.groupby(mapping):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)









    



================================================================================
[key]: J
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: S
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a         b         c         d         e
Steve  -0.977278  0.950088 -0.151357 -0.103219  0.410599
Wes     0.144044       NaN       NaN  0.121675  0.443863
Travis -2.552990  0.653619  0.864436 -0.742165  2.269755
================================================================================



In [18]:

    
cap1 = lambda x: x[0].upper()
print("="*80)
for n, g in people.groupby(cap1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)









    



================================================================================
[key]: J
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a         b         c         d         e
Joe  1.764052  0.400157  0.978738  2.240893  1.867558
Jim  0.333674  1.494079 -0.205158  0.313068 -0.854096
================================================================================
[key]: S
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Steve -0.977278  0.950088 -0.151357 -0.103219  0.410599
================================================================================
[key]: T
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
              a         b         c         d         e
Travis -2.55299  0.653619  0.864436 -0.742165  2.269755
================================================================================
[key]: W
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
            a   b   c         d         e
Wes  0.144044 NaN NaN  0.121675  0.443863
================================================================================



In [21]:

    
print("="*80)
for n, g in people.groupby(people.columns, axis=1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)









    



================================================================================
[key]: a
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a
Joe     1.764052
Steve  -0.977278
Wes     0.144044
Jim     0.333674
Travis -2.552990
================================================================================
[key]: b
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               b
Joe     0.400157
Steve   0.950088
Wes          NaN
Jim     1.494079
Travis  0.653619
================================================================================
[key]: c
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               c
Joe     0.978738
Steve  -0.151357
Wes          NaN
Jim    -0.205158
Travis  0.864436
================================================================================
[key]: d
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               d
Joe     2.240893
Steve  -0.103219
Wes     0.121675
Jim     0.313068
Travis -0.742165
================================================================================
[key]: e
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               e
Joe     1.867558
Steve   0.410599
Wes     0.443863
Jim    -0.854096
Travis  2.269755
================================================================================



In [22]:

    
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
for n, g in people.groupby(mapping, axis=1):
    print("[key]:", n)
    print("[group]:", type(g))
    print("-"*80)
    print(g)
    print("="*80)









    



[key]: blue
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               c         d
Joe     0.978738  2.240893
Steve  -0.151357 -0.103219
Wes          NaN  0.121675
Jim    -0.205158  0.313068
Travis  0.864436 -0.742165
================================================================================
[key]: red
[group]: <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------
               a         b         e
Joe     1.764052  0.400157  1.867558
Steve  -0.977278  0.950088  0.410599
Wes     0.144044       NaN  0.443863
Jim     0.333674  1.494079 -0.854096
Travis -2.552990  0.653619  2.269755
================================================================================

특별한 group 별 연산

통계
- describe()
그룹을 대표하는 하나의 값을 계산
- agg(), aggregate()
대표값으로 필드를 교체
- transform()
그룹 전체를 변형하는 계산
- apply()

TIP 데이터 예제



In [23]:

    
%cd /home/dockeruser/data/pydata-book-master









    



[WinError 3] 지정된 경로를 찾을 수 없습니다: '/home/dockeruser/data/pydata-book-master'
C:\Users\Administrator\Documents\@수학자료\160502월_1일차_분석 환경, 소개



In [27]:

    
tips = pd.read_csv('../../pydata-book-master/ch08/tips.csv')



In [28]:

    
tips.head()









    Out[28]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4



In [32]:

    
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.tail()









    Out[32]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_pct
    
  
  
    
      239
      29.03
      5.92
      Male
      No
      Sat
      Dinner
      3
      0.203927
    
    
      240
      27.18
      2.00
      Female
      Yes
      Sat
      Dinner
      2
      0.073584
    
    
      241
      22.67
      2.00
      Male
      Yes
      Sat
      Dinner
      2
      0.088222
    
    
      242
      17.82
      1.75
      Male
      No
      Sat
      Dinner
      2
      0.098204
    
    
      243
      18.78
      3.00
      Female
      No
      Thur
      Dinner
      2
      0.159744



In [33]:

    
tips.describe()

그룹별 통계



In [34]:

    
tips.groupby(["sex", "smoker"])[["tip", "tip_pct"]].describe()

그룹별 연산



In [35]:

    
gs = tips.groupby(["sex", "smoker"])
gs_pct = gs["tip_pct"]



In [36]:

    
gs_pct.mean()









    Out[36]:





sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64



In [37]:

    
gs_pct.agg('mean')









    Out[37]:





sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64



In [38]:

    
def peak_to_peak(arr):
    return arr.max() - arr.min()

gs_pct.agg(['mean', 'std', peak_to_peak])









    Out[38]:






  
    
      
      
      mean
      std
      peak_to_peak
    
    
      sex
      smoker
      
      
      
    
  
  
    
      Female
      No
      0.156921
      0.036421
      0.195876
    
    
      Yes
      0.182150
      0.071595
      0.360233
    
    
      Male
      No
      0.160669
      0.041849
      0.220186
    
    
      Yes
      0.152771
      0.090588
      0.674707



In [39]:

    
gs.agg({'tip_pct': 'mean', 'total_bill': peak_to_peak})

그룹의 값을 대표값으로 대체



In [40]:

    
gs.agg("mean")



In [41]:

    
tips2 = tips.copy()
tips2["tips"] = gs.transform("mean")["tip_pct"]
tips2.tail(15)

그룹 자체를 대체

apply 메소드는 수치값이 아닌 Group을 출력
단순히 대표값을 계산하는 것 뿐 아니라
순서 정렬, 일부 삭제 등 그룹 내의 레코드 자체를 변형하는 것도 가능



In [42]:

    
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]



In [43]:

    
top(tips, n=6)









    Out[43]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_pct
    
  
  
    
      109
      14.31
      4.00
      Female
      Yes
      Sat
      Dinner
      2
      0.279525
    
    
      183
      23.17
      6.50
      Male
      Yes
      Sun
      Dinner
      4
      0.280535
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
      0.291990
    
    
      67
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
      0.325733
    
    
      178
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
      0.416667
    
    
      172
      7.25
      5.15
      Male
      Yes
      Sun
      Dinner
      2
      0.710345



In [44]:

    
tips.groupby('smoker').apply(top)



In [45]:

    
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')



In [46]:

    
f = lambda x: x.describe()
tips.groupby(['smoker']).apply(f)

pivot_table

pivot 명령과 groupby 명령의 중간적 성격
pivot을 수행하지만 데이터가 유니크하게 선택되지 않으면 aggfunc 인수로 정의된 함수를 수행하여 대표값 계산
디폴트 aggfunc 은 평균 계산



In [47]:

    
tips.pivot_table(index=['sex', 'smoker'])



In [48]:

    
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')



In [49]:

    
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)



In [50]:

    
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)



In [51]:

    
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)

	total_bill	tip	size	tip_pct
count	244.000000	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672	0.160803
std	8.902412	1.383638	0.951100	0.061072
min	3.070000	1.000000	1.000000	0.035638
25%	13.347500	2.000000	2.000000	0.129127
50%	17.795000	2.900000	2.000000	0.154770
75%	24.127500	3.562500	3.000000	0.191475
max	50.810000	10.000000	6.000000	0.710345

			tip	tip_pct
sex	smoker
Female	No	count	54.000000	54.000000
		mean	2.773519	0.156921
		std	1.128425	0.036421
		min	1.000000	0.056797
		25%	2.000000	0.139708
		50%	2.680000	0.149691
		75%	3.437500	0.181630
		max	5.200000	0.252672
	Yes	count	33.000000	33.000000
		mean	2.931515	0.182150
		std	1.219916	0.071595
		min	1.000000	0.056433
		25%	2.000000	0.152439
		50%	2.880000	0.173913
		75%	3.500000	0.198216
		max	6.500000	0.416667
Male	No	count	97.000000	97.000000
		mean	3.113402	0.160669
		std	1.489559	0.041849
		min	1.250000	0.071804
		25%	2.000000	0.131810
		50%	2.740000	0.157604
		75%	3.710000	0.186220
		max	9.000000	0.291990
	Yes	count	60.000000	60.000000
		mean	3.051167	0.152771
		std	1.500120	0.090588
		min	1.000000	0.035638
		25%	2.000000	0.101845
		50%	3.000000	0.141015
		75%	3.820000	0.191697
		max	10.000000	0.710345

		total_bill	tip	size	tip_pct
smoker
No	count	151.000000	151.000000	151.000000	151.000000
	mean	19.188278	2.991854	2.668874	0.159328
	std	8.255582	1.377190	1.017984	0.039910
	min	7.250000	1.000000	1.000000	0.056797
	25%	13.325000	2.000000	2.000000	0.136906
	50%	17.590000	2.740000	2.000000	0.155625
	75%	22.755000	3.505000	3.000000	0.185014
	max	48.330000	9.000000	6.000000	0.291990
Yes	count	93.000000	93.000000	93.000000	93.000000
	mean	20.756344	3.008710	2.408602	0.163196
	std	9.832154	1.401468	0.810751	0.085119
	min	3.070000	1.000000	1.000000	0.035638
	25%	13.420000	2.000000	2.000000	0.106771
	50%	17.920000	3.000000	2.000000	0.153846
	75%	26.860000	3.680000	3.000000	0.195059
	max	50.810000	10.000000	5.000000	0.710345

		tip_pct		size
	smoker	No	Yes	No	Yes
sex	day
Female	Fri	0.165296	0.209129	2.500000	2.000000
	Sat	0.147993	0.163817	2.307692	2.200000
	Sun	0.165710	0.237075	3.071429	2.500000
	Thur	0.155971	0.163073	2.480000	2.428571
Male	Fri	0.138005	0.144730	2.000000	2.125000
	Sat	0.162132	0.139067	2.656250	2.629630
	Sun	0.158291	0.173964	2.883721	2.600000
	Thur	0.165706	0.164417	2.500000	2.300000

		tip_pct			size
	smoker	No	Yes	All	No	Yes	All
sex	day
Female	Fri	0.165296	0.209129	0.199388	2.500000	2.000000	2.111111
	Sat	0.147993	0.163817	0.156470	2.307692	2.200000	2.250000
	Sun	0.165710	0.237075	0.181569	3.071429	2.500000	2.944444
	Thur	0.155971	0.163073	0.157525	2.480000	2.428571	2.468750
Male	Fri	0.138005	0.144730	0.143385	2.000000	2.125000	2.100000
	Sat	0.162132	0.139067	0.151577	2.656250	2.629630	2.644068
	Sun	0.158291	0.173964	0.162344	2.883721	2.600000	2.810345
	Thur	0.165706	0.164417	0.165276	2.500000	2.300000	2.433333
All		0.159328	0.163196	0.160803	2.668874	2.408602	2.569672

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	2.5
2	Ohio	2002	3.0
3	Nevada	2001	2.5
4	Nevada	2002	3.5

		pop
state	year
Ohio	2000	1.5
	2001	2.5
	2002	3.0
Nevada	2001	2.5
Nevada	2002	3.5

	data1	data2	key1	key2
0	1.764052	-0.977278	a	one
1	0.400157	0.950088	a	two
2	0.978738	-0.151357	b	one
3	2.240893	-0.103219	b	two
4	1.867558	0.410599	a	one

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	total_bill	tip	sex	smoker	day	time	size	tip_pct
239	29.03	5.92	Male	No	Sat	Dinner	3	0.203927
240	27.18	2.00	Female	Yes	Sat	Dinner	2	0.073584
241	22.67	2.00	Male	Yes	Sat	Dinner	2	0.088222
242	17.82	1.75	Male	No	Sat	Dinner	2	0.098204
243	18.78	3.00	Female	No	Thur	Dinner	2	0.159744

		total_bill	tip_pct
sex	smoker
Female	No	28.58	0.156921
Female	Yes	41.23	0.182150
Male	No	40.82	0.160669
Male	Yes	43.56	0.152771

		total_bill	tip	size	tip_pct
sex	smoker
Female	No	18.105185	2.773519	2.592593	0.156921
Female	Yes	17.977879	2.931515	2.242424	0.182150
Male	No	19.791237	3.113402	2.711340	0.160669
Male	Yes	22.284500	3.051167	2.500000	0.152771

	total_bill	tip	sex	smoker	day	time	size	tip_pct	tips
229	22.12	2.88	Female	Yes	Sat	Dinner	2	0.130199	0.182150
230	24.01	2.00	Male	Yes	Sat	Dinner	4	0.083299	0.152771
231	15.69	3.00	Male	Yes	Sat	Dinner	3	0.191205	0.152771
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990	0.160669
233	10.77	1.47	Male	No	Sat	Dinner	2	0.136490	0.160669
234	15.53	3.00	Male	Yes	Sat	Dinner	2	0.193175	0.152771
235	10.07	1.25	Male	No	Sat	Dinner	2	0.124131	0.160669
236	12.60	1.00	Male	Yes	Sat	Dinner	2	0.079365	0.152771
237	32.83	1.17	Male	Yes	Sat	Dinner	2	0.035638	0.152771
238	35.83	4.67	Female	No	Sat	Dinner	3	0.130338	0.156921
239	29.03	5.92	Male	No	Sat	Dinner	3	0.203927	0.160669
240	27.18	2.00	Female	Yes	Sat	Dinner	2	0.073584	0.182150
241	22.67	2.00	Male	Yes	Sat	Dinner	2	0.088222	0.152771
242	17.82	1.75	Male	No	Sat	Dinner	2	0.098204	0.160669
243	18.78	3.00	Female	No	Thur	Dinner	2	0.159744	0.156921

	total_bill	tip	sex	smoker	day	time	size	tip_pct
109	14.31	4.00	Female	Yes	Sat	Dinner	2	0.279525
183	23.17	6.50	Male	Yes	Sun	Dinner	4	0.280535
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990
67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345

		total_bill	tip	sex	smoker	day	time	size	tip_pct
smoker
No	88	24.71	5.85	Male	No	Thur	Lunch	2	0.236746
	185	20.69	5.00	Male	No	Sun	Dinner	5	0.241663
	51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672
	149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312
	232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990
Yes	109	14.31	4.00	Female	Yes	Sat	Dinner	2	0.279525
	183	23.17	6.50	Male	Yes	Sun	Dinner	4	0.280535
	67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
	178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
	172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345

			total_bill	tip	sex	smoker	day	time	size	tip_pct
smoker	day
No	Fri	94	22.75	3.25	Female	No	Fri	Dinner	2	0.142857
	Sat	212	48.33	9.00	Male	No	Sat	Dinner	4	0.186220
	Sun	156	48.17	5.00	Male	No	Sun	Dinner	6	0.103799
	Thur	142	41.19	5.00	Male	No	Thur	Lunch	5	0.121389
Yes	Fri	95	40.17	4.73	Male	Yes	Fri	Dinner	4	0.117750
	Sat	170	50.81	10.00	Male	Yes	Sat	Dinner	3	0.196812
	Sun	182	45.35	3.50	Male	Yes	Sun	Dinner	3	0.077178
	Thur	197	43.11	5.00	Female	Yes	Thur	Lunch	4	0.115982

	day	Fri	Sat	Sun	Thur	All
sex	smoker
Female	No	2.0	13.0	14.0	25.0	54.0
Female	Yes	7.0	15.0	4.0	7.0	33.0
Male	No	2.0	32.0	43.0	20.0	97.0
Male	Yes	8.0	27.0	15.0	10.0	60.0
All		19.0	87.0	76.0	62.0	244.0