Title: Delete Duplicates In Pandas
Slug: pandas_delete_duplicates
Summary: Delete Duplicates In Pandas
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

Import modules



In [1]:

    
import pandas as pd

Create a dataframe with duplicates



In [2]:

    
# Rows 0 and 1 are exact duplicates; row 2 differs from them only in age
raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df









    Out[2]:

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
1      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70

Identify which observations are duplicates, that is, rows that match an earlier row in every column



In [3]:

    
df.duplicated()









    Out[3]:





0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
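
By default `duplicated()` flags only the later copies, so row 0 stays `False` even though row 1 repeats it. The following is a minimal sketch, rebuilding the same data so it runs on its own, of counting the duplicates and of flagging every member of a duplicated set with `keep=False`. The names `n_dupes` and `all_copies` are illustrative and not part of the original example.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet is self-contained
raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Count the rows that repeat an earlier row in every column
n_dupes = int(df.duplicated().sum())

# keep=False marks every member of a duplicated set, including the first copy
all_copies = df.duplicated(keep=False)

print(n_dupes)              # 1
print(all_copies.tolist())  # [True, True, False, False, False, False]
```

Row 2 is not flagged in either case because its age differs from rows 0 and 1.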

Drop duplicates, keeping the first occurrence of each repeated row



In [4]:

    
df.drop_duplicates()









    Out[4]:

  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
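
`drop_duplicates()` returns a new dataframe and leaves `df` unchanged, and the surviving rows keep their original index labels, which is why label 1 is missing from the output above. The sketch below assumes pandas 1.0 or later, where `drop_duplicates()` accepts `ignore_index`, and shows one way to get a result renumbered from zero. The name `deduped` is illustrative.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet is self-contained
raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Drop exact duplicate rows and renumber the index 0..n-1
# (ignore_index requires pandas >= 1.0)
deduped = df.drop_duplicates(ignore_index=True)

print(len(df), len(deduped))   # 6 5 -> the original dataframe is untouched
print(deduped.index.tolist())  # [0, 1, 2, 3, 4]
```

On older pandas versions, `df.drop_duplicates().reset_index(drop=True)` should give the same result.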

Drop rows with a duplicated value in the first_name column, keeping the last observation in each duplicated set



In [5]:

    
df.drop_duplicates(subset=['first_name'], keep='last')









    Out[5]:

  first_name last_name      age  preTestScore  postTestScore
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
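
The `subset` argument can also list several columns, and `keep=False` discards every member of a duplicated set instead of keeping one. The sketch below illustrates both options on the same data. The names `last_per_name` and `unique_people` are made up for this example.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet is self-contained
raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Compare on first_name only and keep the last row of each set
last_per_name = df.drop_duplicates(subset=['first_name'], keep='last')

# Compare on first_name and last_name together and drop every repeated row
unique_people = df.drop_duplicates(subset=['first_name', 'last_name'], keep=False)

print(last_per_name.index.tolist())  # [2, 3, 4, 5]
print(unique_people.index.tolist())  # [3, 4, 5]
```

Which copy to keep depends on which row you trust. With `keep='last'` the surviving Jason row is the one with age 1111111.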
