Change the filename here

The csv file without extension. Its new name will be [filename]-corrected.csv


In [1]:
import re
filename = 'tabula-Actelion_transparency-report-2015'
file = open(filename+'.csv', 'r')
content = file.readlines()
content[:10]


Out[1]:
['"Alexandre Dumus\n',
 'c",LAUSANNE,Switzerland,"Avenue Pierre-De\n',
 'cker 4","",N/A,N/A,560.00,486.00,"","","","1,046.00"\n',
 '"Ana-Luisa Stefans\n',
 'ki",BERN,Switzerland,"Freiburgstrasse 1\n',
 '8","",N/A,N/A,560.00,333.00,"","","",893.00\n',
 'Andrea Azzola,LUGANO,Switzerland,Via Tesserete 46,"",N/A,N/A,280.00,174.50,"","","",454.50\n',
 'Angela Oxenius,ZÜRICH,Switzerland,"Steinwiesstrasse \n',
 '75","",N/A,N/A,206.00,550.33,"","","",756.33\n',
 '"Anna Cavigelli-Bru\n']

Regexp to match the two last digits of each line

Change it if your doc has another format


In [18]:
twodigits = re.compile('\.?\d{2}"?$')
#re.match(twodigits, '14"')
re.match(twodigits, '.14')


Out[18]:
<_sre.SRE_Match object; span=(0, 3), match='.14'>

Function to join the broken lines


In [21]:
c = 0
def join_broken_lines(line_list):
    global c
    full_lines = []
    full_line = ''
    for line in line_list:
        full_line += line[:-1]
        if len(line) > 1:
            if line[-2] == '"' or re.match(twodigits, line[-4:-1]):
#                full_lines.append(str(c)+") " + full_line)
                full_lines.append(str(c)+") " + full_line)
                full_line = ''
                c += 1
    return full_lines

result = join_broken_lines(content)
print(*result[:5], sep="\n") # we print the 5 first lines


0) "Alexandre Dumusc",LAUSANNE,Switzerland,"Avenue Pierre-Decker 4","",N/A,N/A,560.00,486.00,"","","","1,046.00"
1) "Ana-Luisa Stefanski",BERN,Switzerland,"Freiburgstrasse 18","",N/A,N/A,560.00,333.00,"","","",893.00
2) Andrea Azzola,LUGANO,Switzerland,Via Tesserete 46,"",N/A,N/A,280.00,174.50,"","","",454.50
3) Angela Oxenius,ZÜRICH,Switzerland,"Steinwiesstrasse 75","",N/A,N/A,206.00,550.33,"","","",756.33
4) "Anna Cavigelli-Brunner",ZÜRICH,Switzerland,"Steinwiesstrasse 75","",N/A,N/A,206.00,550.33,"","","",756.33

Save to the new file


In [22]:
newcontent = str.join("\n", result)
fp = open(filename + '-corrected.csv', 'w')
fp.write(newcontent)
fp.close()

In [ ]: