So far, we have fed our code tiny snippets of data: a list, a dictionary, or a simple one-line block of data.
As a data scientist, however, you will be reading data from files on your server or on your computer. You will then process the data, run your analyses on it, and save the corresponding output to another file.
In this section, we look at the techniques needed to do that.
There are three modes in which files are accessed:
+ r: read-only mode.
+ w: write to a file. If you are not careful, this will overwrite any existing content.
+ a: append to a file. In simpler words, add to the end of the file.
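To make the difference between the three modes concrete, here is a minimal sketch. It works inside a temporary folder so that nothing on your machine gets overwritten; the file name modes_demo.txt is made up for the illustration.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    f = open(path, "w")   # "w" creates the file (or empties an existing one)
    f.write("first\n")
    f.close()

    f = open(path, "a")   # "a" keeps the existing content and adds to the end
    f.write("second\n")
    f.close()

    f = open(path, "r")   # "r" only allows reading
    after_append = f.read()
    f.close()

    f = open(path, "w")   # reopening with "w" wipes what was there
    f.write("third\n")
    f.close()

    f = open(path, "r")
    after_overwrite = f.read()
    f.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```

Notice that the second "w" silently threw away both earlier lines. That is the overwrite you need to be careful about.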
Commonly Used Commands
+ file.read(): reads the contents of a file in its entirety, as one large string. Again, be careful with this. Don't read a 30 GB file when all you have is a piddly 4 GB machine.
+ file.write(a_string): writes a_string to the file. Writes very often get buffered, so the file on disk may be a few write commands behind. Think of it as a bus that does not leave for its destination until a minimum number of passengers are on board. Python tries to be efficient by combining multiple writes.
+ file.flush(): writes out any buffered writes. Remember, you need to write before you flush.
+ file.close(): closes the open file. Always remember to close a file; otherwise it keeps holding a file handle and memory, which can eventually slow down the system.
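If remembering to close every file sounds error-prone, Python's with statement can do it for you. The sketch below is one possible way to write and read a file without calling close() yourself, and it also reads the file one line at a time instead of loading everything with read(). The folder and the file name with_demo.txt are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

    with open(path, "w") as out_file:
        out_file.write("line one\n")
        out_file.write("line two\n")
    # Leaving the with block flushed the buffered writes and closed the file.

    line_lengths = []
    with open(path, "r") as in_file:
        # Looping over the file object yields one line at a time,
        # so the whole file never has to fit in memory at once.
        for line in in_file:
            line_lengths.append(len(line.rstrip("\n")))

print(out_file.closed, in_file.closed)  # True True
print(line_lengths)                     # [8, 8]
```

The line-by-line loop is the usual way around the "30 GB file on a 4 GB machine" problem.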
In [1]:
f = open("example.txt", "w")
f.write("I refuse to start with a Hello World!\nThis is mutiny!\n")
f.write("Ok, I withdraw my previous statements.")
f.flush()
f.close()
In [2]:
# Remember your Unix lessons?
!cat example.txt
In [3]:
f2 = open("example.txt", "r")
f2_contents = f2.read()
f2.close()
In [5]:
f2_contents
Out[5]:
As you can see, the contents of the file are now assigned to a variable, f2_contents. f2_contents is a string.
In [6]:
type(f2_contents)
Out[6]:
Now that we have read from the file, we need to analyse it. Let's begin!
In [7]:
lines = f2_contents.split("\n")
lines
Out[7]:
In [8]:
len(lines)
Out[8]:
In [9]:
for line in lines:
    # print("Length of line '{}' is {}".format(line, len(line)))
    print(len(line))
And of course, all this applies to numbers too.
In [10]:
f = open("numbers.txt", "w")
for num in range(100):
    f.write(str(num) + '\n')
f.close()
In [11]:
!cat numbers.txt
In [12]:
f = open('numbers.txt', 'r')
f_content = f.read()
f.close()
f_content
Out[12]:
In [13]:
lines = f_content.split("\n")
print(lines)
In [14]:
type(lines)
Out[14]:
In [15]:
lines[0]
Out[15]:
In [16]:
type(lines[0])
Out[16]:
In [17]:
# Let's convert these into integers
integer_list = [int(num) for num in lines]  # Oh no! ValueError: the last element is an empty string
In [18]:
# Skip the empty string left behind by the trailing newline
integer_list = [int(line) for line in lines if len(line) != 0]
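Why did the first attempt blow up? The file ends with a newline, so split("\n") leaves one empty string at the end of the list, and int("") is not a valid conversion. The sketch below shows this on a shortened stand-in for the file contents (the sample_ names are made up for the illustration) and also shows str.splitlines(), which is one alternative that does not produce the trailing empty string.

```python
# A shortened stand-in for the contents of numbers.txt
sample_content = "0\n1\n2\n"

with_split = sample_content.split("\n")
with_splitlines = sample_content.splitlines()

print(with_split)        # ['0', '1', '2', '']
print(with_splitlines)   # ['0', '1', '2']

# No filtering needed here, because there is no empty string to trip over
sample_ints = [int(line) for line in with_splitlines]
print(sample_ints)       # [0, 1, 2]
```

Filtering out empty lines, as done above, is still the safer habit when a file might contain blank lines in the middle.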
In [19]:
integer_list
Out[19]:
In [20]:
integer_list[0]
Out[20]:
In [21]:
type(integer_list)
Out[21]:
In [22]:
type(integer_list[0])
Out[22]:
And finally - time to be a....
In [23]:
!rm example.txt
!rm numbers.txt
All our theory has to lead to something, so here is a hands-on challenge. And just so you're not completely stumped by it, this is a problem you have partially worked on before.
As a data scientist, about 70 to 80% of your job will actually be exploring data sets, cleaning them, and running exploratory tests again on the cleaned data sets. Modeling and predictions are trivial once the heavy lifting has been done, and I'm not exaggerating. Find any data scientist, and ask him or her what the most time-consuming part of their job is.
Let's go to https://fakenumber.org/ and generate a few fake phone numbers. Then let's change them to look like a real-world txt file with text copied from multiple sources: words, names, extension numbers, etc. Make it as difficult as possible for yourself. You have to keep challenging yourself to get to a higher and higher level. Think of it as playing COD/Super Mario/Contra/your favourite game. I have posted the solution a few boxes below, but this is for you to challenge yourself. No one's looking, and you can't cheat yourself!
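The next cell creates the Data/phonenumbers.txt file with the %%file cell magic. The magic has to be the first line of its cell, and the Data folder must already exist.
In [ ]:
%%file Data/phonenumbers.txt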
202-555-0116
202-555-0181 Jill
202-555-0142
202-(555)-0173 Bryan
+1-202-555-0116
+1-202-555-0181
+1-202-555-0142 Raj
+1-202-555-0173
+1-(202)-555-0137 Jonah Lomu
Not really free 800-555-1231x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234
In [ ]:
ls
In [ ]:
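def clean_up(phone_num):
    """Return only the digits found in phone_num, as a string."""
    number = ""
    digits = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}
    for char in phone_num:
        # Keep digits; skip dashes, brackets, names and everything else
        if char in digits:
            number = number + char
    return number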
In [ ]:
f = open("Data/phonenumbers.txt", "r")
content = f.read()
f.close()
In [ ]:
# Let's have a peek at the file
content
In [ ]:
# Time to split the lines
lines = content.split("\n")
In [ ]:
"""
I am only doing this because we know the file has just a few lines.
Otherwise, please don't print the whole file. There are other techniques
to look at snippets of data from within a file, and we will go over
them in later lessons.
"""
lines
In [ ]:
for line in lines:
    print(clean_up(line))
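Once the loop version of clean_up makes sense, the same idea can be written more compactly. The sketch below is one possible alternative, not the only way to do it: clean_up_compact is a made-up name, and the three sample lines are copied from the phone numbers file above.

```python
def clean_up_compact(phone_num):
    """Keep only the digit characters of phone_num, in their original order."""
    return "".join(char for char in phone_num if char in "0123456789")


# A few lines copied from Data/phonenumbers.txt
sample_lines = [
    "202-555-0181 Jill",
    "+1-(202)-555-0137 Jonah Lomu",
    "work 1-(800) 555.1212 #1234",
]

cleaned = [clean_up_compact(line) for line in sample_lines]
print(cleaned)
# ['2025550181', '12025550137', '180055512121234']
```

Look closely at the last result: the extension 1234 got glued onto the phone number. Stripping everything except digits is a good first pass, but a real cleanup would also have to decide what to do with extensions and country codes.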