©2019 Raazesh Sainudiin. Attribution 4.0 International (CC BY 4.0)
In [1]:
def showURL(url, ht=500):
    """Return an IFrame of the url to show in notebook with height ht"""
    from IPython.display import IFrame
    return IFrame(url, width='95%', height=ht)
showURL('https://en.wikipedia.org/wiki/Bash_(Unix_shell)',400)
Out[1]:
In [2]:
%%sh
pwd
In [8]:
%%sh
# this is a comment in BASH shell as it is preceded by '#'
ls # list the contents of this working directory
In [7]:
%%sh
mkdir mydir
In [11]:
%%sh
cd mydir
pwd
ls -al
In [12]:
%%sh
pwd
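The two cells above suggest that a `cd` made in one `%%sh` cell does not carry over to the next, because each such cell runs in its own shell process. A minimal sketch of the same effect inside a single script, assuming a scratch directory created with `mktemp -d` (the variable names are illustrative only):

```shell
# Sketch: a 'cd' made inside a subshell does not move the parent shell.
start_dir=$(pwd)                  # directory before the subshell runs
demo_dir=$(mktemp -d)             # hypothetical scratch directory for this demo

# The command substitution $(...) runs in a subshell, much like a separate %%sh cell.
inside_dir=$(cd "$demo_dir" && pwd)

after_dir=$(pwd)                  # the parent shell has not moved
echo "before: $start_dir"
echo "inside: $inside_dir"
echo "after:  $after_dir"

rmdir "$demo_dir"                 # remove the empty scratch directory
```

This is why later cells either repeat `cd mydir` or refer to files by a path such as `mydir/...`.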
man-ning the unknown command
By evaluating the next cell, you are using the manual pages to find out more about the command ls. You can learn more about any command called command by typing man command in the BASH shell.
The output of the next cell with command man ls will look something like the following:
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
fied.
Mandatory arguments to long options are mandatory for short options
too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
...
...
...
Exit status:
0 if OK,
1 if minor problems (e.g., cannot access subdirectory),
2 if serious trouble (e.g., cannot access command-line argument).
AUTHOR
Written by Richard M. Stallman and David MacKenzie.
REPORTING BUGS
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report ls translation bugs to <http://translationproject.org/team/>
COPYRIGHT
Copyright © 2017 Free Software Foundation, Inc. License GPLv3+: GNU
GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
Full documentation at: <http://www.gnu.org/software/coreutils/ls>
or available locally via: info '(coreutils) ls invocation'
GNU coreutils 8.28 January 2018 LS(1)
In [ ]:
%%sh
man ls
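The `-a` and `-A` options described in the manual page can be tried on a throwaway directory. The following is a small sketch, assuming a scratch directory from `mktemp -d` and two made-up file names (`visible.txt` and `.hidden`):

```shell
# Sketch: compare plain ls with the -A and -a options from the manual page.
demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/visible.txt" "$demo_dir/.hidden"

plain=$(ls "$demo_dir")                     # entries starting with '.' are ignored
almost_all=$(ls -A "$demo_dir")             # adds .hidden, but not . and ..
all=$(ls -a "$demo_dir")                    # adds .hidden as well as . and ..

echo "ls:";    echo "$plain"
echo "ls -A:"; echo "$almost_all"
echo "ls -a:"; echo "$all"

rm -rf "$demo_dir"                          # clean up the scratch directory
```

Reading a manual page and then testing one or two options on a small directory like this is a quick way to confirm what an option does before using it on real data.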
In [6]:
%%sh
cd mydir
curl -O http://lamastex.org/datasets/public/SOU/sou/20170228.txt
In [14]:
%%sh
ls mydir/
In [8]:
%%sh
cd mydir/
head 20170228.txt
In [1]:
%%sh
mkdir -p mydir # first create a directory called 'mydir'
cd mydir # change into this mydir directory
rm -f sou.tar.gz # remove any file in mydir called sou.tar.gz
curl -O http://lamastex.org/datasets/public/SOU/sou.tar.gz
In [2]:
%%sh
pwd
ls -lh mydir
In [ ]:
%%sh
cd mydir
tar zxvf sou.tar.gz
After running the above two cells, you should have all the SOU (State of the Union) addresses. By evaluating the next cell's ls ... command, you should see the SOU files like the following:
total 11M
-rw------- 1 raazesh raazesh 6.6K Feb 18 2016 17900108.txt
-rw------- 1 raazesh raazesh 8.3K Feb 18 2016 17901208.txt
-rw------- 1 raazesh raazesh 14K Feb 18 2016 17911025.txt
...
...
...
-rw------- 1 raazesh raazesh 39K Feb 18 2016 20140128.txt
-rw------- 1 raazesh raazesh 38K Feb 18 2016 20150120.txt
-rw------- 1 raazesh raazesh 31K Feb 18 2016 20160112.txt
In [ ]:
%%sh
ls -lh mydir/sou
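For readers unfamiliar with the `tar` options used earlier to unpack the archive, the following sketch builds and unpacks a toy archive. The directory and file names (`toy`, `one.txt`, `two.txt`) are made up for illustration and are not part of the SOU data:

```shell
# Sketch: create, list and extract a small gzip-compressed tar archive.
work_dir=$(mktemp -d)                        # hypothetical scratch directory
mkdir "$work_dir/toy"
echo "first file"  > "$work_dir/toy/one.txt"
echo "second file" > "$work_dir/toy/two.txt"

# c = create, z = compress with gzip, f = name of the archive file;
# -C makes tar change into work_dir before adding 'toy'.
tar -czf "$work_dir/toy.tar.gz" -C "$work_dir" toy

# t = list the contents without extracting anything
listing=$(tar -tzf "$work_dir/toy.tar.gz")
echo "$listing"

# x = extract; adding v (as in 'tar zxvf') would print each file name
mkdir "$work_dir/unpacked"
tar -xzf "$work_dir/toy.tar.gz" -C "$work_dir/unpacked"
extracted=$(cat "$work_dir/unpacked/toy/one.txt")
echo "$extracted"

rm -rf "$work_dir"                           # clean up the scratch directory
```

Listing an archive with `t` before extracting it is a cheap way to check where the files will land.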
In [5]:
%%sh
head mydir/sou/17900108.txt
In [1]:
%%sh
head mydir/sou/20160112.txt
An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:
Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.
You will be able to carry out such analyses and/or critically reflect on the mathematical statistical assumptions made in such analyses, as you learn more during your programme of study after successfully completing this course.
How was the sou.tgz file created?
If you are curious, read: http://lamastex.org/datasets/public/SOU/README.md.
Briefly, this is how a website with SOU was scraped by Paul Brouwers and adapted by Raazesh Sainudiin. A data scientist, and more generally a researcher interested in making statistical inference from data that is readily available online in a particular format, is expected to be comfortable with such web-scraping tasks (which can be done in more graceful and robust ways using specialised Python libraries). Such tasks, also known as Extract-Load-Transform (ELT) operations, are often time-consuming and expensive, and are the necessary first step towards statistical inference.
The code below is mainly there to show how the text content of each state of the union address was scraped from the following URL:
Such data acquisition tasks are usually the first and crucial step in a data scientist's workflow.
We have done this and put the data in the distributed file system for easy loading into our notebooks for further analysis. This keeps us from having to install unix programs like lynx, sed, etc. that are needed in the shell script below.
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html | grep texts | grep -v index | sed 's/.*http/http/') ; do lynx --dump $i | tail -n+13 | head -n-14 | sed 's/^\s\+//' | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt ; done
Or in a more atomic form:
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html \
           | grep texts \
           | grep -v index \
           | sed 's/.*http/http/')
do
  lynx --dump $i \
    | tail -n+13 \
    | head -n-14 \
    | sed 's/^\s\+//' \
    | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' \
    > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt
done
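Two of the steps in the script can be tried without lynx or network access: filtering the link list and deriving the output file name from a URL. The sketch below uses two made-up lines in the style of a numbered link listing. Their exact form is an assumption, not captured lynx output:

```shell
# Sketch: the link filter and the file-name extraction, run on sample lines.
# The two sample lines are hypothetical stand-ins for 'lynx --dump' output.
links=$(printf '%s\n' \
  '   1. http://stateoftheunion.onetwothree.net/texts/index.html' \
  '   2. http://stateoftheunion.onetwothree.net/texts/17900108.html' \
  | grep texts \
  | grep -v index \
  | sed 's/.*http/http/')          # drop everything before 'http'
echo "$links"

# Keep only the 8-digit date in the URL; the script uses it as the file name.
name=$(echo "$links" | sed 's/.*\([0-9]\{8\}\).*/\1/')
echo "$name.txt"
```

Running pipeline stages one at a time on a few sample lines like this is a practical way to read a dense one-liner.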
Evaluate the following two cells by replacing X with the right command-line option to the wc command in order to find:
the number of lines in data/earthquakes_small.csv and
the number of characters in data/earthquakes_small.csv
Finally, update the following cell by replacing XXX with the right integer answers, respectively, for:
NumberOfLinesIn_earthquakes_small_csv_file and
NumberOfCharactersIn_earthquakes_small_csv_file
Here is a brief synopsis of wc that you would get from running man wc as follows:
%%sh
man wc
WC(1) BSD General Commands Manual WC(1)
NAME
wc -- word, line, character, and byte count
SYNOPSIS
wc [-clmw] [file ...]
DESCRIPTION
The wc utility displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output. A line is defined as a string of characters delimited by a <newline> character. Characters beyond the final <newline> character will not be included in the line count.
A word is defined as a string of characters delimited by white space characters. White space characters are the set of characters for which the iswspace(3) function returns true. If more than one input file is specified, a line of cumulative counts for all the files is displayed on a separate line after the output for the last file.
The following options are available:
-c The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option.
-l The number of lines in each input file is written to the standard output.
-m The number of characters in each input file is written to the standard output. If the current locale does not support multibyte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option.
-w The number of words in each input file is written to the standard output.
When an option is specified, wc only reports the information requested by that option. The order of output always takes the form of line, word, byte, and file name. The default action is equivalent to specifying the -c, -l and -w options.
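Two statements in the synopsis can be checked on a tiny file: the default output order (line, word, byte) and the rule that characters after the final newline do not count towards the number of lines. This is a sketch on a made-up temporary file, not on the earthquakes data:

```shell
# Sketch: default wc output on a two-line text whose last line has no newline.
sample=$(mktemp)                         # hypothetical temporary file
printf 'one two\nthree' > "$sample"      # only one <newline> character is written

# Reading from standard input keeps the file name out of the output,
# so the three counts can be split into positional parameters.
set -- $(wc < "$sample")
lines=$1; words=$2; bytes=$3
echo "lines=$lines words=$words bytes=$bytes"   # prints lines=1 words=3 bytes=13

# With an option, wc reports only the requested count (here: words).
words_only=$(( $(wc -w < "$sample") ))
echo "words_only=$words_only"

rm -f "$sample"                          # clean up the temporary file
```

The text has two visible lines but only one newline, so the line count is 1. This is worth remembering when a file does not end with a newline.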
In [ ]:
%%sh
# replace X in the next line with the right option to find the number of lines
wc -X data/earthquakes_small.csv
In [ ]:
%%sh
# replace X in the next line with the right option to find the number of characters
wc -X data/earthquakes_small.csv
In [11]:
# write your answers below by replacing XXX; don't modify anything else!
NumberOfLinesIn_earthquakes_small_csv_file = XXX
NumberOfCharactersIn_earthquakes_small_csv_file = XXX
In [12]:
# Evaluate this cell locally to make sure you have the answer as a non-negative integer
try:
    assert(NumberOfLinesIn_earthquakes_small_csv_file > -1)
    print("Good! You have 0 or more lines as your answer. Hopefully it is correct!")
except AssertionError:
    print("Try Again. You seem to not have a valid number of lines as your answer.")
try:
    assert(NumberOfCharactersIn_earthquakes_small_csv_file > -1)
    print("Good! You have 0 or more characters as your answer. Hopefully it is correct!")
except AssertionError:
    print("Try Again. You seem to not have a valid number of characters as your answer.")