The objective of this session is to introduce the basics of working in our computing environment using the UNIX command line and the Jupyter notebook. In the second part we will use a couple of UNIX commands for our first data analysis.
Year | Milestone |
---|---|
1969 | Ken Thompson, Dennis Ritchie and others start working on the "little-used PDP-7 in a corner" at Bell Labs |
1971 | UNIX 1st edition |
1989 | NeXT Computer was launched by Steve Jobs with the NeXTSTEP operating system |
1991 | Initial release of Linux, a Unix-like operating system assembled under the model of free and open-source software development, by Linus Torvalds |
2001 | Apple launches Mac OS X, its first UNIX-based operating system |
Originally, computer manufacturers shipped their own proprietary operating systems. Many companies created their own flavor of UNIX, licensing parts of the software from other vendors and adding their own pieces. This resulted in a huge number of dialects.
The game changed with Linux, a UNIX-like operating system that was written from scratch: the implementation mimics the functionality of other UNIX versions but does not use any source code from those flavors.
The following commands will be helpful in the UNIX environment. Most commands have flags that can be used to modify their behavior. You can get detailed documentation by typing

$ man <command name>

at the command prompt (note: "$" is used to denote the prompt, do not type it).
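For example, to read the manual page for ls (press "q" to quit the pager):

$ man ls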
cd: change directory.
$ cd ~/homework_files/
will change the working directory to ~/homework_files/. Note: "~" is a shorthand for your home directory.

ls: list items in a directory.
$ ls -1
will list the files in single-column format.
$ ls -lh
will list the files in long format, with file sizes in human-readable format.
$ ls -a
will list all files, including hidden files (those starting with ".").

rm: remove file(s), e.g.
$ rm file.txt
removes file.txt;
$ rm file*.txt
(with wildcard) removes all files that begin with "file" and end with ".txt". Note: rmdir can be used to remove (empty) directories.

mv: move or rename a file.
$ mv file.txt newname.txt
will rename file.txt to newname.txt.
$ mv file.txt ~/newdir/
will move file.txt to the directory newdir.

cp: copy a file.
$ cp file.txt copy.txt
will copy file.txt to copy.txt.

mkdir: create a new directory.
$ mkdir newdir
will create a new directory called newdir.

more and less: display the contents of a (plain text) file, one screen at a time.
$ more file.txt
will print the contents of file.txt to the standard output.

head: show the first few lines of a file (you can change the number of lines using the -n flag).

tail: show the last few lines of a file (you can change the number of lines using the -n flag).

wc: count words in a document. Without any flags, it gives the line, word, and byte counts for a file.
$ wc -l file.txt
will count the lines in the document file.txt.

grep: find lines matching a pattern in a document.
$ grep -n 'list' file.txt
will display the lines of file.txt containing 'list' (the default behavior), with line numbers added by -n.
$ grep -l 'list' *
will return the names of the files in the current directory that contain a match for 'list'.
$ grep -r 'list' .
will search the directory '.' recursively (i.e., including subdirectories).
$ grep -o 'list' file.txt
will return only the matching part of each line.
$ grep -c 'list' file.txt
will count the number of matching lines. Extended regular expressions can be enabled with the '-E' flag.

sort: return the lines of a text in sorted order.
$ sort -u file.txt
returns only unique lines in the sorted output (no repeats).
$ sort -f
ignores case when sorting (i.e., folds lowercase and uppercase together).

tr: replace (or delete) characters from the standard input, writing to the standard output.
$ tr -d 'a'
will delete all occurrences of 'a' in the input instead of replacing them.
$ tr -s 'a' 'A'
will 'squeeze' any run of adjacent 'a's into a single occurrence and replace it with 'A', so for example 'aaaarrrgh' becomes 'Arrrgh'.
$ tr -s 'ar' 'AR'
will convert 'aaaarrrgh' to 'ARgh'.

wget and curl: non-interactive downloading of files from the Web. You may use them later in the semester to download datasets.

(un)zip, g(un)zip, zcat, and tar: can be used to compress files or to uncompress archives.

exit: end the current shell session and close.
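As a quick, hedged walk-through tying several of these commands together (the directory and file names below are made up for illustration; echo simply prints its arguments and appears again later):

$ mkdir scratch
$ cd scratch
$ echo "to be or not to be" > words.txt
$ cp words.txt copy.txt
$ ls -lh
$ wc words.txt
$ grep -c 'be' words.txt
$ tr ' ' '\n' < words.txt | sort -u
$ cd ..
$ rm scratch/*.txt
$ rmdir scratch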
First we're going to set up the directory structure for this course and clone the GitHub repository. We work on the server arc.insight.gsu.edu; the course directory is called MSA8010F17 (this is case sensitive!).

$ git clone URL

where URL is the one from the GitHub website.
In [ ]:
%%sh
cd
ls -la
In [ ]:
%%sh
git pull
A pipe '|' connects the standard output (STDOUT) of one command to the standard input (STDIN) of the next; cat, which simply prints the contents of one or more files to STDOUT, is a common first stage of such pipelines. STDIN can also be connected to a file using the '<' symbol, and STDOUT can be redirected to a file with the '>' or '>>' symbols; '>>' appends the new content to an existing file. The standard error stream (STDERR) is redirected with '2>' and '2>>'. The symbol '2>&1' merges STDOUT and STDERR into one stream.

Examples:
ls -l | more
ls -R > all_my_files.txt
cat all_my_files.txt | grep foo > data.dat
sort < data.dat
echo -n "The number of README files in this directory is: "; find . -name "README*" | wc -l
echo -n "There are `find . -name "README*" | wc -l` README files in this directory tree."
In [ ]:
%%sh
echo -n "There are `find .. -name "README*" | wc -l` README files in the parent directory tree."
We can download everything from http://www.gutenberg.org/ebooks/100 (the complete works of William Shakespeare) in plain text format.
In [ ]:
%%sh
mkdir -p data
cd data
wget http://www.gutenberg.org/cache/epub/100/pg100.txt 2> /dev/null
cd ..
ls -l data
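As a quick check inside the notebook (a minimal sketch; the exact header lines depend on the file version):

In [ ]:
%%sh
head -15 data/pg100.txt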
Let's look at the file in the terminal using the commands more or less. There is a lot of "junk":

tr -s ' '
squashes duplicate characters (blanks in this case).
grep -v -f data/legalnotice.txt < data/pg100.txt
removes the lines that match the patterns collected in data/legalnotice.txt.
grep -v -e '^[[:space:]]*$'
removes blank lines; these are the little things you find on StackOverflow (https://stackoverflow.com/questions/3432555/remove-blank-lines-with-grep) ... usually after some digging.
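To see how much of the file the blank-line pattern alone accounts for (a hedged sketch; the count depends on the file version):

In [ ]:
%%sh
grep -c -e '^[[:space:]]*$' data/pg100.txt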
In [ ]:
%%sh
cat data/pg100.txt | grep hate | head -20
echo
echo -n "The answer is: "
cat data/pg100.txt | grep hate | wc -l
In [ ]:
%%sh
cat data/pg100.txt | tr ' ' '\n' | grep hate | sort | uniq -c | sort -rn | head -20
echo
echo -n "The answer is: "
cat data/pg100.txt | tr ' ' '\n' | grep hate | wc -l
Not bad. But this could be improved...
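One possible improvement (a sketch using the standard grep flags -i and -w): match 'hate' case-insensitively and only as a whole word, so that 'hated' and 'hateful' are excluded while 'Hate' is included:

In [ ]:
%%sh
cat data/pg100.txt | tr ' ' '\n' | grep -iw hate | wc -l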
What are the most frequent words?
In [ ]:
%%sh
grep -v -f data/legalnotice.txt data/pg100.txt \
| tr -d '.,:?"' \
| tr 'A-Z' 'a-z' \
| tr ' ' '\n' \
| tr -s '\n' \
| grep -v -e '^[[:space:]]*$' \
| sort \
| uniq -c \
| sort -rn \
| head -30
http://qwone.com/~jason/20Newsgroups/
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
DO NOT DOWNLOAD THESE FILES ONTO THE CLUSTER!
Instead, use the shared data directory /home/data/20_newsgroup/
In [ ]:
%%sh
DATADIR=/home/data/20_newsgroup/
ls $DATADIR | while read TOPIC; do
echo -n "Topic $TOPIC: number of documents "
ls $DATADIR/$TOPIC | wc -l
done | cat -b
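As a variation on the loop above (a hedged sketch, assuming the same directory layout), the total number of documents across all topics:

In [ ]:
%%sh
DATADIR=/home/data/20_newsgroup/
find $DATADIR -type f | wc -l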
Even in the few examples above we have used the same sequence of commands over and over again. We should create new commands (or scripts) for those steps.
How to create a script:
1. Create a new (plain text) file, e.g. wordfrequency.sh.
2. Make its first line
#!/bin/bash
so the shell knows which interpreter to run the script with.
3. Make the file executable:
chmod a+x wordfrequency.sh
Our script would look something like this:
#!/bin/bash
tr -d '.,:?"' \
| tr 'A-Z' 'a-z' \
| tr ' ' '\n' \
| grep -v -e '^[[:space:]]*$' \
| sort \
| uniq -c \
| sort -rn
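The script reads from STDIN, so it can sit at the end of any pipeline. For example (a sketch, reusing the files from above):

$ grep -v -f data/legalnotice.txt data/pg100.txt | ./wordfrequency.sh | head -30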