Please rename this file before editing!

Introduction

The objective of this session is to introduce the basics of working in our computing environment, using the UNIX command line and the Jupyter notebook. In the second part we will use a handful of UNIX commands for our first data analysis.

UNIX and UNIX-like systems

Year Milestone
1969 Ken Thompson, Dennis Ritchie, and others started working on the "little-used PDP-7 in a corner" at Bell Labs; their work grew into UNIX
1971 UNIX 1st edition
1989 NeXT Computer was launched by Steve Jobs with the NeXTSTEP operating system
1991 Initial release of Linux, a Unix-like computer operating system assembled under the model of free and open-source software development, by Linus Torvalds
2001 Apple launches Mac OS X, its first UNIX-based operating system

Originally, computer manufacturers shipped their own proprietary operating systems. Many companies created their own flavor of UNIX, licensing parts of the software from other vendors and adding their own pieces. This resulted in a huge number of dialects.

The game changed with Linux, a UNIX-like operating system that was written from scratch. That is, the implementation mimics the functionality of other UNIX versions but does not use any source code from those flavors.

Source: https://en.wikipedia.org/wiki/History_of_Unix

Essential UNIX Commands for File Management

Source: http://homepages.uc.edu/~thomam/Intro_Unix_Text

The following commands will be helpful in the UNIX environment. Most commands have flags that modify their behavior. You can get detailed documentation by typing $ man <command name> at the command prompt (note: "$" denotes the prompt; do not type it).

  • cd: change directory.

    • example: $ cd ~/homework_files/ will change the working directory to ~/homework_files/. Note: "~" is a shorthand for your home directory.
  • ls: list items in directory.

    • $ ls -1 will list the files in single-column format
    • $ ls -lh will list the files in long format, with file sizes in human-readable format
    • $ ls -a will list all files, including hidden files (those starting with ".")
  • rm: remove file(s), e.g.

    • $ rm file.txt removes file.txt;
    • $ rm file*.txt (with wildcard) removes all files that begin with "file" and end with ".txt". Note: rmdir can be used to remove (empty) directories.
  • mv: move or rename file.

    • example: $ mv file.txt newname.txt will rename file.txt to newname.txt.
    • example: $ mv file.txt ~/newdir/ will move file.txt to directory newdir.
  • cp: copy file.

    • example: $ cp file.txt copy.txt will copy file.txt to copy.txt.
  • mkdir: create a new directory.

    • example: $ mkdir newdir will create a new directory called newdir.
  • more and less: display contents of a (plain text) file.

    • example: $ more file.txt will display the contents of file.txt one screenful at a time (press the space bar to advance, q to quit).
  • head: show the first few lines of a file (you can change the number of lines using the -n flag). A short example combining head, tail, and wc follows this list.

  • tail: show the last few lines of a file (you can change the number of lines using the -n flag).

  • wc: count words in a document. Without any additional flags, it gives line, word, and byte counts for a file.

    • $ wc -l file.txt will count the lines in document file.txt.
  • grep: find lines matching a pattern in a file.

    • $ grep -n 'list' file.txt will display the lines containing 'list' (the default behavior); the -n flag prefixes each line with its line number.
    • $ grep -l 'list' * will return the names of the files in the current directory that contain a match for 'list'.
    • $ grep -r 'list' . will search the directory '.' recursively (i.e., including subdirectories).
    • $ grep -o 'list' file.txt will return only the matching part of each line.
    • $ grep -c 'list' file.txt will count the number of matching lines.
    • You can use extended regular expressions with the '-E' flag.
  • sort: sort lines of text.

    • $ sort -u file.txt returns only unique values in the sort (no repeats)
    • $ sort -f ignores case when sorting (i.e., folds lower case into upper case for comparison)
  • tr: replace (or delete) characters from standard input, write to standard output.

    • $ tr -d 'a' will delete all occurrences of 'a' in the input, instead of replacing them.
    • $ tr -s 'a' 'A' will replace each 'a' with 'A' and 'squeeze' any adjacent run into a single occurrence, so for example 'aaaarrrgh' becomes 'Arrrgh'.
    • You can do more than one replacement at a time, for example $ tr -s 'ar' 'AR' will convert 'aaaarrrgh' to 'ARgh'.
  • wget and curl: non-interactive downloading of files from the Web. You may use them later in the semester to download datasets.

  • (un)zip, g(un)zip, zcat, and tar: can be used to compress files or to uncompress archives.

  • exit: end current shell session and close.
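To see a few of these commands in combination, here is a minimal sketch; notes.txt is a placeholder, so substitute any text file from your own directory:

# notes.txt is a placeholder: use any text file you have
head -n 5 notes.txt       # first 5 lines
tail -n 5 notes.txt       # last 5 lines
wc -l notes.txt           # number of lines
grep -c 'the' notes.txt   # number of lines containing 'the'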

Let's try it out: cloning the class repository

First we're going to set up the directory structure for this course and clone the GitHub repository.

  1. Log in via SSH to the server arc.insight.gsu.edu
  2. Check that you are in your home directory, and see what files already exist
  3. Create a new directory named MSA8010F17 (this is case sensitive!)
  4. Navigate into this new directory, i.e. make it your current working directory
  5. Use your web browser to find the class repository on GitHub and follow the instructions on how to clone it (via HTTPS). Copy the corresponding URL to your clipboard.
  6. On the server run the command
    $ git clone URL
    where URL is the address you copied from the GitHub website.
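Putting steps 2 through 6 together, the whole sequence looks roughly like this (URL is the placeholder from above):

cd ~                # step 2: start in your home directory
ls -la              # ...and check what is already there
mkdir MSA8010F17    # step 3: create the course directory (case sensitive!)
cd MSA8010F17       # step 4: make it the working directory
git clone URL       # step 6: URL is the address copied from GitHub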

In [ ]:
%%sh
cd        # with no argument, cd changes to your home directory
ls -la    # long listing of all files, including hidden ones

In [ ]:

Keeping the repository updated

  • Once you have cloned the repository you can download any updates with a single command: git pull
  • Note: your working directory has to be the repository or one of its subdirectories. Navigate into the right directory before executing the command.
  • Warning: Make sure to rename a file before editing it. If you edit a file that is tracked in the repository, your next pull will fail. If that happens, rename or move the offending file and pull again.
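For example, assuming the clone ended up in a directory under ~/MSA8010F17 (the directory name depends on the repository, so the path below is illustrative):

cd ~/MSA8010F17/<repository-name>   # illustrative path: use your clone's actual directory
git pull                            # fetch and merge the latest changes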

In [ ]:

The Concept of Pipes and Redirections

Source: https://en.wikipedia.org/wiki/Pipeline_(Unix)


  • The standard input STDIN can also be connected to a file using the < symbol.
  • The standard output STDOUT can be redirected to a file using the > or >> symbols; >> appends the new content to an existing file instead of overwriting it.
  • The standard error STDERR can be redirected to a file using 2> and 2>>. The symbol 2>&1 redirects STDERR to wherever STDOUT points, merging the two into one stream.

Examples

  • ls -l | more
  • ls -R > all_my_files.txt
  • cat all_my_files.txt | grep foo > data.dat
  • sort < data.dat
  • echo -n "The number of README files in this directory is: "; find . -name "README*" | wc -l
  • echo -n "There are `find . -name "README*" | wc -l` README files in this directory tree."
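The STDERR redirections have no example in the list above, so here is a small sketch; listing.txt, combined.txt, and errors.log are just illustrative file names:

ls -R /etc > listing.txt 2> errors.log   # STDOUT to listing.txt, permission errors to errors.log
ls -R /etc > combined.txt 2>&1           # STDERR merged into STDOUT, both end up in combined.txt
ls /tmp >> listing.txt                   # >> appends to listing.txt instead of overwriting it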

In [ ]:


In [ ]:
%%sh
echo -n "There are `find .. -name "README*" | wc -l` README files in the parent directory tree."

UNIX Commands/Tools for Data Manipulation and Analysis

  • grep
  • sort
  • uniq
  • tr
  • wc
  • cut

Advanced

  • awk
  • sed
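We have not introduced cut, awk, or sed yet; as a rough preview, here are typical one-liners (data.csv and the column choices are made up for illustration):

cut -d',' -f2 data.csv                                 # print the second comma-separated field
awk -F',' '{ sum += $2 } END { print sum }' data.csv   # sum the values in the second column
sed 's/colour/color/g' file.txt                        # replace every 'colour' with 'color'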

The Works of Shakespeare

Let's see what we can do with Shakespeare's collected body of work.

We can download everything from http://www.gutenberg.org/ebooks/100 in plain text format


In [ ]:
%%sh
mkdir -p data    # -p: no error if the directory already exists
cd data
wget http://www.gutenberg.org/cache/epub/100/pg100.txt 2> /dev/null    # discard wget's progress output
cd ..
ls -l data

Let's look at the file in the terminal using the commands more or less

There is a lot of "junk":

  • A lot of other text, such as legal notices, is included.
  • Special characters, and even upper- versus lower-case spellings of the same word, will affect the analysis.

Tricks

  • Use tr -s ' ' to squash duplicate characters (blanks in this case)
  • Filter entire repeating passages by
    1. copy those lines into a file
    2. run grep -v -f data/legalnotice.txt < data/pg100.txt
  • Filter empty/blank lines with grep -v -e '^[[:space:]]*$'. These are the little things you find on StackOverflow (https://stackoverflow.com/questions/3432555/remove-blank-lines-with-grep), usually after some digging.
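For step 1 of the passage filter, one way to capture the repeated lines is with head; the line count below is made up, so inspect the file first (e.g. with less) and adjust. Blank lines are dropped from the pattern file, because an empty pattern would match every line:

# hypothetical line range: find the real one by paging through pg100.txt
head -n 30 data/pg100.txt | grep -v '^[[:space:]]*$' > data/legalnotice.txt
grep -v -f data/legalnotice.txt data/pg100.txt | head

Adding -F makes grep treat the notice lines as fixed strings rather than regular expressions, which is safer if the notice contains characters like '.' or '*'.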

Some questions

  • How often do the terms "love", "hate", "murder", "faith" appear in the text?
  • What are the most frequent words?

In [ ]:
%%sh
cat data/pg100.txt | grep hate | head -20    # first 20 lines containing "hate"
echo
echo -n "The answer is: "
cat data/pg100.txt | grep hate | wc -l    # count all lines containing "hate" (as a substring)

In [ ]:
%%sh
# one word per line, then rank the distinct "words" containing "hate"
cat data/pg100.txt | tr ' ' '\n' | grep hate | sort | uniq -c | sort -rn | head -20
echo
echo -n "The answer is: "
cat data/pg100.txt | tr ' ' '\n' | grep hate | wc -l

Not bad. But this could be improved...

What are the most frequent words?


In [ ]:
%%sh
# strip punctuation, lower-case, split into one word per line, drop blanks, count and rank
grep -v -f data/legalnotice.txt data/pg100.txt \
| tr -d '.,:?"' \
| tr 'A-Z' 'a-z' \
| tr ' ' '\n' \
| tr -s '\n' \
| grep -v -e '^[[:space:]]*$' \
| sort \
| uniq -c \
| sort -rn \
| head -30

News Groups Dataset

http://qwone.com/~jason/20Newsgroups/

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

DO NOT DOWNLOAD THESE FILES ONTO THE CLUSTER!

Instead, use the shared data directory /home/data/20_newsgroup/


In [ ]:
%%sh
DATADIR=/home/data/20_newsgroup/
ls $DATADIR | while read TOPIC; do    # loop over the topic directories
echo -n "Topic $TOPIC: number of documents "
ls $DATADIR/$TOPIC | wc -l    # one file per document
done | cat -b    # cat -b numbers the output lines

Some questions

  • What are the most frequent words in each topic?
  • Is there a certain set of words that is unique to a particular topic?
  • Can we score documents based on how often those topic specific words appear?
  • Is it possible to determine the topic of an unknown document by this score?

Scripts with a Hash-Bang

Even in the few examples above we used the same sequence of commands over and over again. We should turn those steps into new commands (scripts).

How to create a script:

  1. Create a text file with the name of your new "command". We often add an extension like ".sh" or ".py" to indicate which language the script is written in, e.g. wordfrequency.sh
  2. The very first line of the text file must indicate the interpreter that is going to execute the program. In our case #!/bin/bash
  3. The executable permissions need to be set so that we can use the new script just like any other command. This is done with chmod a+x wordfrequency.sh

Our script would look something like this:

#!/bin/bash
# word-frequency pipeline: reads text from STDIN, prints counts in descending order
tr -d '.,:?"' \
| tr 'A-Z' 'a-z' \
| tr ' ' '\n' \
| grep -v -e '^[[:space:]]*$' \
| sort \
| uniq -c \
| sort -rn
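Once the file is saved and made executable, it can replace the tail end of the earlier pipeline, e.g.:

chmod a+x wordfrequency.sh
grep -v -f data/legalnotice.txt data/pg100.txt | ./wordfrequency.sh | head -30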