This notebook introduces some of the tools we will use for data science: Pandas and Python. Python is a general-purpose programming language, designed in the early 1990s with scripting in mind, but with a view to ensuring that the scripts that were written could be easily converted for use in applications. It is a very popular language for web services, applications and, increasingly, scientific computing.
Whilst it was not originally designed for scientific computing, the fact that it had an interactive shell and a straightforward syntax made it an attractive choice for adding scientific functionality. This functionality began to emerge in the early 2000s and now consists of a family of libraries, sometimes collectively known as scipy, including an array package, numpy. A particular advance was the IPython interactive shell, which initially was a more powerful environment for interoperating with the Python kernel but has since evolved into the IPython notebook, currently being renamed Jupyter: an interactive environment for scientific computing.
The notebook was inspired by the facilities of Mathematica: Wolfram Research's mathematical language for symbolic computing. However, Mathematica never really caught on as a standard tool.
The IPython notebook was renamed as the Jupyter project to reflect its wider applicability: Jupyter stands for Julia, Python and R, each of which can be interacted with through the notebook.
In this session we will go through the basic functionality that the notebook gives us.
Your first question might be: why a notebook, and why not an interactive shell as in MATLAB, R or Octave? In an interactive shell, we type commands and press Enter and the command is executed. These interfaces are derived from terminals, which were originally a way of interoperating with mainframe computers. The earliest terminals were more like typewriters, lacking a screen but 'echoing' the commands that had been sent to the computer onto a piece of paper.

In my first graduate job, I worked for a company called Schlumberger. We used terminals like this on a computer system called the CSU (it stood for Cyber Service Unit ... I think it was developed initially in the 1970s) which interfaced with DEC PDP J11 processors for performing data analysis on oil rigs. There was no printer: the output, known as a well log, was produced on a long roll of film, about 30 centimeters wide and typically several meters long (depending on the length of the oil well we were analyzing), which had to be developed inside the 'logging unit', a blue hut that was placed on the oil rig and contained the CSU. This was in 1994-1996 and I was one of the last Field Engineers to be trained on this data acquisition system.

Jobs involved pulling different toolsets through the well to make measurements. The longest tool combination was known as a 'super-combo'. It consisted of a sonic tool, for formation sonic transit time; a density tool (with a gamma ray source); a porosity tool (with a neutron source); a formation conductivity tool (either an inductive tool or the 'laterolog'); a formation pressure sensor; and a caliper tool (for measuring the diameter of the hole). Each tool was contained in a 'sonde', typically made of aluminium, with variations in design that depended on its role: the inductive tool looked like it was made of black fibreglass, and the sonic tool had a grille of machined holes for emitting and receiving sonic ticks. Data was acquired to large tapes, of the type people typically associate with 1960s recording studios. These tapes occupied large drawers in a big bank of computing equipment (modules for communicating with the tools, processors, a pullout desk for the terminal) that covered the entire back wall of the CSU's blue cabin. All was connected together with the Unibus, and the system had a very impressive start-up noise when the large switch in the top corner was pulled.
In [3]:
from IPython.display import Image
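# Display a photograph of a logging unit: the blue cabin described above.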
Image(url='http://2.bp.blogspot.com/-O6FC-DLQI5w/ThWTeDxv2-I/AAAAAAAAB2E/3C7XrbFekp8/s1600/Logging_unit.jpg')
Out[3]:
We used to run the toolset to the bottom of the well and pull it slowly to the surface (either via our cable, known as the 'wireline', or on the end of the drill pipe for wells where the incline was not steep enough for the wireline).
Data acquisition was known as logging. Logging commands (start logging, stop logging) were entered at the terminal. There was no editing of the command: if you got it wrong, you had to 'escape' the entry and start again. For this reason commands were short, typically three letters, and rather cryptic. It all looked very technical and impressive, and all trainees had to spend three months training on the system, including final tests that simulated a full job logging an artificial well drilled just off the M8 in Livingston (look to the left as you drive from Edinburgh to Glasgow; the three fake rigs are probably still there).
When I went into the 'field', it turned out I was working in the UK, on the oil rigs drilled by BHP (then Hamilton) in Liverpool Bay. The UK industry is dominated by the North Sea, which was at the time considered very technologically advanced. The CSU units in the UK had dispensed with the simple typewriter terminal as the interface, and made use of an early Compaq 386 computer running the Microsoft version of Unix: Minix. This was a boon, as it offered several terminals, I think up to 9, and some processing could be done on the computer. Logs could be 'played back' on the 386 and viewed. Playing back used to mean literally that: interfacing the CSU to the tape and reading back off the tape, rather than via the 'live tools' where information was coming in through the modules. In live logging, data acquisition was restricted by the speed with which the tool could be pulled along the formation (I think less than 30 cm per second). In playback it was restricted by the speed of the tape.
The Compaq Minix system (known in the company as the PCU) was more powerful than the entire rest of the CSU combined. It had more memory and more processing power, but it was mainly acting as a glorified terminal. It had no graphical user interface, but because commands were editable, and there was a command history, using it was far simpler and less time consuming than the old teletype unit.
Therein may lie the reason that the shell became popular: by memorizing a series of short commands, many functions could be called upon. The Unix shell, sh, gained functionality for better recall and editing of commands through evolved versions such as bash, tcsh and zsh.
Early scientific computing languages also made use of command shells to operate the language in an interactive mode. The model for languages like Java and C was a compilation model: code was written, compiled and then run. Interactive languages allow code to be written 'on the fly'. When interacting with the operating system, this makes a lot of sense. Scripting languages were then developed to allow a series of shell commands to be stored and rerun at once. At the same time, similar functionality was required for scientific computing (languages like S and MATLAB). MATLAB was originally written as a simple interface to FORTRAN libraries that would help students learn linear algebra.
Working in a shell for scientific computing allows for far more rapid analysis than compilation-oriented languages. By allowing interaction with the data, statistics can be computed or plots visualized, allowing the next input command to be contingent on the previous one. This allows a closer partnership between the computer and the human. The computer enhances the human's capabilities in areas where they are ordinarily weak. When I first started programming in 1983, on a BBC computer, the vision was that humans would use programming languages to interact with the computer. However, the nature of the interface was rather impersonal. It was a very simple shell that allowed programs to be typed into memory and run. Graphical user interfaces came with the promise of a more friendly interaction, but they don't necessarily allow the human and computer to work together in partnership. The expertise of the user is projected onto the menu design. It's not necessarily easy to find out how to automate repetitive actions. Even if the provision is there through scripting languages, it often necessitates switching interface and writing in a bespoke scripting language.

I worked in an accountancy department in 1990 as a summer job, before heading to university, at Esso Engineering Europe in Leatherhead, where my Dad was one of the senior managers. Most of my time was spent writing budget models in Lotus 1-2-3 (a predecessor of Excel) spreadsheets. Command of the spreadsheet was achieved through 'keystrokes' that accessed the menu system. For example, /re stood for slash (to access the menu), range, erase, causing the spreadsheet to prompt you to highlight a 'range' of cells which were then erased. A macro to automatically apply the same command set was simply written /reA1:A6, which would remove the contents of the cells from A1 to A6 (other commands that stick in my mind are /ws for worksheet; save).
It was very easy to 'record' macros, as the computer simply remembered the keystrokes you'd entered. I wrote a range of macros to automate my job. I'd often record a macro whilst I was working with the data, and then edit it to simplify it. It was a more interactive approach to programming and data handling than could be achieved through BBC Basic. When the firm moved across to Excel, the user interface was more intuitive: you could now simply select a range of cells and press 'delete'. Menu items were standardised across several different applications like Word and, later, PowerPoint. But the scripting process (certainly in the early days) was a lot less effective. It seemed the ease of automation and interaction had been sacrificed for the more intuitive nature of the interface for a novice user, although from my perspective it felt like a step backwards. This movement from computer as a programmed tool to computer as a host for preprogrammed applications targeted at non-technical users continued over many years. But whilst the initial interface became more user friendly, automation became harder. Further, the capabilities of the applications that were developed became compartmentalised. Pooling data across applications had to be achieved through particular libraries (I believe the Microsoft ones were called ActiveX). Interfacing with data became more complicated. The balance between a script based on a series of sequential commands and the intuitive interface provided by a GUI headed towards the GUI. Scripting in Excel became more like programming, even as spreadsheets became more accessible.
Across this time, Computer Science degrees have focussed less on making programming languages more usable and more on
This process has led to a separation between those who can handle complex operations on data (typically stored in a spreadsheet) and those who can't. Spreadsheets are easy to understand and give simple functionality for processing and visualising data. Unfortunately, whilst spreadsheets give a convenient way to operate on a table of data (or even perform fairly complex modelling computations), they are a limited paradigm when it comes to data storage or the processing of large quantities of data.
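As a brief aside, the programmatic route we will take in this course looks quite different. The cell below is a minimal sketch using pandas with a tiny made-up table (the column names and numbers are purely illustrative): the point is that operations on a table become short, repeatable commands rather than menu actions.

In [ ]:
import pandas as pd

# A tiny, purely illustrative table: the kind of thing that might
# otherwise live in a spreadsheet.
data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10.0, 12.5, 9.0, 14.0, 11.5]
})

# A grouped summary in a single, repeatable command.
data.groupby('category')['value'].mean()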
One of the other students at my summer job worked with a database system called Paradox. We used to argue over lunch about which was the better form of software (in those days I was less willing to accept that one tool may be appropriate for one role and another tool for another). Peter Bennett would show me his databases in Paradox, and I would ask him how we might perform a particular calculation on the data, or model the data in a particular way. It seemed you couldn't do that: all you could do was summarise the data in tables or reports. Reports like these (coming from the company mainframe or the expenses system) were often the very things I had trouble automatically processing in my spreadsheets. I'd written many macros to parse the text files output from these reports (these reports came from the company's centralized computer services, based in Florham Park, New Jersey, in a building that was later occupied by AT&T Bell Labs). I needed to process this data in the computer, and when reading the ASCII coded outputs from these mainframes I had to account for many exceptions, like reports that spread over multiple 'pages' (the reports were originally designed to be printed for human reading). Why didn't Paradox have the functionality of a spreadsheet, which would allow me to do the modelling alongside the storage?

It seemed to me that databases were a dying breed, and that the brave new world of the spreadsheet was to take over. This impression was left unaffected by my undergraduate degree in Engineering, where the only programming language we were taught was QuickBasic, the Microsoft version of Basic, which seemed no better suited to data processing than the BBC Basic I'd enjoyed as an 11 year old. While disappointed that we weren't given access to Lotus 1-2-3 (only a freeware clone called 'As-Easy-As'), it was still clear to me that for data processing a spreadsheet held the upper hand. This impression continued as I worked my summers in the accounts department at Esso. I assumed the database must be an anachronism of a bygone era of mainframe computing, soon to be replaced by the new world of spreadsheets and PCs. Of course I was very much mistaken. Firstly, because I didn't understand the importance of throughput, which would make mainframes very difficult to replace. For example, large companies' payroll systems are still run on mainframes today. They are reliable and can handle massive throughput. Underpinning these systems are still traditional databases, although they have much more powerful underlying engines than the Paradox database Peter Bennett showed me.
Of course, as with many things, we may both have been right according to our perspectives. Peter had a summer job that involved simple processing of (perhaps) a large amount of data. My summer job involved constructing budget models, enabling my boss to predict what the billing rate should be for the next year.
I remember very clearly that back in 1990 Esso Engineering Europe Ltd (EEEL) used to bill their affiliates $147 per hour. The affiliates were refineries across Europe who called in particular expertise to resolve problems with their processes. The billing rate was high because a lot of the expertise was brought in by bringing engineers on assignment from the US. The American engineers cost more because they were paid more whilst on assignment, and the company also covered trips home and schooling at the local 'American High School' in Cobham. Income was based on the number of 'billable hours' for the engineers. These billable hours were reduced by time spent training, in meetings, on vacation or holiday (I think holidays were distinct from vacation due to them being imposed, like Christmas, whereas vacation was taken by the engineer). We called all these categories 'symbol time' because in the database they were represented by symbols ('M' for meetings, 'T' for training, 'S' for sickness, 'H' for holiday and 'V' for vacations). Because EEEL was a subsidiary company of Esso, its objective was to break even. Two big factors affected the billing rate: the costs, and the number of billable hours. We made an assumption that the billable hours would all be used, so the billing rate was (in the end) the total costs divided by the number of billable hours. The spreadsheet model I built allowed Tony, my boss, to manipulate the symbol time and the balance of American and British engineers (there were also Europeans who were on assignment from the affiliates). All this was very nicely handled in the spreadsheet.
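The core of that calculation is simple enough to sketch in a few lines of Python. The numbers in the cell below are hypothetical; only the structure of the model, costs divided by the billable hours left over after symbol time, comes from the description above.

In [ ]:
# Hypothetical numbers, chosen only to illustrate the structure of the model.
total_costs = 10000000.0        # assumed annual cost base in dollars (10 million)
engineers = 50                  # assumed number of engineers
hours_per_engineer = 1800       # assumed working hours per engineer per year
symbol_time_fraction = 0.25     # assumed fraction of hours lost to 'symbol time'

# Billing rate: total costs divided by the remaining billable hours.
billable_hours = engineers * hours_per_engineer * (1 - symbol_time_fraction)
billing_rate = total_costs / billable_hours
billing_rate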
Of course, Tony and I weren't the first people to do a bit of modelling: mathematicians such as Laplace and Gauss formulated predictions about planetary orbits without the aid of a spreadsheet. In fact, they didn't even have the aid of the type of mathematical notation that we use today to abstract these problems.
A classical method that dates back to those times is least squares.
In [ ]:
# least squares paper
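As a first taste of what such a method looks like in the notebook, the cell below is a minimal sketch of a least squares fit of a straight line using numpy. The data is synthetic, generated purely for illustration.

In [ ]:
import numpy as np

# Synthetic data for illustration: a straight line plus a little noise.
np.random.seed(24)
x = np.linspace(0, 1, 30)
y = 1.5 * x + 0.3 + 0.05 * np.random.randn(30)

# Build the design matrix (slope and intercept columns) and solve the
# least squares problem for the two coefficients.
X = np.vstack([x, np.ones_like(x)]).T
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
w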