This notebook introduces some basic ideas about data, and illustrates them with a number of examples of different types of data.
It does not require any prior background.
So what is data? The term is used in so many ways, it's often hard to pin down what people mean. Here is what Wikipedia says:
Data is uninterpreted information.
This is somewhat helpful, but also a bit cryptic, since we aren't told what it means to interpret information. Indeed, it is often suggested that an act of interpretation is required to go from data to information:
Data are the facts or details from which information is derived. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into context.
Here's a longer passage about 'raw data' from the Wikipedia article on data:
Raw data, i.e. unprocessed data, is a collection of numbers, characters; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.
This is more useful, since it tells us that data can somehow be 'processed' and possibly transformed into something else — we'll see some examples of processing data as we go through this lesson. The Wikipedia article also points out that what counts as data is relative to the context.
Let's try to get a clearer picture by looking at some examples, involving both text and numbers.
Suppose I decide to keep a diary about the food I eat. This could be pretty informal, something a bit like this:
Monday
------
bfast: toast and jam
lunch: tomato soup and roll
supper: baked beans, sushi, treacle tart
Tuesday
-------
bfast: porridge with soya milk
lunch: tomato soup and roll
supper: peri-peri chicken, chips, coke
Despite being informal, it's still good enough to count as data about my diet.
Let's think briefly about what we could do with this data. One possibility is that we could try to identify each of the dishes and categorise them by ingredient, say in terms of grains, pulses, meat, spices and so on. Categorising the data items in this way would be one example of processing data. Earlier on, we talked about data being "transformed into something else". In this example, the "something else" might be an answer to the question: Do I have a balanced diet?
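To make the idea of categorising a little more concrete, here is a minimal sketch in Python. The diary entries are the ones shown above, but the keyword-to-category lookup (`CATEGORIES`) and the `categorise` helper are made up for illustration; a serious attempt would need a much richer list of ingredients.

```python
from collections import Counter

# Diary entries copied from the example above: day -> meal -> list of dishes.
diary = {
    "Monday": {
        "bfast": ["toast and jam"],
        "lunch": ["tomato soup and roll"],
        "supper": ["baked beans", "sushi", "treacle tart"],
    },
    "Tuesday": {
        "bfast": ["porridge with soya milk"],
        "lunch": ["tomato soup and roll"],
        "supper": ["peri-peri chicken", "chips", "coke"],
    },
}

# Hypothetical lookup from a keyword in the dish name to a food category.
CATEGORIES = {
    "toast": "grains",
    "roll": "grains",
    "porridge": "grains",
    "beans": "pulses",
    "soya": "pulses",
    "chicken": "meat",
    "peri-peri": "spices",
}


def categorise(dish):
    """Return the set of categories whose keyword appears in the dish name."""
    return {category for keyword, category in CATEGORIES.items() if keyword in dish}


# Count how often each category turns up across the whole diary.
counts = Counter()
for meals in diary.values():
    for dishes in meals.values():
        for dish in dishes:
            counts.update(categorise(dish))

print(counts)
```

Dishes that match no keyword (such as sushi or treacle tart) are simply left uncategorised here, which is itself a reminder that processing data involves choices.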
The next example of data involves some quantities.
Here's a slightly different kind of diary, recording my running exploits in the first half of December:
5/12/15 4.5km
7/12/15 3.1km
12/12/15 8.6km
So here we have data that combines two types of information: dates and distances. It's important to know that these are different kinds of data elements. For example, we know that we can add together the three distances, to get a total of 16.2km. By contrast, trying to just add the dates together to get a total doesn't make sense (although we could do something more fancy to find out the total number of days covered by the diary).
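Here is a small sketch of that difference in Python, assuming the dates are written day/month/year. Distances are plain numbers that can be summed, whereas the dates have to be parsed into date objects before we can ask how many days the diary spans. The variable names are just for illustration.

```python
from datetime import datetime

# The running diary from above, as (date, distance in km) pairs.
runs = [("5/12/15", 4.5), ("7/12/15", 3.1), ("12/12/15", 8.6)]

# Distances are quantities, so adding them up is meaningful.
# Rounding to one decimal place avoids tiny floating-point artefacts.
total_km = round(sum(km for _, km in runs), 1)

# Dates are parsed as day/month/two-digit year (an assumption about the format).
dates = [datetime.strptime(day, "%d/%m/%y").date() for day, _ in runs]

# Number of days from the first run to the last, counting both end days.
days_covered = (max(dates) - min(dates)).days + 1

print(total_km)
print(days_covered)
```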
Let's continue with another data example that uses numbers.
What about this list of numbers?
23.87
19.85
19.22
28.93
29.41
22.23
23.50
24.95
Who knows? Apart from the fact that the numbers are in a narrow range, it's pretty much impossible to guess what this information is about.
Here are the same numbers, but with more information added:
Year Days of rainfall
-----------------------
2004 23.87
2005 19.85
2006 19.22
2007 28.93
2008 29.41
2009 22.23
2010 23.50
2011 24.95
So now we see that we have got a time series: a sequence of data points measured at different times — in this case, in successive years. The two columns have been given labels which tell us what the time points are, and what kind of quantity has been measured. We could also specify not just when but where the measurements were taken, namely in Edinburgh.
Rainfall data is objective in the sense that it's the result of an observer measuring physical quantities. Ideally, two different observers taking the same measurements would record the same data.
Now that we know more about the data, we can think of ways of processing it. For example, we could:
and so on.
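As a rough sketch of what such processing might look like in Python, the following works out the average value, the years with the highest and lowest values, and the change from each year to the next. The choice of these particular calculations is ours, purely for illustration.

```python
from statistics import mean

# The rainfall table from above: year -> recorded value.
rainfall = {
    2004: 23.87, 2005: 19.85, 2006: 19.22, 2007: 28.93,
    2008: 29.41, 2009: 22.23, 2010: 23.50, 2011: 24.95,
}

# Average over all eight years.
average = mean(rainfall.values())

# Years with the highest and lowest recorded values.
wettest_year = max(rainfall, key=rainfall.get)
driest_year = min(rainfall, key=rainfall.get)

# Change compared with the previous year (the first year has no predecessor).
changes = {
    year: round(rainfall[year] - rainfall[year - 1], 2)
    for year in rainfall
    if year - 1 in rainfall
}

print(round(average, 3), wettest_year, driest_year)
print(changes)
```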
The information which tells us things like dates, location, the kind of quantity, etc. is sometimes called metadata: it's data about data.
Play around with different ways of 'processing' the rainfall data along the lines suggested above.
Find another example of time series data. Find or make up some data points that are part of the series.
Find another list of numbers like the one above which is not time series data. What metadata would have to be present to make sure that someone else understands what the data is about?
We often represent data in the form of rows and columns. That's what we mean when we talk about a data table (or tabular data). So the rainfall data above had two columns and eight rows, plus a header row.
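To see the rows-and-columns idea in code, here is a minimal sketch that treats the rainfall table as comma-separated text and counts its rows and columns. Storing the table as CSV is our assumption here; it is just one common way of writing tabular data down.

```python
import csv
import io

# The rainfall table written out as comma-separated values (an assumed format).
table_text = """Year,Days of rainfall
2004,23.87
2005,19.85
2006,19.22
2007,28.93
2008,29.41
2009,22.23
2010,23.50
2011,24.95
"""

# Read every line of the text into a list of rows; each row is a list of cells.
rows = list(csv.reader(io.StringIO(table_text)))

# The first row holds the column labels; the rest are the data rows.
header, body = rows[0], rows[1:]

print(len(header), "columns")
print(len(body), "data rows")
```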
Public bodies collect lots of data about all manner of things. More and more, they have been making this available as open data to anyone who wants to use it. Most of the time, the data is provided as some kind of table that can be downloaded over the internet. Here's an example of data about Scottish schools which I've already downloaded for you. We're doing a bit of extra magic to make it easy to display the data, but you can ignore this for the time being.
In [6]:
from dds_lab import *
schools_csv = pd.read_csv(schools)
schools_csv.head(10)
Out[6]:
Let's just briefly look through this table. The first column is not in fact part of the dataset, but is just there to help us keep track of which row is which. The second column can be ignored for now, but is a standardised way of giving a unique identifier to each school, whose conventional name can be found in the third column. The fourth and fifth columns contain the geographical coordinates of each school; as we'll see later, this is really helpful since it allows us to plot the locations of the schools on a map. Finally, the sixth column shows us the number of pupils.
In the code cell above, the last line is:
schools_csv.head(10)
This displays just the first 10 rows of the table. If you want to see (say) 20 rows, replace the line with the following and execute the cell:
schools_csv.head(20)
Alternatively, if you want to see the whole table, replace the line with this:
schools_csv
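If you'd like to try `head()` without the schools file to hand, here is a small self-contained sketch using the pandas library. The tiny table in it is entirely made up (the column names `SchoolName` and `Pupils` and all the values are hypothetical stand-ins, not the real dataset).

```python
import io

import pandas as pd

# A made-up stand-in for the schools file; names and numbers are hypothetical.
csv_text = """SchoolName,Pupils
School A,120
School B,340
School C,95
School D,410
"""

# read_csv accepts a file-like object as well as a file name.
demo = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows.
first_two = demo.head(2)

# With no argument, head() returns at most the first 5 rows.
default_head = demo.head()

print(first_two)
print(demo.shape)  # (number of rows, number of columns)
```

Notice that the header row is not counted as a data row, and that pandas adds its own row numbers on the left when it displays the table.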
We briefly mentioned earlier that data resulting from observation and measurement of physical properties is regarded as objective. By contrast, people's views and feelings cannot reliably be identified by just observing them, and we don't have tools for repeatably measuring those views and opinions. Information collected by asking people about their perceptions, thoughts, emotions, values and so on is classed as subjective data.
Of course, subjective data is important, and a lot of effort goes into trying to collect it in a robust and reliable way. One technique involves questionnaires, and we are often requested to fill these in. Within Edinburgh, the Council uses an interview-based questionnaire to carry out an extensive survey of residents:
The Edinburgh People Survey (EPS) is the Council's annual citizen survey, measuring satisfaction with the Council and its services, identifying areas for improvement and gathering information about residents which is not available through other sources or at neighbourhood level.
The survey is undertaken through face-to-face interviews with around 5,000 residents each year, conducted in the street and door-to-door.
After collecting people's opinions, their answers are put into a big database. Below, we show a tiny extract in tabular form from the 2013 survey. Each row in the table corresponds to the responses of one resident, and each column represents the answers to a particular question on the survey.
In [2]:
eps_csv = pd.read_csv(eps_extract)
eps_csv
Out[2]:
Some of the answers shown here are impossible to interpret without knowing what questions were asked, so here are column labels paired with the relevant survey questions:
NEI001: Thinking of your neighbourhood area, by which I mean the area within a 15 minute walk of your home, how satisfied or dissatisfied are you with this area as a place to live?
NEI002: What should be the top priority for improving the quality of life in your neighbourhood?
NEI003: Do you feel that you are able to have a say on things happening or how Council services are run in your local area (neighbourhood or community)?
NEI032: How safe do you feel in your neighbourhood after dark?
NEI040: To what extent are you satisfied or dissatisfied with the way the Council is managing your neighbourhood?
COU001: To what extent are you satisfied or dissatisfied with the way the Council is managing the City?
COU002: Why do you say this?
Since the Meadows/Morningside area is one of the more desirable areas of Edinburgh, and given that Edinburgh is sometimes rated as one of the most livable cities in the UK, it's intriguing how lukewarm about their neighbourhood these respondents were!
Questions of the form "How X ...?" or "To what extent ...?" invite the respondent to give an answer somewhere on a scale. One popular way of framing the responses to such questions uses a Likert scale such as that illustrated here:
Very dissatisfied
Fairly dissatisfied
Neither satisfied nor dissatisfied
Fairly satisfied
Very satisfied
Is it OK to convert the answers on a Likert scale into numbers, where Very dissatisfied is replaced by 1, Fairly dissatisfied is replaced by 2, and so on? Doing so raises further questions, such as:
Is the "distance" between, say, Fairly dissatisfied and Fairly satisfied really the same as the "distance" between Neither satisfied nor dissatisfied and Very satisfied?
Does it make sense to calculate an average "level of satisfaction" by taking the arithmetic mean of the corresponding numbers?
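To help you think about these questions, here is a minimal sketch of what the conversion looks like in Python. The five responses are invented for illustration and are not taken from the survey. The sketch computes both the arithmetic mean, which assumes the steps on the scale are equally spaced, and the median, which only relies on the order of the answers.

```python
from statistics import mean, median

# Map each point on the Likert scale to a number from 1 to 5.
LIKERT = {
    "Very dissatisfied": 1,
    "Fairly dissatisfied": 2,
    "Neither satisfied nor dissatisfied": 3,
    "Fairly satisfied": 4,
    "Very satisfied": 5,
}

# Hypothetical answers from five respondents (not real survey data).
responses = [
    "Very satisfied",
    "Fairly satisfied",
    "Fairly satisfied",
    "Neither satisfied nor dissatisfied",
    "Very dissatisfied",
]

scores = [LIKERT[answer] for answer in responses]

# The mean treats the gaps between neighbouring answers as equal in size.
mean_score = mean(scores)

# The median only uses the ordering of the answers.
median_score = median(scores)

print(scores)
print(mean_score, median_score)
```

The two summaries need not agree, and which one is appropriate depends on how you answered the questions above.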
After you've thought about these questions, have a look at this blog post on Likert scales.
Although we cannot measure emotion in a direct way, observations can provide evidence for emotional states, as illustrated in this picture from Darwin's book The Expression of the Emotions.
On a more food-related note, the following photo provides information about the type of snacks provided for students attending a five-day hackathon in 2013:
In some contexts, we might want to treat information shared via social media as data. For example, we could sample Twitter to see what kinds of things people are currently saying about food. In this example, we'll look briefly at 100 Tweets that were collected from the public Twitter stream, filtered so that they all contain the word "food". If you're interested, we used the NLTK Twitter library to retrieve the Tweets as follows:
import nltk # load up the NLTK library
from nltk.twitter import Twitter
tw = Twitter() # start a new client that connects to Twitter
tw.tweets(keywords='food', to_screen=False, limit=100) #filter Tweets from the public stream
(Warning: you will only be able to re-run this code yourself if you have followed these instructions about obtaining Twitter API keys.)
Now that we've stored the Tweets in a file, we can print the text contents as follows:
In [3]:
from dds_lab import twitter_files
from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(twitter_files, r'.*\.json')  # raw string keeps the backslash in the pattern
for text in reader.strings():
    print(text)
As you can see, this sample of Twitter messages is very varied, and using Twitter to gain useful information about a specific topic often yields unpredictable and strange results. On the other hand, it can also be revealing and quite fun.
As we mentioned at the outset, what gets categorised as "raw data" depends very much on the context. Here's a final example to bring home this point. In this extract from the novel Thinks ... by David Lodge (2001), the narrator is a cognitive scientist who describes an exercise in which he speaks aloud whatever comes into his head into a tape recorder:
The object of the exercise being to try and describe the structure of, or rather to produce a specimen, that is to say raw data, on the basis of which one might begin to try to describe the structure of, or from which one might infer the structure of ... thought.
This quotation is not meant to be taken too seriously. Nevertheless, introspection (observation of one's own thoughts and feelings) has a rich history in psychology.
This notebook has quickly peeked at different kinds of things that might be described as data. We have seen that it comes in lots of forms and is not always easy to interpret. The overview is not intended to be exhaustive, but hopefully it's given you a feel for the kinds of things that you might encounter when using data in different kinds of research.
In trying to answer the question "what is data?", we need to consider the context, since what counts as data depends on people's goals and intentions. In everyday life, data is sometimes a factor in the decisions that we make. But our decisions are usually based on a variety of factors, including emotions, habits, beliefs and evidence.
When evidence comes into play, then we try to extract information from a variety of sources, including our perceptions of the world. For example, I decide how to get to work, and what clothes to wear, based on my expectations of today's weather. I can look out of the window to see rain splattering down, and I can hear the rush of the wind. So in this case, 'sense data' — what I see and hear — helps provide me with information on the basis of which I can make a decision. Of course, the relationship between data and evidence is a complex one, and it needs to be treated more fully in its own right. The point here is that information is the result of us interpreting 'raw data' in such a way that it can be input to our decision-making process.
Typically, we collect data that tells us something about the past. For example, if I happen to have a rain meter installed in my garden, it could tell me how much rain fell overnight. While this doesn't directly give me information about what the weather is likely to be like today, collecting and analysing such data allows us to detect patterns. When we have found such patterns, we are often in a better position to predict future events.
In this notebook, we looked briefly at the following terminology:
If you're not sure what they mean, go back and read the notebook again.