Getting Data into your notebook.

The following are examples of how to deal with various data sources. Feel free to grab this notebook, or portions, and adapt as needed.

In R, getwd() and setwd() let you see and change your working directory. Python's os module provides the same functionality: os.getcwd() returns the current working directory and os.chdir() changes it.


In [58]:
import os

os.getcwd()


Out[58]:
'/Users/Bryan/Desktop'

In [59]:
os.chdir('/Users/Bryan/Desktop')
os.getcwd()


Out[59]:
'/Users/Bryan/Desktop'

Reading in .csv files.


In [66]:
#Pandas will be your primary method for pulling in .csv files.

import pandas as pd


Twitter = r'tweets_from_bigdata_[2014-10-16_16h14_to_2014-11-25_15h15]_p79350-1416925145.csv'
tweetframe = pd.read_csv(Twitter)

In [67]:
tweetframe


Out[67]:
Tweet ID Date (UTC) From (@username) From (name) Text Geo coordinates
0 522752475133411000 10/16/14 14:14 TanejaGroup Taneja Group TG Blog | @BlueDataInc : #BigData #Analysis Cl... NaN
1 522752480212692000 10/16/14 14:14 jojennings Joanne Jennings RT @UnaBoylan: This is how #BigData could help... NaN
2 522752486579634000 10/16/14 14:14 IEHeather Heather James Free online magazine: Big Data Innovation, Iss... NaN
3 522752489477906000 10/16/14 14:14 SarahCtn Sarah Crétinon RT @orange: Le mobile, la solution aux épidémi... NaN
4 522752522633494000 10/16/14 14:14 strataconf O'Reilly Strata RT @mphnyc: @BobMankoff onstage at #HadoopWorl... NaN
5 522752568640815000 10/16/14 14:14 Space_Plowboy Terry Griffin I will be speaking on #bigdata in #precisiona... NaN
6 522752574953619000 10/16/14 14:15 SAPInMemory SAP HANA .@CSX @ConvergenceCT @CSC are changing the way... NaN
7 522752578682368000 10/16/14 14:15 MobilityWise Accenture Mobility Today at 1230 pm ET, #Accenture is hosting a t... NaN
8 522752611209183000 10/16/14 14:15 IEDeanna Deanna Notice RT @IEHeather: Free online magazine: Big Data ... NaN
9 522752629429260000 10/16/14 14:15 IEHeather Heather James 11 Ways Data Has Changed How We Travel http:/... NaN
10 522752630184239000 10/16/14 14:15 p2pWebMobileIt p2p WebMobileIT Solutions Architect (3 month contract) M/F w/ ... NaN
11 522752632298176000 10/16/14 14:15 bobehayes Bob E. Hayes, PhD Is #BigData squishing our humanity? http://t.c... NaN
12 522752636043657000 10/16/14 14:15 caelenface Caelen Dwane RT @UnaBoylan: This is how #BigData could help... NaN
13 522752640485429000 10/16/14 14:15 IKnowBigData Know Big Data The Free #bigdata #hadoop session is about to ... NaN
14 522752641705586000 10/16/14 14:15 neeraj_malviya Neeraj Malviya RT @KirkDBorne: #MachineLearning #BigData Rese... NaN
15 522752682839515000 10/16/14 14:15 RunningMBA Jennifer Havens Just sat through a great session on agile anal... NaN
16 522752747792531000 10/16/14 14:15 MedicReS MedicReS #LIVE... The Age of #BigData @nytimes , Februa... NaN
17 522752751139573000 10/16/14 14:15 SASCanada SAS Canada #NEWS! Toronto Maple Leafs Partners with @SASC... NaN
18 522752763189821000 10/16/14 14:15 NetFaculty Network Faculty Chief analytics officer: The ultimate #bigdata... NaN
19 522752825521352000 10/16/14 14:16 martingallen Martin Allen Aggregate>Enrich>Analyse #datascience #b... NaN
20 522752872011038000 10/16/14 14:16 KamilIsaev1 Kamil Isaev We had a special guest today\rKarin Breitman, ... NaN
21 522752875278389000 10/16/14 14:16 TungstenBigData TungstenBigData Information Builders Announces Omni-Patient Pr... NaN
22 522752887651586000 10/16/14 14:16 TungstenBigData TungstenBigData Datawatch Discusses the Future of Visualizatio... NaN
23 522752888708534000 10/16/14 14:16 AlsLou Aless Loayza Que tremendo potencial tiene qlik datos y mas... NaN
24 522752896174411000 10/16/14 14:16 TungstenBigData TungstenBigData ScaleOut Software Releases Version 5.2 of Its ... NaN
25 522752900611964000 10/16/14 14:16 andrekearns Andre Kearns The Quiet Rise of the National Geospatial-Inte... NaN
26 522752908350480000 10/16/14 14:16 amasoliverdilme Albert Masoliver Collaborations and correlations in the common ... NaN
27 522752910795358000 10/16/14 14:16 FadilaMM Insight Story Flux Vision: études de marché et croisement de... NaN
28 522752913257795000 10/16/14 14:16 JGMARTINEZOCHOA JGMARTINEZOCHOA#1 RT @couchbase: PayPal uses #BigData to their a... NaN
29 522752925186392000 10/16/14 14:16 DHenschen Doug Henschen Azure #bigdata service goes real time @Microso... NaN
... ... ... ... ... ... ...
390011 537247271511797000 11/25/14 14:11 cuongcz Cuong Alan FICO Improves Its #BigData Score with Trio of ... NaN
390012 537247288020979000 11/25/14 14:11 biconnections BIconnections RT @bjonesnDC: Why Predictive Analytics is Bet... NaN
390013 537247304622026000 11/25/14 14:11 maxidamico Max D' Amico RT @bryantafel: Los teléfonos celulares y la c... NaN
390014 537247310431133000 11/25/14 14:11 esselinj Jack Esselink A Match Made Somewhere: Big Data and the Inter... NaN
390015 537247363807469000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @cuongcz: FICO Improves Its #BigData Score ... NaN
390016 537247364839260000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @jainrasik: "How to weather the big data st... NaN
390017 537247365464207000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @kimdossey: The Hive #BigData Think Tank De... NaN
390018 537247366735089000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @VishalTx: 10 Prerequisites Before Getting ... NaN
390019 537247367414562000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @mikepluta: Top story from @CRN #BigData St... NaN
390020 537247368253419000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @IDGMobility: Goldman Sachs Invests in Big ... NaN
390021 537247368932900000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @IE_BigData: Three Ways To Use Big Data To ... NaN
390022 537247369532694000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @jmtwn: Up your game with Digital Transform... NaN
390023 537247370216767000 11/25/14 14:12 Julian0Bro Julian #BigData is a term just like #Talent in HR &gt... NaN
390024 537247370921410000 11/25/14 14:12 michaelyoungMBN Michael Young How To Weather The Big Data Storm http://t.co/... NaN
390025 537247370946154000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @HP: #BigData is revealing big numbers on h... NaN
390026 537247371575300000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @HITAnalytics: #EHR #DataAnalytics Flag Hid... NaN
390027 537247387862179000 11/25/14 14:12 Hoorge Harjit Dhaliwal RT @MSFTnews: See how @CarnegieMellon is using... NaN
390028 537247400545353000 11/25/14 14:12 accusoftinfoway Accusoft Infoways RT @ade_carr: Data Is Good, 'Bidirectionalized... NaN
390029 537247404550938000 11/25/14 14:12 Greyhound_R Greyhound Research #BigData impacts one and all #Financial #Healt... NaN
390030 537247420389015000 11/25/14 14:12 jwilhelmi John Wilhelmi RT @MSFTnews: See how @CarnegieMellon is using... NaN
390031 537247518464020000 11/25/14 14:12 Greyhound_R Greyhound Research #BigData is a reality and affects customer eng... NaN
390032 537247715458293000 11/25/14 14:13 telecomitaliaTw TelecomItaliaGroup Al #DemoDay14 di @workingcapital @stellaromagn... NaN
390033 537247769376079000 11/25/14 14:13 CarinaJenkins Carina Jenkins RT @LondonInfoInter: Connecting Knowledge Silo... NaN
390034 537247823486390000 11/25/14 14:13 JuniorHendry Allen RT @MSFTnews: See how @CarnegieMellon is using... NaN
390035 537247911374225000 11/25/14 14:14 iotattack IoT Attack #IoT Bitdefender Unveils IoT Security Applianc... NaN
390036 537247915572731000 11/25/14 14:14 iotattack IoT Attack #IoT Rogers Pledges CAD $4M to Spur Growth of ... NaN
390037 537247934346457000 11/25/14 14:14 iotattack IoT Attack IoT & Big Data\rhttps://t.co/1BNCQ5VEVs\r#... NaN
390038 537247940423987000 11/25/14 14:14 Fidanto Antonella #bigdata #programmatico sembra di sentir parla... NaN
390039 537248046401081000 11/25/14 14:14 LogicPD Logic PD RT @GerardoNZ: Machines do analytics; humans d... NaN
390040 537248054319906000 11/25/14 14:14 Greyhound_R Greyhound Research Do not talk to your business about #BigData #I... NaN

390041 rows × 6 columns
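
Notice that the Date (UTC) column came in as plain text. read_csv can parse dates at load time via its parse_dates parameter. Here's a minimal sketch using a small inline sample with hypothetical rows but the same column names as the tweet export:

```python
import io

import pandas as pd

# a tiny inline stand-in for the tweet file (hypothetical rows,
# same "Date (UTC)" column name as the real export)
sample = io.StringIO(
    "Tweet ID,Date (UTC),From (@username)\n"
    "522752475133411000,10/16/14 14:14,TanejaGroup\n"
    "537248054319906000,11/25/14 14:14,Greyhound_R\n"
)

# parse_dates converts the named column to datetime64 on load
df = pd.read_csv(sample, parse_dates=['Date (UTC)'])
print(df['Date (UTC)'].dtype)  # datetime64[ns]
```

With the column parsed, you can filter by date ranges or resample by hour instead of doing string comparisons.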

Downloading data off the web. In this case, we'll use the fixed speed camera data from Baltimore's open-data portal.


In [23]:
import urllib.request

# download the file and save it as Baltimore_Fixed_Speed_Cameras.xls in the
# current working directory (urllib2 was Python 2; Python 3 uses urllib.request)
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xls?accessType=DOWNLOAD'
with urllib.request.urlopen(fileUrl) as f:
    data = f.read()
with open('Baltimore_Fixed_Speed_Cameras.xls', 'wb') as w:
    w.write(data)

# load the Excel file as a pandas DataFrame
baltData = pd.ExcelFile('Baltimore_Fixed_Speed_Cameras.xls')
baltData = baltData.parse('Baltimore Fixed Speed Cameras', index_col=None, na_values=['NA'])
baltData.head()


Out[23]:
address direction street crossStreet intersection Location 1
0 S CATON AVE & BENSON AVE N/B Caton Ave Benson Ave Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
1 S CATON AVE & BENSON AVE S/B Caton Ave Benson Ave Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
2 WILKENS AVE & PINE HEIGHTS AVE E/B Wilkens Ave Pine Heights Wilkens Ave & Pine Heights (39.2720252302, -76.676960806)
3 THE ALAMEDA & E 33RD ST S/B The Alameda 33rd St The Alameda & 33rd St (39.3285013141, -76.5953545714)
4 E 33RD ST & THE ALAMEDA E/B E 33rd The Alameda E 33rd & The Alameda (39.3283410623, -76.5953594625)

Super handy JSON

Again, as an example, we'll pull JSON for the same Baltimore dataset.


In [32]:
import json
import urllib.request

# go and get your data (the JSON endpoint for the same dataset)
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.json?accessType=DOWNLOAD'
with urllib.request.urlopen(fileUrl) as f:
    baltJson = json.load(f)

# the parsed JSON behaves like a nested dictionary
print(baltJson['meta']['view']['id'])
print(baltJson['meta']['view']['name'])
print(baltJson['meta']['view']['attribution'])


dz54-2aru
Baltimore Fixed Speed Cameras
Department of Transportation
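
The JSON payload can also be turned into a DataFrame: the column names live under meta.view.columns and the rows under data. This is a sketch using a small inline dictionary with hypothetical values that mimics that layout (the real feed lists more columns, including some system metadata columns, so check the columns list before relying on positions):

```python
import pandas as pd

# a minimal stand-in for the rows.json structure (hypothetical values;
# the real baltJson has the same meta/data layout)
baltJson = {
    'meta': {'view': {'columns': [{'name': 'address'}, {'name': 'direction'}]}},
    'data': [
        ['S CATON AVE & BENSON AVE', 'N/B'],
        ['WILKENS AVE & PINE HEIGHTS AVE', 'E/B'],
    ],
}

# pull the column names out of the metadata, then build the frame
cols = [c['name'] for c in baltJson['meta']['view']['columns']]
df = pd.DataFrame(baltJson['data'], columns=cols)
print(df.shape)  # (2, 2)
```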

Writing Data

Sometimes, you'll want to create a subset of a given data source. The following simply reads in the same Baltimore data, creates a subset, and saves it off to a different .csv file.


In [30]:
# first read the csv file.  Note that this cell deals with .csv files only;
# for an .xls or .xlsx, use pd.read_excel instead.  This assumes a CSV export
# of the same dataset saved alongside the notebook.
cameraData = pd.read_csv('Baltimore_Fixed_Speed_Cameras.csv')

# take a subset of the columns by position
# (.ix is deprecated in pandas; use .iloc for positional indexing)
grabbedData = cameraData.iloc[:, 3:]

# then save it to a different csv file
# this is equivalent to R's write.table() command
grabbedData.to_csv('baltimore_subset.csv', sep=',', index=False)

newData = pd.read_csv('baltimore_subset.csv')
newData.head()


Out[30]:
crossStreet intersection Location 1
0 Benson Ave Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
1 Benson Ave Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
2 Pine Heights Wilkens Ave & Pine Heights (39.2720252302, -76.676960806)
3 33rd St The Alameda & 33rd St (39.3285013141, -76.5953545714)
4 The Alameda E 33rd & The Alameda (39.3283410623, -76.5953594625)
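
Selecting columns by position works, but selecting them by name is more robust if the source file ever reorders its columns. A sketch, using an inline one-row sample with the same column names as the camera file:

```python
import io

import pandas as pd

# inline stand-in for the camera file (same column names as above)
sample = io.StringIO(
    'address,direction,street,crossStreet,intersection,Location 1\n'
    'S CATON AVE & BENSON AVE,N/B,Caton Ave,Benson Ave,Caton Ave & Benson Ave,"(39.27, -76.67)"\n'
)
cameraData = pd.read_csv(sample)

# select by name rather than position
subset = cameraData[['crossStreet', 'intersection', 'Location 1']]
subset.to_csv('baltimore_subset_by_name.csv', index=False)
print(list(subset.columns))
```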

Placeholder for web scraping.


In [ ]:
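
Until that section is written, here is a minimal link scraper built entirely on the standard library's html.parser, fed a small literal HTML string. For real pages you would fetch the HTML with urllib.request first, and in practice a third-party parser such as BeautifulSoup is usually more convenient:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = '<html><body><a href="https://data.baltimorecity.gov">data</a></body></html>'
parser = LinkParser()
parser.feed(page)
print(parser.links)  # ['https://data.baltimorecity.gov']
```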