A one-hour (or less) tour of Redis.
tl;dr version:
Redis stands for REmote DIctionary Server. The Try Redis app is an easy way to take a quick tour; for a few more details, read the introduction to data types.
Redis is a server process that you connect to with a client. On your VM, you can start it with the redis-server command, but it's best to run it in its own terminal or as a background daemon.
For our VM, the server is already running. You just need to connect to it with a client. You can do this in the shell using redis-cli, at which point you can send commands to and receive replies from Redis directly.
Here, though, let's use it with Python. We'll probably want some of Python's other facilities for reading files, control flow, managing variables, etc.
Note: this is a Python 2 notebook.
In [ ]:
import redis
Note: if the import above fails with an ImportError, just do this in the shell:
% sudo apt-get install python-redis
(Your password is "vagrant".)
In [ ]:
import redis
Remember, we need to connect to the server, using Python as the client, just as we would connect to a database server. The call below connects using the default host and port (localhost:6379), which is where the Redis server on our VMs listens.
In [ ]:
r = redis.StrictRedis()
The simplest use of Redis is as a key-value store. We can use the get and set commands to stash values for arbitrary keys.
In [ ]:
r.set('hi', 5)
In [ ]:
r.get('hi')
In [ ]:
r.get('bye')
In [ ]:
r.set('bye', 500)
In [ ]:
r.get('bye')
Not particularly fancy, but useful.
Why is this different from just using Python variables? For one thing, it's a server, so you can have multiple clients connecting.
In [ ]:
r2 = redis.StrictRedis()
r2.get('bye')
In [ ]:
r2.set('new key', 10)
In [ ]:
r.get('new key')
r and r2 could be different programs, different users, or even different languages. Much like a full RDBMS environment, the server backend supports multiple concurrent clients. Unlike an RDBMS, though, Redis doesn't have the same sophisticated notion of access controls, so any connecting client can read, change, or delete any data.
Just storing keys and values on a server still isn't terribly exciting. Keep in mind that Redis is a data structure server, so it's more interesting to look at some of its data structures. We'll start with counters, which (unsurprisingly) track and update counts of things.
In [ ]:
r.get('hi')
In [ ]:
# increment the key 'hi'
r.incr('hi')
In [ ]:
r.incr('hi')
In [ ]:
r.incr('hi', 20)
In [ ]:
r.decr('hi')
In [ ]:
r.decr('hi', 3)
In [ ]:
r.get('hi')
Internally, Redis stores all values as strings, so keep in mind that you'll have to cast values before doing math with them.
In [ ]:
r.get('hi') * 5
In [ ]:
int(r.get('hi')) * 5
Counters are just the beginning. Next, we have sets:
In [ ]:
r.sadd('my set', 'thing one')
In [ ]:
r.sadd('my set', 'thing two', 'thing three', 'something else')
In [ ]:
r.smembers('my set')
In [ ]:
r.sadd('another set', 'thing two', 'thing three', 55, 'thing six')
In [ ]:
r.smembers('another set')
In [ ]:
r.sinter('my set', 'another set')
In [ ]:
r.sunion('my set', 'another set')
And it's Python, so we can do obvious things like:
In [ ]:
len(r.smembers('my set'))
In [ ]:
[x.upper() for x in r.smembers('my set')]
See what's going on here? Redis stores the data structures on the server, but you can still manipulate them as if they were any other Python variable. The differences are that they live on the server, so they can be shared, and that every operation incurs communication overhead between the client and the server.
So doesn't that slow things down? Doesn't Python already have a built-in set() type? (Yes, it does.) Why is it worth the overhead?
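One answer: the data lives in one shared place. Here's a toy in-process sketch of the difference (a plain Python class standing in for the server; this is not real Redis, just an illustration): every client of a server-side store sees the same data, while each local set() is private to its own program.

```python
# Toy sketch: a "server" that many clients share, vs. private local sets.
# This is a plain Python stand-in, not real Redis.
class ToyStore(object):
    def __init__(self):
        self.data = {}

    def sadd(self, key, *members):
        self.data.setdefault(key, set()).update(members)

    def smembers(self, key):
        return self.data.get(key, set())

server = ToyStore()
client_a = server   # in real life these would be separate
client_b = server   # processes connecting over the network

client_a.sadd('tags', 'redis')      # one client writes...
print(client_b.smembers('tags'))    # ...and the other client sees it

local_a, local_b = set(), set()     # two local sets, by contrast,
local_a.add('redis')                # don't share anything
print(local_b)                      # still empty
```

In real Redis the clients are separate processes (or machines), which is exactly what a local set() can't give you.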
More interesting, perhaps, are sorted sets.
In [ ]:
r.zadd('sorted', 5, 'blue')
r.zadd('sorted', 3, 'red')
r.zadd('sorted', 7, 'purple')
r.zadd('sorted', 10, 'pink')
r.zadd('sorted', 6, 'grey')
In [ ]:
r.zrangebyscore('sorted', 0, 10)
In [ ]:
r.zrevrangebyscore('sorted', 100, 0, withscores=True)
In [ ]:
r.zrank('sorted', 'red')
In [ ]:
r.zincrby('sorted', 'red', 5)
In [ ]:
r.zrevrangebyscore('sorted', 100, 0, withscores=True)
In [ ]:
r.zrank('sorted', 'red')
Here we've created a set that stores a score for each member and automatically keeps the members sorted by score. You can add new members or update scores at any time, and fetch the rank order whenever you like. Think "top ten anything".
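You can think of a sorted set as a mapping from member to score that always hands members back in score order. A toy dict-based sketch of those semantics (this is just to show the behavior; it is not how Redis implements sorted sets internally):

```python
# Toy sketch of sorted-set semantics using a plain dict of member -> score.
scores = {'blue': 5, 'red': 3, 'purple': 7, 'pink': 10, 'grey': 6}

def zrangebyscore(lo, hi):
    # members with lo <= score <= hi, ascending by score
    return sorted((m for m, s in scores.items() if lo <= s <= hi),
                  key=lambda m: scores[m])

def zrank(member):
    # 0-based position of member in ascending score order
    return zrangebyscore(float('-inf'), float('inf')).index(member)

print(zrangebyscore(0, 10))   # ['red', 'blue', 'grey', 'purple', 'pink']
print(zrank('red'))           # 0 -- 'red' has the lowest score
```

The real thing keeps members sorted as you insert them rather than re-sorting on every read, which is why updates and range queries are both fast.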
In [ ]:
r.zadd('sales:10pm', 3, 'p1')
r.zadd('sales:10pm', 1, 'p3')
r.zadd('sales:10pm', 12, 'p1')
r.zadd('sales:10pm', 5, 'p2')
r.zadd('sales:11pm', 4, 'p1')
r.zadd('sales:11pm', 8, 'p2')
r.zadd('sales:11pm', 5, 'p2')
r.zadd('sales:11pm', 2, 'p1')
r.zadd('sales:11pm', 7, 'p1')
csvkit alone won't do this kind of math for you, though csvsql could help. You could load your orders into R and do it there, but perhaps you don't remember the R and dplyr commands. With a little loop of Python, you can throw all this data at Redis and it will answer useful questions for you.
In [ ]:
r.zrevrangebyscore('sales:10pm', 100, 0, withscores=True)
In [ ]:
r.zrevrangebyscore('sales:11pm', 100, 0, withscores=True)
In [ ]:
r.zunionstore('sales:combined', ['sales:10pm', 'sales:11pm'])
In [ ]:
r.zrevrangebyscore('sales:combined', 100, 0, withscores=True)
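Two things are worth noting in the cells above. First, zadd overwrites the score when a member is added again, so after the repeated adds, 'sales:10pm' holds p1=12, p2=5, p3=1 and 'sales:11pm' holds p1=7, p2=5. Second, zunionstore combines sets by summing scores by default. A quick dict-based sketch of that SUM aggregation:

```python
# Sketch of zunionstore's default SUM aggregation, using plain dicts.
# The scores reflect the cells above: zadd keeps only the *last* score
# written for each member.
sales_10pm = {'p1': 12, 'p2': 5, 'p3': 1}
sales_11pm = {'p1': 7, 'p2': 5}

combined = {}
for hour in (sales_10pm, sales_11pm):
    for product, qty in hour.items():
        # sum the score for each member across the input sets
        combined[product] = combined.get(product, 0) + qty

print(combined)   # {'p1': 19, 'p2': 10, 'p3': 1}
```

Redis also supports MIN and MAX aggregation if summing isn't what you want.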
In [ ]:
import csv
MAX_COUNT = 10000
count = 0
fp = open('bikeshare-q1.csv', 'rb')
reader = csv.DictReader(fp)
In [ ]:
# read up to MAX_COUNT rides, counting start/end stations
# and recording each bike's sequence of end stations
while count < MAX_COUNT:
    try:
        ride = reader.next()
    except StopIteration:
        break  # fewer than MAX_COUNT rows in the file
    r.zincrby('start_station', ride['start_station'], 1)
    r.zincrby('end_station', ride['end_station'], 1)
    r.rpush('bike:%s' % ride['bike_id'], ride['end_station'])
    count += 1
In [ ]:
r.zrevrangebyscore('start_station', 10000, 0, start=0, num=10, withscores=True, score_cast_func=int)
In [ ]:
print 'last bike seen:', ride['bike_id']
r.lrange('bike:%s' % ride['bike_id'], 0, 50)
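The rpush/lrange pair above uses Redis's list type: rpush appends to the right end of a list, and lrange returns a slice whose stop index is inclusive, unlike Python slicing. A toy sketch of those semantics with plain Python lists (the bike id and station names here are made up for illustration):

```python
# Toy sketch of RPUSH/LRANGE semantics; keys map to plain Python lists.
lists = {}

def rpush(key, value):
    lists.setdefault(key, []).append(value)   # append on the right

def lrange(key, start, stop):
    # Redis's stop index is inclusive (for nonnegative indices),
    # so we extend the Python slice by one
    return lists.get(key, [])[start:stop + 1]

rpush('bike:W00123', 'Dupont Circle')   # made-up bike id and stations
rpush('bike:W00123', 'Adams Morgan')
rpush('bike:W00123', 'Columbia Heights')
print(lrange('bike:W00123', 0, 1))   # first two stops
```

So the lrange call in the cell above returns up to the first 51 end stations that bike visited, in ride order.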
This is just scratching the surface. There are Redis client libraries for most major languages. The Redis docs cover many more data structure types, including some very sophisticated ones that few of us would want to code up ourselves. The docs also explain how to get a ton of performance out of Redis with a few short tips (the loop above was much slower than Redis normally is!). Keep in mind that Redis runs in memory, so you have to be thoughtful about persistence if you want data to stick around after a restart (see the docs for details). Redis also supports very different kinds of models, like pub/sub for inter-process communication, which it handles very well. And it's a useful backend for queueing libraries like RQ and Celery, which offer another easy-to-understand approach to distributed processing.