A one-hour (or less) tour of Redis.
tl;dr version:
Redis stands for REmote DIctionary Server. The Try Redis app is an easy way to take a quick tour; for a few more details, read the introduction to data types.
Redis is a server process that you connect to with a client. On your VM, you can start it with the redis-server command, but it's best to run it in its own terminal or as a background daemon.
For our VM, the server is already running. You just need to connect to it with a client. You can do this in the shell using redis-cli, at which point you can send commands to and receive replies from Redis directly.
Here, though, let's use it with Python. We'll probably want some of Python's other facilities for reading files, control flow, managing variables, etc.
Note: this is a Python 2 notebook.
In [ ]:
import redis
Note: if the import above fails with an ImportError, just do this in the shell:
% sudo apt-get install python-redis
(Your password is "vagrant".)
In [ ]:
import redis
Remember, we need to connect to the server, using Python as the client, just as we would connect to a database server. The call below connects using the default host and port (localhost:6379), which is where the Redis server on our VMs listens.
In [ ]:
r = redis.StrictRedis()
The simplest use of Redis is as a key-value store. We can use the get and set commands to stash values for arbitrary keys.
In [ ]:
r.set('hi', 5)
In [ ]:
r.get('hi')
In [ ]:
r.get('bye')
In [ ]:
r.set('bye', 500)
In [ ]:
r.get('bye')
Not particularly fancy, but useful.
Why is this different from just using Python variables? For one thing, it's a server, so you can have multiple clients connecting.
In [ ]:
r2 = redis.StrictRedis()
r2.get('bye')
In [ ]:
r2.set('new key', 10)
In [ ]:
r.get('new key')
r and r2 could be different programs, different users, or even different languages. Much like a full RDBMS environment, the server backend supports multiple concurrent clients. Unlike an RDBMS, though, Redis doesn't have the same sophisticated notion of access controls, so any connecting client can read, change, or delete any data.
Just storing keys and values on a server still isn't terribly exciting. Keep in mind that Redis is a data structure server, so it's more interesting to look at some of its data structures. We'll start with counters, which (unsurprisingly) track and update counts of things.
In [ ]:
r.get('hi')
In [ ]:
# increment the key 'hi'
r.incr('hi')
In [ ]:
r.incr('hi')
In [ ]:
r.incr('hi', 20)
In [ ]:
r.decr('hi')
In [ ]:
r.decr('hi', 3)
In [ ]:
r.get('hi')
Internally, Redis stores all values as strings, so keep in mind that you'll have to cast values before doing math with them.
In [ ]:
r.get('hi') * 5
In [ ]:
int(r.get('hi')) * 5
Counters are just the beginning. Next, we have sets:
In [ ]:
r.sadd('my set', 'thing one')
In [ ]:
r.sadd('my set', 'thing two', 'thing three', 'something else')
In [ ]:
r.smembers('my set')
In [ ]:
r.sadd('another set', 'thing two', 'thing three', 55, 'thing six')
In [ ]:
r.smembers('another set')
In [ ]:
r.sinter('my set', 'another set')
In [ ]:
r.sunion('my set', 'another set')
And it's Python, so we can do obvious things like:
In [ ]:
len(r.smembers('my set'))
In [ ]:
[x.upper() for x in r.smembers('my set')]
See what's going on here? Redis stores the data structures on the server, but you can still manipulate them as if they were any other Python variable. The differences are that they live on the server, so they can be shared, and that every operation incurs communication overhead between the client and the server.
So doesn't that slow things down? Doesn't Python already have a built-in set() type? (Yes, it does.) Why is it worth the overhead?
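One answer: the data lives in one shared place. Here's a toy in-process sketch of the difference (a plain Python class standing in for the server; this is not real Redis, just an illustration): every client of a server-side store sees the same data, while each local set() is private to its own program.

```python
# Toy sketch: a "server" that many clients share, vs. private local sets.
# This is a plain Python stand-in, not real Redis.
class ToyStore(object):
    def __init__(self):
        self.data = {}

    def sadd(self, key, *members):
        self.data.setdefault(key, set()).update(members)

    def smembers(self, key):
        return self.data.get(key, set())

server = ToyStore()
client_a = server   # in real life these would be separate
client_b = server   # processes connecting over the network

client_a.sadd('tags', 'redis')      # one client writes...
print(client_b.smembers('tags'))    # ...and the other client sees it

local_a, local_b = set(), set()     # two local sets, by contrast,
local_a.add('redis')                # don't share anything
print(local_b)                      # still empty
```

In real Redis the clients are separate processes (or machines), which is exactly what a local set() can't give you.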
More interesting, perhaps, are sorted sets.
In [ ]:
r.zadd('sorted', 5, 'blue')
r.zadd('sorted', 3, 'red')
r.zadd('sorted', 7, 'purple')
r.zadd('sorted', 10, 'pink')
r.zadd('sorted', 6, 'grey')
In [ ]:
r.zrangebyscore('sorted', 0, 10)
In [ ]:
r.zrevrangebyscore('sorted', 100, 0, withscores=True)
In [ ]:
r.zrank('sorted', 'red')
In [ ]:
r.zincrby('sorted', 'red', 5)
In [ ]:
r.zrevrangebyscore('sorted', 100, 0, withscores=True)
In [ ]:
r.zrank('sorted', 'red')
Here we've created a set that stores a score for each member and automatically keeps the members sorted by score. You can add new members or update scores at any time, and fetch the rank order whenever you like. Think "top ten anything".
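You can think of a sorted set as a mapping from member to score that always hands members back in score order. A toy dict-based sketch of those semantics (this is just to show the behavior; it is not how Redis implements sorted sets internally):

```python
# Toy sketch of sorted-set semantics using a plain dict of member -> score.
scores = {'blue': 5, 'red': 3, 'purple': 7, 'pink': 10, 'grey': 6}

def zrangebyscore(lo, hi):
    # members with lo <= score <= hi, ascending by score
    return sorted((m for m, s in scores.items() if lo <= s <= hi),
                  key=lambda m: scores[m])

def zrank(member):
    # 0-based position of member in ascending score order
    return zrangebyscore(float('-inf'), float('inf')).index(member)

print(zrangebyscore(0, 10))   # ['red', 'blue', 'grey', 'purple', 'pink']
print(zrank('red'))           # 0 -- 'red' has the lowest score
```

The real thing keeps members sorted as you insert them rather than re-sorting on every read, which is why updates and range queries are both fast.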
In [ ]:
r.zadd('sales:10pm', 3, 'p1')
r.zadd('sales:10pm', 1, 'p3')
r.zadd('sales:10pm', 12, 'p1')
r.zadd('sales:10pm', 5, 'p2')
r.zadd('sales:11pm', 4, 'p1')
r.zadd('sales:11pm', 8, 'p2')
r.zadd('sales:11pm', 5, 'p2')
r.zadd('sales:11pm', 2, 'p1')
r.zadd('sales:11pm', 7, 'p1')
csvkit alone won't do this kind of math for you, though csvsql could help. You could load your orders into R and do it there, but perhaps you don't remember the R and dplyr commands. With a little loop of Python, you can throw all this data at Redis and it will answer useful questions for you.
In [ ]:
r.zrevrangebyscore('sales:10pm', 100, 0, withscores=True)
In [ ]:
r.zrevrangebyscore('sales:11pm', 100, 0, withscores=True)
In [ ]:
r.zunionstore('sales:combined', ['sales:10pm', 'sales:11pm'])
In [ ]:
r.zrevrangebyscore('sales:combined', 100, 0, withscores=True)
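Two things are worth noting in the cells above. First, zadd overwrites the score when a member is added again, so after the repeated adds, 'sales:10pm' holds p1=12, p2=5, p3=1 and 'sales:11pm' holds p1=7, p2=5. Second, zunionstore combines sets by summing scores by default. A quick dict-based sketch of that SUM aggregation:

```python
# Sketch of zunionstore's default SUM aggregation, using plain dicts.
# The scores reflect the cells above: zadd keeps only the *last* score
# written for each member.
sales_10pm = {'p1': 12, 'p2': 5, 'p3': 1}
sales_11pm = {'p1': 7, 'p2': 5}

combined = {}
for hour in (sales_10pm, sales_11pm):
    for product, qty in hour.items():
        # sum the score for each member across the input sets
        combined[product] = combined.get(product, 0) + qty

print(combined)   # {'p1': 19, 'p2': 10, 'p3': 1}
```

Redis also supports MIN and MAX aggregation if summing isn't what you want.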
In [ ]:
import csv
MAX_COUNT = 10000
count = 0
fp = open('bikeshare-q1.csv', 'rb')
reader = csv.DictReader(fp)
In [ ]:
# read up to MAX_COUNT rides, counting start/end stations
# and recording each bike's sequence of end stations
while count < MAX_COUNT:
    try:
        ride = reader.next()
    except StopIteration:
        break  # fewer than MAX_COUNT rows in the file
    r.zincrby('start_station', ride['start_station'], 1)
    r.zincrby('end_station', ride['end_station'], 1)
    r.rpush('bike:%s' % ride['bike_id'], ride['end_station'])
    count += 1
In [ ]:
r.zrevrangebyscore('start_station', 10000, 0, start=0, num=10, withscores=True, score_cast_func=int)
In [ ]:
print 'last bike seen:', ride['bike_id']
r.lrange('bike:%s' % ride['bike_id'], 0, 50)
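The rpush/lrange pair above uses Redis's list type: rpush appends to the right end of a list, and lrange returns a slice whose stop index is inclusive, unlike Python slicing. A toy sketch of those semantics with plain Python lists (the bike id and station names here are made up for illustration):

```python
# Toy sketch of RPUSH/LRANGE semantics; keys map to plain Python lists.
lists = {}

def rpush(key, value):
    lists.setdefault(key, []).append(value)   # append on the right

def lrange(key, start, stop):
    # Redis's stop index is inclusive (for nonnegative indices),
    # so we extend the Python slice by one
    return lists.get(key, [])[start:stop + 1]

rpush('bike:W00123', 'Dupont Circle')   # made-up bike id and stations
rpush('bike:W00123', 'Adams Morgan')
rpush('bike:W00123', 'Columbia Heights')
print(lrange('bike:W00123', 0, 1))   # first two stops
```

So the lrange call in the cell above returns up to the first 51 end stations that bike visited, in ride order.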
This is just scratching the surface. There are Redis client libraries for most major languages. The Redis docs cover many more data structure types, including some very sophisticated ones that few of us would want to code up ourselves. The docs also explain how to get a ton of performance out of Redis with a few short tips (the loop above was much slower than Redis normally is!). Keep in mind that Redis runs in memory, so you have to be thoughtful about persistence if you want data to stick around after a restart (see the docs for details). Redis also supports very different kinds of models, like pub/sub for inter-process communication, which it handles very well. And it's a useful backend for queueing libraries like RQ and Celery, which offer another easy-to-understand approach to distributed processing.