Cassandra Data Modeling Refresher

This presentation is open source. All the code is executable. To run locally, clone

git@github.com:rustyrazorblade/python-presentation.git

and then

pip install -r requirements.txt

Full permission (and encouragement!) granted to use this material for your own presentations.

By Jon Haddad

Overview

This is a quick refresher into Cassandra data modeling. We're going to go over keyspace and table creation first, then dig into how CQL translates to table layout.

We're going to use the ipython-cql extension extension which lets us run CQL queries directly from IPython notebook. It's available on pypi as ipython-cql. Any query can be executed. A one line query can be executed as:

%cql select * from table

And multiline queries require a double %% like so:

%%cql select * 
       from table

These queries can be executed in cqlsh, just remove the %cql part.

A running Cassandra version > 2.0 required.

To execute a cell in this notebook, simple shift-enter. Cells can be edited and reexecuted as well.



In [ ]:

    
%load_ext cql

Keyspaces

A keyspaces is a container for tables in Cassandra.



In [ ]:

    
%cql DROP KEYSPACE if exists tutorial;
%cql CREATE KEYSPACE tutorial WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};



In [ ]:

    
%keyspace tutorial

Tables

Tables are where our data lives. It is similar in concept to a table in a RDBMS, but behaves differently. We can view all the tables in a keyspace by using desc tables on cqlsh, or by using %tables in ipython. If there aren't any tables yet, this will be pretty boring:



In [ ]:

    
%tables

Here's a few tables that define users and photos:



In [ ]:

    
%%cql

CREATE TABLE IF NOT EXISTS user (
  user_id uuid,
  name text,
  PRIMARY KEY (user_id)
)



In [ ]:

    
%%cql 

CREATE TABLE IF NOT EXISTS photo (
  photo_id uuid,
  name text,
  PRIMARY KEY (photo_id)
);



In [ ]:

    
%tables



In [ ]:

    
%desc photo



In [ ]:

    
%cql INSERT INTO user (user_id, name) VALUES (1d5dfcbe-57a6-4b23-906c-3e48db617f49, 'Jon')
%cql INSERT INTO user (user_id, name) VALUES (909c59a7-57ac-424d-97b8-a096800ab037, 'Steve')


print "Users:"
%cql SELECT * from user;



In [ ]:

    
%cql INSERT INTO photo (photo_id, name) VALUES (7dbf7758-e46c-4175-8f03-bf05d2c220d9, 'Handsome Cat')
print "Photos:"
%cql SELECT * from photo;

Partitions

Within a table, data lives in a partition. A partition is a physical grouping of related data.

Rows

Within a partition there can be many rows. The rows within a partition are sorted by the clustering key. For example, lets create a likes table. We want to keep track of which users have liked which photo, and we want to be able to query that effeciently. We know our query is always going to ask for "who liked this photo?". So we want all the likes for a particular photo to be stored together.



In [ ]:

    
%%cql

CREATE TABLE IF NOT EXISTS like (
  photo_id uuid,
  user_id uuid,
  primary key (photo_id, user_id)
);



In [ ]:

    
%tables

A primary key is written as (partition_key, clustering_key...)

When SSTAbles are written to disk, likes will be partitioned by photo_id, and sorted by user_id (the clustering key)

This makes queries predictable even for tens of thousands of rows by minimizing disk seeks.

Collections



In [ ]:

    
%%cql

CREATE TABLE users (
    id text PRIMARY KEY,
    given text,
    surname text,
    favs map<text, text>   // A map of text keys, and text values
)



In [ ]:

    
%%cql 

INSERT INTO users (id, given, surname, favs)
           VALUES ('jsmith', 'John', 'Smith', { 'fruit' : 'apple', 'band' : 'Beatles' })



In [ ]:

    
%cql select * from users;

Sets



In [ ]:

    
%%cql
CREATE TABLE images (
    name text PRIMARY KEY,
    owner text,
    date timestamp,
    tags set<text>
);



In [ ]:

    
%%cql 
INSERT INTO images (name, owner, date, tags)
            VALUES ('cat.jpg', 'jsmith', 'now', { 'kitten', 'cat', 'pet' });



In [ ]:

    
%cql select * from images;



In [ ]:

    
%%cql 
UPDATE images SET tags = tags + { 'cute', 'cuddly' } WHERE name = 'cat.jpg';



In [ ]:

    
%cql select * from images

Batches



In [ ]:

    
%%cql
BEGIN BATCH
INSERT INTO users (id, given, surname) VALUES ('jhaddad', 'Jon', 'Haddad');
INSERT INTO users (id, given, surname) VALUES ('ltran', 'Lina', 'Tran');
APPLY BATCH



In [ ]:

    
%cql  SELECT * from users;

TTL



In [ ]:

    
%cql INSERT INTO users (id, given, surname) VALUES ('ltillman', 'Luke', 'Tillman') USING TTL 1;
%cql SELECT * from users WHERE id = 'ltillman';



In [ ]:

    
%cql SELECT * from users WHERE id = 'ltillman';

Lightweight Transactions

Inserts do not do a check if the row already exists. This can lead to race conditions in your code. If you have a situation where this absolutely cannot happen, you can use a lightweight transaction. For example:



In [ ]:

    
%cql INSERT INTO users (id, given, surname) values ('psmith', 'Pete', 'Smith');
%cql INSERT INTO users (id, given, surname) values ('psmith', 'Paul', 'Smith');
%cql select * from users where id = 'psmith'; # overwrote the original!

Let's look at the original user account:



In [ ]:

    
%cql SELECT * from users where id = 'jhaddad';

Now, lets try to overwrite it with an INSERT using a lightweight transaction:



In [ ]:

    
%cql INSERT INTO users (id, given, surname) values ('jhaddad', 'Joseph', 'Haddad') IF NOT EXISTS

We can see the original record comes back if the transaction fails. We can verify the db has not been changed:



In [ ]:

    
%cql SELECT * from users where id = 'jhaddad';