Tutorial Brief

Video Tutorial: http://youtu.be/2b32-KVzBXQ

In many cases you don't have to go and collect the data because you have data in a local NoSQL DB that requires analysis.This could be a URL audit-trail for your website where you want to study visitors and trends. One common NoSQL Database is MongoDB which is the subject of this tutorial.

Why to use NoSQL?

NoSQL was designed to deal with the problems that came up when developers started dealing with large amount of data using the existing Relational Databases (also referred to as SQL Databases). The three main issues with SQL Databases were:

Changing Schema of Data

SQL Databases had fixed schema of tables which limited the ability to store new fields in your table. You have to change the table schema when ever you need a new field.

The Volume of Data in Storage

SQL Databases can store virtually billions of records in a single table but the problem was the overhead cost to your CPU and memory to be able to access this data efficiently.

The Frequency of Adding New Data

SQL databases are not designed to insert large amount of records in short time specially to large tables.

What did NoSQL Do?

  • Document based archives use a JSON-like format to store rows in a collection. This allows the collection to store data in any schema without changing the database.
  • NoSQL databases use different distributed clustering systems to store data in multiple machines and distribute the processing power over multiple machines. MongoDB uses GridFS and some other databases use Hadoop.
  • NoSQL dropped the support for relational data which required multiple read operations to ensure PK integrity in a relation. This reduced the cost of storing new data to simply storing the data on disk and generating PK in some systems.

Common NoSQL Databases

MongoDB

MongoDB (from "humongous") is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.

from Wikipedia

Apache Casandra

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters,[1] with asynchronous masterless replication allowing low latency operations for all clients.

from Wikipedia

Apache HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

from Wikipedia

CouchDB

Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being "a database that completely embraces the web".[1] It is a NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.[1] One of its distinguishing features is multi-master replication. CouchDB was first released in 2005 and later became an Apache project in 2008.

from Wikipedia

Cloud Only NoSQL (Managed Services)

Other Options:

PyMongo

Importing the Library


In [1]:
import pymongo

Creating a Connection


In [2]:
client_con = pymongo.MongoClient()

Exploring MongoDB

Listing Available Databases


In [3]:
client_con.database_names()


Out[3]:
[u'mydb', u'local', u'roshan', u'admin']

Listing Available Collections


In [4]:
roshan_db = client_con["roshan"]
roshan_db.collection_names()


Out[4]:
[u'system.indexes',
 u'Twitter',
 u'SearchTweets',
 u'TwitterFollowers',
 u'TwitterUserSearch',
 u'TCelebrities',
 u'TwitterFriendsIds',
 u'TwitterFollowersIds',
 u'TwitterIds',
 u'TwiiterTemp',
 u'test youtube']

Connecting to a Collection


In [5]:
twitter_col = roshan_db["Twitter"]

Counting Documents in a Collection


In [6]:
twitter_col.count()


Out[6]:
1045260

Finding a Single Document


In [7]:
doc = twitter_col.find_one()
doc


Out[7]:
{u'_id': ObjectId('543d819c105f1913c74ab20b'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Tue Oct 14 20:04:05 +0000 2014',
 u'entities': {u'hashtags': [],
  u'symbols': [],
  u'trends': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': None,
 u'id': 522115647623692288L,
 u'id_str': u'522115647623692288',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'ar',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://w.drr-alkalam.com/" rel="nofollow">\u062f\u0631\u0631 \u0627\u0644\u0643\u0644\u0627\u0645 3</a>',
 u'text': u'\u0627\u0644\u0632\u0646\u0643 \u064a\u0648\u062c\u062f \u0628\u0648\u0641\u0631\u0629 \u0627\u0644\u0644\u062d\u0645 \u0648\u0627\u0644\u0633\u0645\u0643 \u0648\u0627\u0644\u062d\u0644\u064a\u0628 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u062c\u0628\u0646 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u0628\u0642\u0648\u0644\u064a\u0627\u062a \u0648 \u0627\u0644\u0644\u0628\u0646, \u0643\u0645\u0627 \u064a\u0645\u0643\u0646 \u062a\u0646\u0627\u0648\u0644\u0647 \u0643\u0645\u0643\u0645\u0644\u0627\u062a \u063a\u0630\u0627\u0626\u064a\u0629.',
 u'timestamp_ms': u'1413317045494',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Mon Sep 15 09:21:34 +0000 2014',
  u'default_profile': True,
  u'default_profile_image': False,
  u'description': u'\u062a\u0631\u0628\u064a\u0647\u06c1 \u0629 \u0634\u0627\u064a\u0628 \u0648\u0648\u0644\u0648 \ufe8e\u0646\u0646\u064a\u064a\u064e \u0641\u0642\u064a\u0631 \ufe8e\u0639\u0637\u064a\u064a\u0621 \ufe8e\u0644\u062d\u0627\u062c\u0647\u06c1 \u0648\ufe8e\u0646\u0627 \u0645\u062d\u062a\u0627\u062c\u0647\u0627\u0627 \u060c PIN:264400B1',
  u'favourites_count': 0,
  u'follow_request_sent': None,
  u'followers_count': 133,
  u'following': None,
  u'friends_count': 119,
  u'geo_enabled': False,
  u'id': 2811023737L,
  u'id_str': u'2811023737',
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'',
  u'name': u'\u0639\u0640\u0627\u0645\u0640\u062f \u0627\u0644\u0640\u0639\u0640\u0637\u0640\u0627\u0648\u064a',
  u'notifications': None,
  u'profile_background_color': u'C0DEED',
  u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
  u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
  u'profile_background_tile': False,
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png',
  u'profile_link_color': u'0084B4',
  u'profile_sidebar_border_color': u'C0DEED',
  u'profile_sidebar_fill_color': u'DDEEF6',
  u'profile_text_color': u'333333',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'Aamdaladawe_',
  u'statuses_count': 209,
  u'time_zone': None,
  u'url': None,
  u'utc_offset': None,
  u'verified': False}}

Examining a Document

ObjectId


In [8]:
document_id = doc["_id"]
print type(document_id)
document_id


<class 'bson.objectid.ObjectId'>
Out[8]:
ObjectId('543d819c105f1913c74ab20b')

In [9]:
print document_id.generation_time.strftime("%Y-%m-%d %H:%M:%SZ%z")


2014-10-14 20:03:40Z+0000

Other Fields


In [10]:
doc["created_at"]


Out[10]:
u'Tue Oct 14 20:04:05 +0000 2014'

In [11]:
doc["text"]


Out[11]:
u'\u0627\u0644\u0632\u0646\u0643 \u064a\u0648\u062c\u062f \u0628\u0648\u0641\u0631\u0629 \u0627\u0644\u0644\u062d\u0645 \u0648\u0627\u0644\u0633\u0645\u0643 \u0648\u0627\u0644\u062d\u0644\u064a\u0628 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u062c\u0628\u0646 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u0628\u0642\u0648\u0644\u064a\u0627\u062a \u0648 \u0627\u0644\u0644\u0628\u0646, \u0643\u0645\u0627 \u064a\u0645\u0643\u0646 \u062a\u0646\u0627\u0648\u0644\u0647 \u0643\u0645\u0643\u0645\u0644\u0627\u062a \u063a\u0630\u0627\u0626\u064a\u0629.'

Unicode Trick


In [12]:
print doc["text"]


الزنك يوجد بوفرة اللحم والسمك والحليب الطازج و الجبن الطازج و البقوليات و اللبن, كما يمكن تناوله كمكملات غذائية.

Nested Documents


In [13]:
doc["user"]


Out[13]:
{u'contributors_enabled': False,
 u'created_at': u'Mon Sep 15 09:21:34 +0000 2014',
 u'default_profile': True,
 u'default_profile_image': False,
 u'description': u'\u062a\u0631\u0628\u064a\u0647\u06c1 \u0629 \u0634\u0627\u064a\u0628 \u0648\u0648\u0644\u0648 \ufe8e\u0646\u0646\u064a\u064a\u064e \u0641\u0642\u064a\u0631 \ufe8e\u0639\u0637\u064a\u064a\u0621 \ufe8e\u0644\u062d\u0627\u062c\u0647\u06c1 \u0648\ufe8e\u0646\u0627 \u0645\u062d\u062a\u0627\u062c\u0647\u0627\u0627 \u060c PIN:264400B1',
 u'favourites_count': 0,
 u'follow_request_sent': None,
 u'followers_count': 133,
 u'following': None,
 u'friends_count': 119,
 u'geo_enabled': False,
 u'id': 2811023737L,
 u'id_str': u'2811023737',
 u'is_translator': False,
 u'lang': u'en',
 u'listed_count': 0,
 u'location': u'',
 u'name': u'\u0639\u0640\u0627\u0645\u0640\u062f \u0627\u0644\u0640\u0639\u0640\u0637\u0640\u0627\u0648\u064a',
 u'notifications': None,
 u'profile_background_color': u'C0DEED',
 u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
 u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
 u'profile_background_tile': False,
 u'profile_image_url': u'http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png',
 u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png',
 u'profile_link_color': u'0084B4',
 u'profile_sidebar_border_color': u'C0DEED',
 u'profile_sidebar_fill_color': u'DDEEF6',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'Aamdaladawe_',
 u'statuses_count': 209,
 u'time_zone': None,
 u'url': None,
 u'utc_offset': None,
 u'verified': False}

In [14]:
doc["user"]["verified"]


Out[14]:
False

Unicode Trick for dict


In [15]:
for k,v in doc["user"].iteritems():
    print "%s: %s" % (k,v)


follow_request_sent: None
profile_use_background_image: True
id: 2811023737
verified: False
profile_image_url_https: https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png
profile_sidebar_fill_color: DDEEF6
is_translator: False
geo_enabled: False
profile_text_color: 333333
followers_count: 133
protected: False
location: 
default_profile_image: False
id_str: 2811023737
utc_offset: None
statuses_count: 209
description: تربيهہ ة شايب وولو ﺎننييَ فقير ﺎعطييء ﺎلحاجهہ وﺎنا محتاجهاا ، PIN:264400B1
friends_count: 119
profile_link_color: 0084B4
profile_image_url: http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png
notifications: None
profile_background_image_url_https: https://abs.twimg.com/images/themes/theme1/bg.png
profile_background_color: C0DEED
profile_background_image_url: http://abs.twimg.com/images/themes/theme1/bg.png
screen_name: Aamdaladawe_
lang: en
profile_background_tile: False
favourites_count: 0
name: عـامـد الـعـطـاوي
url: None
created_at: Mon Sep 15 09:21:34 +0000 2014
contributors_enabled: False
time_zone: None
profile_sidebar_border_color: C0DEED
default_profile: True
following: None
listed_count: 0

Finding Multiple Documents


In [16]:
verified_users = twitter_col.find({"user.verified":True})

In [17]:
verified_users.count()


Out[17]:
672

In [18]:
import pandas

users = pandas.DataFrame([msg["user"] for msg in verified_users])
users


Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 672 entries, 0 to 671
Data columns (total 38 columns):
contributors_enabled                  672  non-null values
created_at                            672  non-null values
default_profile                       672  non-null values
default_profile_image                 672  non-null values
description                           672  non-null values
favourites_count                      672  non-null values
follow_request_sent                   0  non-null values
followers_count                       672  non-null values
following                             0  non-null values
friends_count                         672  non-null values
geo_enabled                           672  non-null values
id                                    672  non-null values
id_str                                672  non-null values
is_translator                         672  non-null values
lang                                  672  non-null values
listed_count                          672  non-null values
location                              672  non-null values
name                                  672  non-null values
notifications                         0  non-null values
profile_background_color              672  non-null values
profile_background_image_url          672  non-null values
profile_background_image_url_https    672  non-null values
profile_background_tile               672  non-null values
profile_banner_url                    578  non-null values
profile_image_url                     672  non-null values
profile_image_url_https               672  non-null values
profile_link_color                    672  non-null values
profile_sidebar_border_color          672  non-null values
profile_sidebar_fill_color            672  non-null values
profile_text_color                    672  non-null values
profile_use_background_image          672  non-null values
protected                             672  non-null values
screen_name                           672  non-null values
statuses_count                        672  non-null values
time_zone                             656  non-null values
url                                   631  non-null values
utc_offset                            656  non-null values
verified                              672  non-null values
dtypes: bool(9), float64(1), int64(6), object(22)

Filtering out duplicates


In [19]:
users_unique = users.drop_duplicates(cols=["id"])
print len(users_unique)


222

Visualizing the realtion between number of tweets and the number of followers


In [20]:
x = users_unique["followers_count"]
y = users_unique["statuses_count"]
z = users_unique["friends_count"]
plt.scatter(x, y, c=z, alpha=0.4, s=200, cmap=plt.cm.Accent)
plt.yscale("symlog")
plt.xscale("symlog")
plt.ylim(y.min(),y.max())
plt.xlim(x.min(),x.max())
plt.grid()
plt.xlabel("followers_count")
plt.ylabel("statuses_count")
plt.colorbar()
plt.show()