Video Tutorial: http://youtu.be/2b32-KVzBXQ
In many cases you don't have to collect the data yourself because you already have it in a local NoSQL database, waiting to be analyzed. This could be a URL audit trail for your website that you want to mine for visitor behavior and trends. One common NoSQL database is MongoDB, which is the subject of this tutorial.
NoSQL databases were designed to address the problems that arose when developers started handling large amounts of data with existing relational databases (also referred to as SQL databases). The three main issues with SQL databases were:
SQL databases have a fixed table schema, which limits the ability to store new fields. You have to change the table schema whenever you need a new field.
SQL databases can store billions of records in a single table, but accessing that data efficiently carries a high CPU and memory overhead.
SQL databases are not designed to insert a large number of records in a short time, especially into large tables.
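The first issue can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a SQL database and a plain list of dictionaries as a stand-in for a document collection. The table name `visits` and all field values are made up for this illustration.

```python
import sqlite3

# In-memory SQL table with a fixed schema (hypothetical example data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (url TEXT, visitor TEXT)")
conn.execute("INSERT INTO visits VALUES (?, ?)", ("/home", "alice"))

# Storing a field the schema does not know about fails ...
try:
    conn.execute(
        "INSERT INTO visits (url, visitor, referrer) VALUES (?, ?, ?)",
        ("/about", "bob", "search"),
    )
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True

# ... until the table schema is changed explicitly.
conn.execute("ALTER TABLE visits ADD COLUMN referrer TEXT")
conn.execute(
    "INSERT INTO visits (url, visitor, referrer) VALUES (?, ?, ?)",
    ("/about", "bob", "search"),
)
sql_rows = conn.execute("SELECT url, visitor, referrer FROM visits").fetchall()

# A document store has no such step: each record carries its own fields.
documents = [
    {"url": "/home", "visitor": "alice"},
    {"url": "/about", "visitor": "bob", "referrer": "search"},
]
print(insert_failed, sql_rows)
```

The row inserted before the schema change simply has no value for the new column, while the two dictionaries coexist without any shared schema.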
MongoDB (from "humongous") is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.
from Wikipedia
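To make "JSON-like documents" concrete, here is a minimal sketch using only Python's standard json module. The sample document is invented for illustration. It also hints at why a richer format such as BSON is useful: plain JSON has no native date type, so the datetime value below has to be converted to a string by hand.

```python
import json
from datetime import datetime, timezone

# A nested, JSON-like document (hypothetical sample, not real data).
tweet = {
    "text": "hello world",
    "user": {"screen_name": "example", "verified": True},
    "tags": ["nosql", "mongodb"],
}

# Round trip through JSON text: the nested structure survives unchanged.
restored = json.loads(json.dumps(tweet))

# Plain JSON cannot encode a datetime directly ...
tweet_with_date = dict(tweet, created=datetime(2014, 1, 1, tzinfo=timezone.utc))
try:
    json.dumps(tweet_with_date)
    date_supported = True
except TypeError:
    date_supported = False

# ... so it has to be turned into a string first.
as_json = json.dumps(tweet_with_date, default=lambda value: value.isoformat())
print(restored == tweet, date_supported)
```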
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
from Wikipedia
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
from Wikipedia
Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being "a database that completely embraces the web". It is a NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API. One of its distinguishing features is multi-master replication. CouchDB was first released in 2005 and later became an Apache project in 2008.
from Wikipedia
Documentation: http://api.mongodb.org/python/current/
In [1]:
import pymongo
In [2]:
client_con = pymongo.MongoClient()
In [3]:
# database_names() was removed in PyMongo 4; list_database_names() replaces it
client_con.list_database_names()
Out[3]:
In [4]:
roshan_db = client_con["roshan"]
# collection_names() was removed in PyMongo 4; list_collection_names() replaces it
roshan_db.list_collection_names()
Out[4]:
In [5]:
twitter_col = roshan_db["Twitter"]
In [6]:
# Collection.count() was removed in PyMongo 4; an empty filter counts every document
twitter_col.count_documents({})
Out[6]:
In [7]:
doc = twitter_col.find_one()
doc
Out[7]:
Documentation: http://docs.mongodb.org/manual/reference/object-id/
In [8]:
document_id = doc["_id"]
print(type(document_id))
document_id
Out[8]:
In [9]:
print(document_id.generation_time.strftime("%Y-%m-%d %H:%M:%SZ%z"))
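The generation time is not stored separately: the first 4 of an ObjectId's 12 bytes hold a big-endian Unix timestamp in seconds. A minimal sketch of that decoding, using only the standard library, follows. The helper name `objectid_generation_time` and the sample hex string are illustrative; in practice the `generation_time` attribute used above does this for you.

```python
from datetime import datetime, timezone

def objectid_generation_time(hex_id):
    """Decode the creation time embedded in a 24-character ObjectId hex string."""
    if len(hex_id) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    raw = bytes.fromhex(hex_id)
    # The leading 4 bytes are seconds since the Unix epoch, big-endian.
    seconds = int.from_bytes(raw[:4], byteorder="big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Sample id chosen for illustration only.
sample_time = objectid_generation_time("507f1f77bcf86cd799439011")
print(sample_time.strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Because the timestamp leads the id, sorting documents by `_id` also sorts them roughly by creation time.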
In [10]:
doc["created_at"]
Out[10]:
In [11]:
doc["text"]
Out[11]:
In [12]:
print(doc["text"])
In [13]:
doc["user"]
Out[13]:
In [14]:
doc["user"]["verified"]
Out[14]:
In [15]:
# Print every field of the embedded user document
for k, v in doc["user"].items():
    print("%s: %s" % (k, v))
In [16]:
verified_users = twitter_col.find({"user.verified":True})
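The key `"user.verified"` uses MongoDB's dot notation to reach a field inside the embedded `user` document. The sketch below mimics that lookup in plain Python for the simple case of nested dictionaries and an equality filter. `get_path` and `matches` are hypothetical helpers, not part of pymongo, and real MongoDB matching also handles arrays and query operators.

```python
def get_path(document, dotted_key, default=None):
    """Follow a dotted key such as "user.verified" through nested dicts."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

def matches(document, query):
    """Return True when every dotted key in the query equals the expected value."""
    return all(get_path(document, key) == expected
               for key, expected in query.items())

# Made-up sample documents shaped like the tweets in this tutorial.
sample_docs = [
    {"text": "a", "user": {"id": 1, "verified": True}},
    {"text": "b", "user": {"id": 2, "verified": False}},
    {"text": "c"},  # no embedded user document at all
]
verified = [d for d in sample_docs if matches(d, {"user.verified": True})]
print([d["text"] for d in verified])
```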
In [17]:
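# Cursor.count() was removed in PyMongo 4; count with the same filter instead
twitter_col.count_documents({"user.verified": True})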
Out[17]:
In [18]:
import pandas
users = pandas.DataFrame([msg["user"] for msg in verified_users])
users
Out[18]:
In [19]:
# The "cols" keyword was renamed to "subset" in later pandas releases
users_unique = users.drop_duplicates(subset=["id"])
print(len(users_unique))
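With its default settings, `drop_duplicates` keeps the first row for each distinct value and discards later repeats. That matters here because a user who tweeted several times appears once per tweet. A rough pandas-free sketch of the same behavior on a list of dictionaries is shown below; `unique_by` and the sample records are made up for illustration.

```python
def unique_by(records, key):
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique

# Hypothetical user records: user 1 appears twice with different counts.
sample_users = [
    {"id": 1, "followers_count": 10},
    {"id": 2, "followers_count": 5},
    {"id": 1, "followers_count": 12},
]
deduplicated = unique_by(sample_users, "id")
print(len(deduplicated))
```

Note that keeping the first occurrence means the counts reflect whichever tweet happened to come first, not necessarily the most recent values.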
In [20]:
import matplotlib.pyplot as plt

x = users_unique["followers_count"]
y = users_unique["statuses_count"]
z = users_unique["friends_count"]
# One point per user; color encodes friends_count
plt.scatter(x, y, c=z, alpha=0.4, s=200, cmap=plt.cm.Accent)
# symlog handles the wide range of values, including zeros
plt.yscale("symlog")
plt.xscale("symlog")
plt.ylim(y.min(), y.max())
plt.xlim(x.min(), x.max())
plt.grid()
plt.xlabel("followers_count")
plt.ylabel("statuses_count")
plt.colorbar(label="friends_count")
plt.show()