Some very basic information about the three data files.

For each file, the analysis prints the column names and, for every column, its data type and its counts of unique and missing values. For numeric columns it also reports the min, max, and mean. Finally, it draws a scatter-matrix plot of the numeric columns against each other.

The first analysis is for the file twitter_user_data_data.csv

As the output below shows, there are no missing values, and has_profile and has_pic are boolean flags: they are stored as int64 columns but take only the values 0 and 1.
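
One way to confirm that a 0/1 integer column really is a boolean flag is to check that its distinct values form a subset of {0, 1}. The sketch below does this on a small made-up frame; the helper name `is_binary_flag` and the sample values are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def is_binary_flag(col):
    # True when every non-missing value in the column is 0 or 1
    # (is_binary_flag is a hypothetical helper, not from the notebook)
    return set(col.dropna().unique()) <= {0, 1}

# Made-up sample data reusing two column names from the files
sample = pd.DataFrame({
    'has_pic': [1, 1, 0, 1],
    'num_of_lists': [0, 3, 12, 1],
})

# Names of the columns that behave like boolean flags
flags = [name for name in sample.columns if is_binary_flag(sample[name])]
print(flags)

# A flag column can then be converted to a real boolean dtype
sample['has_pic'] = sample['has_pic'].astype(bool)
print(sample.dtypes)
```

Converting the flags to `bool` is optional; leaving them as int64 keeps the mean readable as the fraction of rows with the flag set.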


In [26]:
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd

def expAnalysis(data, colNumber):
    # Print the shape, the column names, and a per-column summary of a
    # DataFrame, then draw a scatter-matrix plot.
    # colNumber is accepted for the calls in this notebook but is not used.
    dimensions = data.shape
    print('The dimensions are height: ' + str(dimensions[0]) + ' width: ' + str(dimensions[1]))
    print('The column names are: ')
    for name in data.columns:
        print('  ' + name)

    print('Information about the columns')
    for name in data.columns:
        col = data[name]
        print('  ' + name + ' is ' + str(col.dtype))
        print('    unique values: ' + str(len(set(col))))
        print('    missing values: ' + str(col.isnull().sum()))
        # min, max, and mean are reported only for int64 and float64 columns
        if col.dtype == np.int64 or col.dtype == np.float64:
            print('    min: ' + str(col.min()))
            print('    max: ' + str(col.max()))
            print('    mean: ' + str(col.mean()))
    # plot every numeric column against every other numeric column
    pd.plotting.scatter_matrix(data)

expAnalysis(pd.read_csv('twitter_user_data_data.csv'), 10)


The dimensions are height: 1000 width: 10
The column names are: 
  handle
  name
  age
  num_of_tweets
  has_profile
  has_pic
  num_following
  num_of_favorites
  num_of_lists
  num_of_followers
Information about the columns
  handle is object
    unique values: 975
    missing values: 0
  name is object
    unique values: 959
    missing values: 0
  age is int64
    unique values: 753
    missing values: 0
    min: 5
    max: 2639
    mean: 1138.238
  num_of_tweets is int64
    unique values: 834
    missing values: 0
    min: 0
    max: 203191
    mean: 3765.549
  has_profile is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.75
  has_pic is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.999
  num_following is int64
    unique values: 674
    missing values: 0
    min: 0
    max: 35422
    mean: 818.344
  num_of_favorites is int64
    unique values: 234
    missing values: 0
    min: 0
    max: 112739
    mean: 565.785
  num_of_lists is int64
    unique values: 299
    missing values: 0
    min: 0
    max: 6412
    mean: 157.542
  num_of_followers is int64
    unique values: 829
    missing values: 0
    min: 21
    max: 211726
    mean: 4305.413
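
The per-column figures above can also be collected into a single table instead of being printed line by line. The following is a minimal sketch on made-up data; the helper name `summarize` is hypothetical, and it uses `nunique(dropna=False)` so that a missing value would count as one distinct value.

```python
import pandas as pd

def summarize(data):
    # One row per column: dtype, distinct-value count, missing-value count
    # (summarize is a hypothetical helper, not from the notebook)
    summary = pd.DataFrame({
        'dtype': data.dtypes.astype(str),
        'unique': data.nunique(dropna=False),
        'missing': data.isnull().sum(),
    })
    # min, max, and mean are computed for numeric columns only;
    # rows for non-numeric columns are left as NaN by the join
    numeric = data.select_dtypes(include='number')
    stats = numeric.agg(['min', 'max', 'mean']).T
    return summary.join(stats)

# Made-up sample data reusing three column names from the files
sample = pd.DataFrame({
    'handle': ['a', 'b', 'c', 'c'],
    'age': [5, 10, 15, 30],
    'has_profile': [0, 1, 1, 1],
})

result = summarize(sample)
print(result)
```

A table like this makes it easier to compare columns at a glance, at the cost of showing NaN in the numeric statistics for text columns such as handle.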

The next analysis is for the file twitter_user_data.csv


In [27]:
expAnalysis(pd.read_csv('twitter_user_data.csv'), 10)


The dimensions are height: 999 width: 10
The column names are: 
  handle
  name
  age
  num_of_tweets
  has_profile
  has_pic
  num_following
  num_of_favorites
  num_of_lists
  num_of_followers
Information about the columns
  handle is object
    unique values: 960
    missing values: 0
  name is object
    unique values: 930
    missing values: 0
  age is int64
    unique values: 661
    missing values: 0
    min: 7
    max: 2391
    mean: 1137.38838839
  num_of_tweets is int64
    unique values: 903
    missing values: 0
    min: 0
    max: 397248
    mean: 12848.3123123
  has_profile is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.746746746747
  has_pic is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.998998998999
  num_following is int64
    unique values: 658
    missing values: 0
    min: 0
    max: 61663
    mean: 957.238238238
  num_of_favorites is int64
    unique values: 247
    missing values: 0
    min: 0
    max: 36044
    mean: 231.362362362
  num_of_lists is int64
    unique values: 456
    missing values: 0
    min: 0
    max: 44092
    mean: 595.447447447
  num_of_followers is int64
    unique values: 921
    missing values: 0
    min: 12
    max: 7259168
    mean: 45344.4934935

The next analysis is for the file twitter_user_datascience_data.csv


In [28]:
expAnalysis(pd.read_csv('twitter_user_datascience_data.csv'), 10)


The dimensions are height: 1000 width: 10
The column names are: 
  handle
  name
  age
  num_of_tweets
  has_profile
  has_pic
  num_following
  num_of_favorites
  num_of_lists
  num_of_followers
Information about the columns
  handle is object
    unique values: 249
    missing values: 0
  name is object
    unique values: 218
    missing values: 0
  age is int64
    unique values: 176
    missing values: 0
    min: 2
    max: 2409
    mean: 276.198
  num_of_tweets is int64
    unique values: 118
    missing values: 0
    min: 0
    max: 29151
    mean: 235.865
  has_profile is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.095
  has_pic is int64
    unique values: 2
    missing values: 0
    min: 0
    max: 1
    mean: 0.138
  num_following is int64
    unique values: 130
    missing values: 0
    min: 0
    max: 6177
    mean: 61.866
  num_of_favorites is int64
    unique values: 58
    missing values: 0
    min: 0
    max: 20265
    mean: 37.584
  num_of_lists is int64
    unique values: 43
    missing values: 0
    min: 0
    max: 919
    mean: 3.934
  num_of_followers is int64
    unique values: 118
    missing values: 0
    min: 0
    max: 13621
    mean: 85.891

Next, split the data in twitter_user_data_data.csv into a training file and a test file.

The rows are shuffled with a fixed random seed; 60% of them go to the training file and the remaining 40% to the test file.


In [29]:
twitter_data = pd.read_csv('twitter_user_data_data.csv')

# seed NumPy's random number generator so the split is reproducible
npr.seed(32835)

nrows = twitter_data.shape[0]
rows = list(range(nrows))

# shuffle the row positions in place
npr.shuffle(rows)

# the first 60% of the shuffled positions are training rows, the rest are test rows
split_point = int(nrows * .60)

train_rows = rows[:split_point]
test_rows  = rows[split_point:]

# select the rows by position
training_data = twitter_data.iloc[train_rows, :]
test_data  = twitter_data.iloc[test_rows, :]

training_data.to_csv('twitter_user_data_data_training.csv', index=False)
test_data.to_csv('twitter_user_data_data_test.csv', index=False)
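
As an alternative sketch of the same 60/40 split, pandas can draw the training rows directly with `DataFrame.sample` and treat the remaining rows as the test set. This is not the method used above and will not select the same rows; the function name `split_frame` and the made-up frame are assumptions for illustration, and the approach assumes the index labels are unique.

```python
import pandas as pd

def split_frame(data, train_fraction=0.60, seed=32835):
    # Draw the training rows at random, without replacement
    # (split_frame is a hypothetical helper, not from the notebook)
    train = data.sample(frac=train_fraction, random_state=seed)
    # Every row whose index label was not drawn goes to the test set;
    # this relies on the index labels being unique
    test = data.drop(train.index)
    return train, test

# Made-up frame with ten rows
sample = pd.DataFrame({'num_of_tweets': range(10)})

train, test = split_frame(sample)
print(len(train), len(test))
```

Passing `random_state` plays the same role as seeding the generator in the cell above: rerunning the split yields the same two sets.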
