Opening files - currently only CSV is supported
Use the import * method for easier calling. (Sorry, classes are not done yet.)
MAXROWS(x) - how many rows to show (default = 15)
In [1]:
from sciblox import *
%matplotlib inline
maxrows(5)
from jupyterthemes import jtplot
jtplot.style()
In [2]:
x = read("train.csv")
read("train.csv")
Out[2]:
Describing and analysing your data:
In [3]:
analyse(x)
Out[3]:
You can also change axis to 1 (both ANALYSE and DESCRIBE work):
In [4]:
describe(x, axis = 1)
Out[4]:
You can output the analysis as a dataframe by setting colour = False:
In [5]:
analyse(x, colour = False)
Out[5]:
You can also check the data's Frequency Ratio and Variance Thresholds.
It will try to highlight outliers.
In [6]:
varcheck(x)
Out[6]:
You can specify thresholds:
In [7]:
varcheck(x, freq = "mean", unique = 0.01)
Out[7]:
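For readers new to these two statistics, here is a minimal plain-Python sketch of one common definition of each. It assumes the frequency ratio is the count of the most common value divided by the count of the second most common, and that the unique threshold refers to the share of distinct values among all rows. The helper names `freq_ratio` and `unique_share` are hypothetical, and this is not taken from sciblox itself.

```python
from collections import Counter

def freq_ratio(values):
    """Count of the most common value divided by the count of the runner-up."""
    counts = [c for _, c in Counter(values).most_common(2)]
    if len(counts) < 2:          # a constant column has no runner-up
        return float("inf")
    return counts[0] / counts[1]

def unique_share(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)

# Illustrative column: one value dominates, so the column carries little signal.
column = ["S"] * 18 + ["C"] * 2
print(freq_ratio(column))    # 9.0
print(unique_share(column))  # 0.1
```

A high frequency ratio together with a low unique share is the usual sign of a near-constant column.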
You can also output the correlation matrix:
In [8]:
corr(x)
Out[8]:
In [9]:
corr(x, table = True)
Out[9]:
You can also remove correlated columns:
In [10]:
remcor(x, threshold = 0.5)
Out[10]:
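The idea behind removing correlated columns can be sketched in plain Python. This is an assumption about the general technique, not sciblox's actual algorithm: walk through the columns in order and keep one only if its absolute Pearson correlation with every column kept so far is at or below the threshold. The names `pearson` and `drop_correlated` are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.5):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept

# Illustrative data: "b" is an exact multiple of "a", "c" is unrelated to both.
data = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, -1, 1]}
print(list(drop_correlated(data, threshold=0.5)))  # ['a', 'c']
```

Because the pass is greedy, the result depends on column order: the first column of a correlated pair is the one that survives.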
Plotting is easy. (Currently X, Y and Factor are supported)
In [11]:
plot(x = "Survived", y = "Fare", factor = "Embarked", data = x)
In [12]:
plot(x = "Fare", data = x)
In [13]:
plot(x = "Embarked", y = "Sex", data = x)
In [14]:
plot(x = "Age", y = "Parch", factor = "Fare", data = x)
In [15]:
plot(x = "Age", y = "Fare", factor = "Survived", data = x)
In [171]:
plot(x = "SibSp", y = "Embarked", factor = "Survived", data = x)
In [172]:
plot(x = "Fare", y = "Age", factor = "SibSp", data = x)
Use the FILLNA function to fill missing values (relies on the Fancy Impute package, sklearn and xgboost):
In [24]:
%%capture
knn = fillna(x)
In [25]:
knn
Out[25]:
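The result above is stored in a variable called knn, which suggests k-nearest-neighbour imputation, although the default method is not spelled out here. As a rough illustration of that technique only, the plain-Python sketch below fills each missing cell with the mean of that column over the k closest rows, measuring distance on the columns both rows have observed. The function name `knn_impute` and the sample table are hypothetical.

```python
from math import sqrt, isnan

def knn_impute(rows, k=2):
    """Fill NaNs with the mean of that column over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not isnan(value):
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or isnan(other[j]):
                    continue
                # Squared gaps on the columns both rows have observed.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if not isnan(a) and not isnan(b)]
                if shared:
                    # Average the gaps so rows sharing few columns are not favoured.
                    distance = sqrt(sum(shared) / len(shared))
                    candidates.append((distance, other[j]))
            nearest = sorted(candidates)[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

nan = float("nan")
table = [[1.0, 2.0, nan],
         [3.0, 4.0, 3.0],
         [nan, 6.0, 5.0],
         [8.0, 8.0, 7.0]]
print(knn_impute(table, k=2))
# [[1.0, 2.0, 4.0], [3.0, 4.0, 3.0], [5.5, 6.0, 5.0], [8.0, 8.0, 7.0]]
```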
You can also try the MICE / BPCA / SVD methods:
In [44]:
%%capture
svd = fillna(x, method = "svd")
bpca = fillna(x, method = "bpca")
mice = fillna(x, method = "mice", mice = "boost")
fillna(x, method = "mice", mice = "tree")
fillna(x, method = "mice", mice = "linear")
In [32]:
mice
Out[32]:
You can also get dummies:
In [33]:
to_cont(x)
Out[33]:
In [34]:
to_cont(x, dummies = False)
Out[34]:
In [40]:
codes, df = to_cont(x, dummies = False, class_max = "all", return_codes = True)
In [43]:
codes["Embarked"]
Out[43]:
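To make the difference between dummies and integer codes concrete, here is a small plain-Python sketch. It is an illustration of the two encodings in general, not of how to_cont is implemented; the helper names `integer_codes` and `dummies` and the sample labels are hypothetical.

```python
def integer_codes(values):
    """Map each distinct label to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes, [codes[v] for v in values]

def dummies(values):
    """One 0/1 indicator column per distinct label."""
    labels = sorted(set(values))
    return {label: [int(v == label) for v in values] for label in labels}

embarked = ["S", "C", "S", "Q"]        # illustrative labels
codes, encoded = integer_codes(embarked)
print(codes)              # {'S': 0, 'C': 1, 'Q': 2}
print(encoded)            # [0, 1, 0, 2]
print(dummies(embarked))  # {'C': [0, 1, 0, 0], 'Q': [0, 0, 0, 1], 'S': [1, 0, 1, 0]}
```

Integer codes keep a single column but imply an ordering between labels; dummies avoid that at the cost of one column per label.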
Getting strings is easy. Let's say we want to get the Mr/Mrs... honorifics.
Everything is applied sequentially.
In [54]:
maxrows(4)
get(x["Name"])
Out[54]:
In [70]:
get(x["Name"], split = ", ")
Out[70]:
Please use SPLIT1, SPLIT2, etc. (with LOC1, LOC2, etc.) when you have more than one SPLIT:
In [73]:
get(x["Name"], split = ", ", loc = 1, split1 = ". ", loc1 = 0, df = True)
Out[73]:
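The chained split in the call above can be pictured with ordinary string methods. The sketch below assumes each split/loc pair means "split on this separator, keep the piece at this position" and that pairs are applied one after another; `get_part` is a hypothetical helper and the names are illustrative strings in the "Surname, Title. Given names" layout.

```python
def get_part(text, steps):
    """Apply (separator, position) pairs one after another."""
    for separator, position in steps:
        text = text.split(separator)[position]
    return text

names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]  # illustrative
# Step 1: split on ", " and keep piece 1; step 2: split on ". " and keep piece 0.
titles = [get_part(n, [(", ", 1), (". ", 0)]) for n in names]
print(titles)  # ['Mr', 'Miss']
```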
You can also get word frequencies:
In [74]:
wordfreq(x)
In [75]:
wordfreq(x["Name"], first = 15)
In [76]:
wordfreq(x["Name"], first = 5, hist = False)
Out[76]:
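A word-frequency table like this can be approximated with the standard library. The sketch below is an assumption about the general approach (lower-case, split on whitespace, strip punctuation, count), not a description of wordfreq's internals; `word_counts` and the sample names are hypothetical.

```python
from collections import Counter

def word_counts(texts, first=5):
    """Count lower-cased, whitespace-separated words and return the top `first`."""
    words = [w.strip(".,()\"").lower() for t in texts for w in t.split()]
    return Counter(w for w in words if w).most_common(first)

names = ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry",
         "Heikkinen, Miss. Laina"]  # illustrative
print(word_counts(names, first=1))  # [('mr', 2)]
```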
You can also get new columns from wordfreq:
In [77]:
getwords(x, first = 5)
Out[77]:
You can also discretise columns:
In [79]:
discretise(x["Fare"], n = 5)
Out[79]:
In [82]:
discretise(x["Fare"], n = 10, codes = True, smooth = False)
Out[82]:
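Discretising turns a continuous column into a small number of bins. Whether discretise uses equal-width or quantile bins is not stated here, so the sketch below simply shows equal-width binning as one common option; `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n=5):
    """Return a bin index in 0..n-1 for each value, using n equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n
    if width == 0:                      # constant column: everything in bin 0
        return [0 for _ in values]
    # The maximum would land in bin n, so clamp it into the last bin.
    return [min(int((v - low) / width), n - 1) for v in values]

fares = [0.0, 5.0, 10.0, 20.0, 40.0, 50.0]   # illustrative values
print(equal_width_bins(fares, n=5))  # [0, 0, 1, 2, 4, 4]
```

With skewed data such as fares, equal-width bins can leave most rows in the first bin, which is the usual reason to prefer quantile bins instead.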
You can also flatten columns:
In [173]:
flatten(x["Name"], lower = False)[0:10]
Out[173]:
Getting columns and indexes is easy:
In [16]:
columns(x)
Out[16]:
In [17]:
conts(x)
Out[17]:
In [18]:
strs(x)
Out[18]:
In [19]:
index(x)[0:5]
Out[19]:
Getting uniques is easy:
In [93]:
unique(x)["Embarked"]
Out[93]:
In [95]:
cunique(x)["Embarked"]
Out[95]:
In [96]:
punique(x)
Out[96]:
In [134]:
nunique(x["Parch"])
Out[134]:
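The four helpers above are not described further. Reading their names as "distinct values", "counts per value", "percentage per value" and "number of distinct values" (an assumption, not confirmed by the source), the same quantities can be sketched for a single column with the standard library:

```python
from collections import Counter

values = ["S", "C", "S", "Q", "S"]            # illustrative column
counts = Counter(values)

unique_values = list(counts)                   # distinct values, first-seen order
unique_counts = dict(counts)                   # occurrences of each value
unique_fraction = {v: c / len(values) for v, c in counts.items()}
n_unique = len(counts)                         # how many distinct values

print(unique_values)    # ['S', 'C', 'Q']
print(unique_counts)    # {'S': 3, 'C': 1, 'Q': 1}
print(unique_fraction)  # {'S': 0.6, 'C': 0.2, 'Q': 0.2}
print(n_unique)         # 3
```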
You can sort a dataframe or any datatype:
In [97]:
sort(x, by = ["Name"])
Out[97]:
In [98]:
sort([1,2,3,4,1,2])
Out[98]:
You can also sort by frequency then length:
In [99]:
fsort(x, by = "Name")
Out[99]:
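Sorting by frequency and then by length amounts to sorting on a two-part key. The sketch below assumes "most frequent first, shorter strings first on ties"; the actual direction used by fsort is not stated here, and `sort_by_frequency_then_length` is a hypothetical name.

```python
from collections import Counter

def sort_by_frequency_then_length(values):
    """Most frequent values first; ties broken by shorter string first."""
    counts = Counter(values)
    return sorted(values, key=lambda v: (-counts[v], len(v)))

items = ["Miss", "Mr", "Master", "Mr", "Miss", "Mr"]   # illustrative
print(sort_by_frequency_then_length(items))
# ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master']
```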
Other methods:
In [103]:
tail(x)
head(x)
Out[103]:
In [105]:
random(x)
Out[105]:
In [106]:
shape(x)
Out[106]:
You can also subset NULL rows / not NULL:
In [109]:
isnull(x)
notnull(x, subset = "Fare")
Out[109]:
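As a rough picture of what subsetting on NULLs means, the sketch below filters a list of records in plain Python. It assumes isnull keeps rows with at least one missing value and notnull with a subset keeps rows where the named columns are present; `has_null` and the sample records are hypothetical.

```python
rows = [{"Name": "A", "Fare": 7.25, "Age": None},
        {"Name": "B", "Fare": None, "Age": 30},
        {"Name": "C", "Fare": 8.05, "Age": 22}]   # illustrative records

def has_null(row, subset=None):
    """True if any inspected field of the row is None."""
    keys = subset if subset is not None else row.keys()
    return any(row[k] is None for k in keys)

null_rows = [r for r in rows if has_null(r)]
fare_present = [r for r in rows if not has_null(r, subset=["Fare"])]
print([r["Name"] for r in null_rows])     # ['A', 'B']
print([r["Name"] for r in fare_present])  # ['A', 'C']
```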
Cleaning columns is easy:
In [110]:
x["Pclass"] = float(x["Pclass"])
In [111]:
x["Pclass"]
Out[111]:
In [114]:
clean(x["Pclass"])[0:10]
Out[114]:
Excluding columns, including columns is easy:
In [117]:
inc(x, "Name")
exc(x, "Name")
Out[117]:
Reversing columns, lists, dictionaries and booleans:
In [125]:
df = copy(x)
reverse(x["Name"])
Out[125]:
In [128]:
phone = {"Daniel":1234,"Michael":32432}
reverse(phone)
Out[128]:
In [131]:
(x["Survived"] == 0)
Out[131]:
In [132]:
reverse(x["Survived"] == 0)
Out[132]:
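"Reversing" means something different for each datatype. The plain-Python sketch below shows one plausible reading for each case, reusing the phone dictionary from above: swapping keys and values for a dictionary, flipping the order of a list, and negating each flag of a boolean mask. These readings are assumptions about what reverse does, and the mask is illustrative.

```python
phone = {"Daniel": 1234, "Michael": 32432}

# Dictionary: swap keys and values (only safe when the values are unique).
reversed_phone = {number: name for name, number in phone.items()}

# List: same items in the opposite order.
reversed_list = [1, 2, 3][::-1]

# Booleans: flip each flag, the element-wise equivalent of `not`.
survived_is_zero = [True, False, True]     # illustrative mask
flipped = [not flag for flag in survived_is_zero]

print(reversed_phone)  # {1234: 'Daniel', 32432: 'Michael'}
print(reversed_list)   # [3, 2, 1]
print(flipped)         # [False, True, False]
```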
Horizontal concat, Vertical concat:
In [138]:
df = x[conts(x)]
hcat(mean(df), median(df), iqr(df), var(df), std(df))
Out[138]:
In [142]:
df = x[strs(x)]
vcat(nunique(x), freqratio(x), count(x))
Out[142]:
Resetting indexes:
In [143]:
reset(x)
Out[143]:
Easy linear algebra:
In [148]:
C = array([1,2,3],[1,2,3])
A = matrix([1,2,3], [1,2,4], [5,3,2])
B = matrix("1 2 3\
7 673 2\
21321 22 3")
B
Out[148]:
In [149]:
T(B)
Out[149]:
In [157]:
tile(C,1,2)
Out[157]:
In [160]:
J(5)*Z(5)*I(5)
Out[160]:
In [161]:
qnorm(95)
Out[161]:
In [163]:
pnorm(1.65)
Out[163]:
In [166]:
CI(q = 95, data = x["Fare"])
Out[166]:
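The three calls above mirror standard normal-distribution operations. The sketch below reproduces them with Python's statistics module, assuming qnorm(95) means the 95% quantile, pnorm(1.65) the cumulative probability below 1.65, and CI a two-sided normal-approximation interval for the mean. Those readings are assumptions; `normal_ci` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

z = NormalDist()                        # standard normal: mean 0, sd 1

# Quantile: the point with 95% of the distribution below it.
print(round(z.inv_cdf(0.95), 3))        # 1.645
# Cumulative probability: the share of the distribution below 1.65.
print(round(z.cdf(1.65), 3))            # 0.951

def normal_ci(data, level=0.95):
    """Two-sided normal-approximation interval for the mean of `data`."""
    half_width = z.inv_cdf(0.5 + level / 2) * stdev(data) / sqrt(len(data))
    centre = mean(data)
    return centre - half_width, centre + half_width

fares = [7.25, 71.28, 7.93, 53.1, 8.05]   # illustrative values
low, high = normal_ci(fares)
print(low < mean(fares) < high)           # True
```

For small samples a t-based interval would be wider than this normal approximation.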
In [169]:
M(tr(A)*diag(A))
Out[169]: