SciBlox v0.01 Example Code - Titanic Dataset

1. Data Analysis

Opening files - currently only CSV is supported

Use the import * method for easier calling. (Sorry, classes are not done yet.)

MAXROWS(x) - sets how many rows to show (default = 15)



In [1]:

    
from sciblox import *
%matplotlib inline
maxrows(5)
from jupyterthemes import jtplot
jtplot.style()



In [2]:

    
x = read("train.csv")
read("train.csv")









    Out[2]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500    NaN         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833    C85         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0     370376    7.7500    NaN         Q
    
  

891 rows × 12 columns

Describing and analysing your data:



In [3]:

    
analyse(x)









    Out[3]:




  
 
     
         
              Type  %Missing  %Zeroes   Mean  Median   Range   IQR      Var                 Mode  FreqRatio  %Unique  No.Unique
Age          float        20        0   29.7      28   79.58  0.13   211.02                   24       1.11      0.1         88
Cabin          str        77        0    nan     nan     nan   nan      nan              B96 B98          1     0.16        147
Embarked       str         0        0    nan     nan     nan   nan      nan                    S       3.83        0          3
Fare         float         0        2   32.2   14.45  512.33     0  2469.44                 8.05       1.02     0.28        248
Name           str         0        0    nan     nan     nan   nan      nan  Abbing, Mr. Anthony          1        1        891
Parch          int         0       76   0.38       0       6     0     0.65                    0       5.75     0.01          7
PassengerId    int         0        0    446     446     890  4.45    66231                    1          1        1        891
Pclass         int         0        0   2.31       3       2     0      0.7                    3       2.27        0          3
Sex            str         0        0    nan     nan     nan   nan      nan                 male       1.84        0          2
SibSp          int         0       68   0.52       0       8     0     1.22                    0       2.91     0.01          7
Survived       int         0       62   0.38       0       1     0     0.24                    0       1.61        0          2
Ticket         str         0        0    nan     nan     nan   nan      nan                 1601          1     0.76        681
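
A rough plain-Python sketch of what the %Missing and %Zeroes columns appear to report: the share of entries that are missing or equal to zero, as a whole-number percentage of all rows. This is not SciBlox's own code; the helper names (is_missing, percent_missing, percent_zeroes) are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def percent_missing(values):
    """Missing entries as a whole-number percentage of all entries."""
    return round(100 * sum(is_missing(v) for v in values) / len(values))


def percent_zeroes(values):
    """Zero entries as a whole-number percentage of all entries."""
    zeroes = sum((not is_missing(v)) and v == 0 for v in values)
    return round(100 * zeroes / len(values))


# Toy column: 177 missing values out of 891 rows.
ages = [None] * 177 + [30.0] * 714
print(percent_missing(ages))
print(percent_zeroes([0, 0, 1, None]))
```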

You can also change axis to 1 (both ANALYSE and DESCRIBE work):



In [4]:

    
describe(x, axis = 1)









    Out[4]:




  
 
     
         
              Mean  Median   Range   IQR      Var                 Mode  FreqRatio  %Unique  No.Unique
Age           29.7      28   79.58  0.13   211.02                   24       1.11      0.1         88
Cabin          nan     nan     nan   nan      nan              B96 B98          1     0.16        147
Embarked       nan     nan     nan   nan      nan                    S       3.83        0          3
Fare          32.2   14.45  512.33     0  2469.44                 8.05       1.02     0.28        248
Name           nan     nan     nan   nan      nan  Abbing, Mr. Anthony          1        1        891
Parch         0.38       0       6     0     0.65                    0       5.75     0.01          7
PassengerId    446     446     890  4.45    66231                    1          1        1        891
Pclass        2.31       3       2     0      0.7                    3       2.27        0          3
Sex            nan     nan     nan   nan      nan                 male       1.84        0          2
SibSp         0.52       0       8     0     1.22                    0       2.91     0.01          7
Survived      0.38       0       1     0     0.24                    0       1.61        0          2
Ticket         nan     nan     nan   nan      nan                 1601          1     0.76        681

You can output the analysis as a plain dataframe:



In [5]:

    
analyse(x, colour = False)









    Out[5]:







  
    
      
              Type  %Missing  %Zeroes   Mean  Median   Range   IQR      Var     Mode  FreqRatio  %Unique  No.Unique
Age          float        20        0  29.70    28.0   79.58  0.13   211.02       24       1.11     0.10       88.0
Cabin          str        77        0    NaN     NaN     NaN   NaN      NaN  B96 B98       1.00     0.16      147.0
...            ...       ...      ...    ...     ...     ...   ...      ...      ...        ...      ...        ...
Survived       int         0       62   0.38     0.0    1.00  0.00     0.24        0       1.61     0.00        2.0
Ticket         str         0        0    NaN     NaN     NaN   NaN      NaN     1601       1.00     0.76      681.0
    
  

12 rows × 12 columns

You can also check the data's Frequency Ratio and Variance Thresholds.

It tries to highlight outliers.



In [6]:

    
varcheck(x)









    Out[6]:




  
 
     
         
             FreqRatio  %Unique      Var  VarGood?
Age               1.11    0.099  211.019      True
Cabin                1    0.165      nan       nan
Embarked          3.83    0.003      nan       nan
Fare              1.02    0.278  2469.44      True
Name                 1        1      nan       nan
Parch             5.75    0.008     0.65      True
PassengerId          1        1    66231      True
Pclass            2.27    0.003    0.699      True
Sex               1.84    0.002      nan       nan
SibSp             2.91    0.008    1.216      True
Survived          1.61    0.002    0.237      True
Ticket               1    0.764      nan       nan
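
A small sketch of how the FreqRatio and %Unique figures can be reproduced in plain Python. It assumes FreqRatio is the count of the most common value divided by the count of the second most common, and that %Unique is the number of distinct values divided by the number of rows. Both assumptions agree with the Embarked row (644 S, 168 C, 77 Q, 2 missing), but this is not SciBlox's own implementation and the function names are hypothetical.

```python
from collections import Counter


def freq_ratio(values):
    """Count of the most common value over the count of the runner-up."""
    top_two = Counter(v for v in values if v is not None).most_common(2)
    if len(top_two) < 2:
        return float("inf")  # only one distinct value: no runner-up
    return top_two[0][1] / top_two[1][1]


def fraction_unique(values):
    """Distinct non-missing values as a fraction of all rows."""
    distinct = {v for v in values if v is not None}
    return len(distinct) / len(values)


# Embarked counts as reported by cunique(x)["Embarked"], plus 2 missing rows.
embarked = ["S"] * 644 + ["C"] * 168 + ["Q"] * 77 + [None] * 2
print(round(freq_ratio(embarked), 2))
print(round(fraction_unique(embarked), 3))
```

A column dominated by one value gets a large FreqRatio, and a column with very few distinct values gets a %Unique near zero, which is presumably what the thresholds in varcheck test against.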

You can specify thresholds:



In [7]:

    
varcheck(x, freq = "mean", unique = 0.01)









    Out[7]:




  
 
     
         
             FreqRatio  %Unique      Var  VarGood?  FreqRatioGood?  %UniqueGood?  Good?
Age               1.11    0.099  211.019      True            True          True   True
Cabin                1    0.165      nan       nan            True          True   True
Embarked          3.83    0.003      nan       nan           False         False  False
Fare              1.02    0.278  2469.44      True            True          True   True
Name                 1        1      nan       nan            True          True   True
Parch             5.75    0.008     0.65      True           False         False  False
PassengerId          1        1    66231      True            True          True   True
Pclass            2.27    0.003    0.699      True            True         False  False
Sex               1.84    0.002      nan       nan            True         False  False
SibSp             2.91    0.008    1.216      True            True         False  False
Survived          1.61    0.002    0.237      True            True         False  False
Ticket               1    0.764      nan       nan            True          True   True

You can also output the correlation matrix:



In [8]:

    
corr(x)









    Out[8]:




  
 
     
         
             PassengerId  Survived  Pclass     Age   SibSp    Parch    Fare
PassengerId            1    -0.005  -0.035   0.037  -0.058  -0.0017   0.013
Survived          -0.005         1   -0.34  -0.077  -0.035    0.082    0.26
Pclass            -0.035     -0.34       1   -0.37   0.083    0.018   -0.55
Age                0.037    -0.077   -0.37       1   -0.31    -0.19   0.096
SibSp             -0.058    -0.035   0.083   -0.31       1     0.41    0.16
Parch            -0.0017     0.082   0.018   -0.19    0.41        1    0.22
Fare               0.013      0.26   -0.55   0.096    0.16     0.22       1



In [9]:

    
corr(x, table = True)









    Out[9]:







  
    
      
             PassengerId   Survived     Pclass        Age      SibSp      Parch       Fare
PassengerId     1.000000  -0.005007  -0.035144   0.036847  -0.057527  -0.001652   0.012658
Survived       -0.005007   1.000000  -0.338481  -0.077221  -0.035322   0.081629   0.257307
...                  ...        ...        ...        ...        ...        ...        ...
Parch          -0.001652   0.081629   0.018443  -0.189119   0.414838   1.000000   0.216225
Fare            0.012658   0.257307  -0.549500   0.096067   0.159651   0.216225   1.000000
    
  

7 rows × 7 columns
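
For reference, each entry of a correlation matrix like the one above can be computed from two columns as follows. This sketch assumes corr reports Pearson coefficients (the notebook does not say which coefficient is used) and skips pairs where either value is missing; the function name pearson is made up here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, ignoring None pairs."""
    pairs = [(a, b) for a, b in zip(xs, ys) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear toy data: one rising pair and one falling pair.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, None], [3, 2, 1, 5]))
```

A column always correlates perfectly with itself, which is why the diagonal of the matrix is 1, and the matrix is symmetric because pearson(a, b) equals pearson(b, a).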

You can also remove correlated columns:



In [10]:

    
remcor(x, threshold = 0.5)









    Out[10]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500    NaN         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833    C85         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0     370376    7.7500    NaN         Q
    
  

891 rows × 12 columns

2. Data Visualisations

Plotting is easy. (Currently X, Y and Factor are supported)



In [11]:

    
plot(x = "Survived", y = "Fare", factor = "Embarked", data = x)



In [12]:

    
plot(x = "Fare", data = x)



In [13]:

    
plot(x = "Embarked", y = "Sex", data = x)



In [14]:

    
plot(x = "Age", y = "Parch", factor = "Fare", data = x)



In [15]:

    
plot(x = "Age", y = "Fare", factor = "Survived", data = x)









    








In [171]:

    
plot(x = "SibSp", y = "Embarked", factor = "Survived", data = x)



In [172]:

    
plot(x = "Fare", y = "Age", factor = "SibSp", data = x)









    






3. Data Cleaning

Use the FILLNA function (uses the Fancy Impute package, sklearn and xgboost):



In [24]:

    
%%capture
knn = fillna(x)



In [25]:

    
knn









    Out[25]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare         Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500  Missing_Data         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833           C85         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...           ...       ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000          C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0     370376    7.7500  Missing_Data         Q
    
  

891 rows × 12 columns

You can also try the MICE / BPCA / SVD methods:



In [44]:

    
%%capture
svd = fillna(x, method = "svd")
bpca = fillna(x, method = "bpca")
mice = fillna(x, method = "mice", mice = "boost")
fillna(x, method = "mice", mice = "tree")
fillna(x, method = "mice", mice = "linear")



In [32]:

    
mice









    Out[32]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare         Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500  Missing_Data         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833           C85         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...           ...       ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000          C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0     370376    7.7500  Missing_Data         Q
    
  

891 rows × 12 columns

You can also get dummy variables:



In [33]:

    
to_cont(x)









    Out[33]:







  
    
      
      Age  Age_nan  Cabin_nan  Embarked_C  Embarked_Q  Embarked_S     Fare  Parch  PassengerId  Pclass  Sex_female  Sex_male  SibSp  Survived
0    22.0        0          1         0.0         0.0         1.0   7.2500      0            1       3           0         1      1         0
1    38.0        0          0         1.0         0.0         0.0  71.2833      0            2       1           1         0      1         1
..    ...      ...        ...         ...         ...         ...      ...    ...          ...     ...         ...       ...    ...       ...
889  26.0        0          0         1.0         0.0         0.0  30.0000      0          890       1           0         1      0         1
890  32.0        0          1         0.0         1.0         0.0   7.7500      0          891       3           0         1      0         0
    
  

891 rows × 14 columns
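
The Embarked_C / Embarked_Q / Embarked_S and Sex_female / Sex_male columns above are dummy (one-hot) columns: one 0/1 column per category. A minimal plain-Python sketch of the idea, with a hypothetical helper name and toy data rather than SciBlox's actual code:

```python
def one_hot(values, prefix):
    """Return one 0/1 list per category, keyed as '<prefix>_<category>'."""
    categories = sorted({v for v in values if v is not None})
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Toy column with one missing entry; the missing row gets 0 in every column.
embarked = ["S", "C", "Q", None, "S"]
dummies = one_hot(embarked, "Embarked")
for name, column in dummies.items():
    print(name, column)
```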



In [34]:

    
to_cont(x, dummies = False)









    Out[34]:







  
    
      
      Age  Age_nan  Cabin_nan  Embarked     Fare  Parch  PassengerId  Pclass  Sex  SibSp  Survived
0    22.0        0          1       2.0   7.2500      0            1       3    1      1         0
1    38.0        0          0       1.0  71.2833      0            2       1    0      1         1
..    ...      ...        ...       ...      ...    ...          ...     ...  ...    ...       ...
889  26.0        0          0       1.0  30.0000      0          890       1    1      0         1
890  32.0        0          1       0.0   7.7500      0          891       3    1      0         0
    
  

891 rows × 11 columns



In [40]:

    
codes, df = to_cont(x, dummies = False, class_max = "all", return_codes = True)



In [43]:

    
codes["Embarked"]









    Out[43]:





{'C': 1, 'Q': 0, 'S': 2}
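
With dummies = False each category is replaced by an integer code, and the returned dictionary records which code belongs to which category. The sketch below shows the general idea in plain Python. It numbers categories in sorted order, which is an assumption made here for simplicity; the codes SciBlox printed above ({'C': 1, 'Q': 0, 'S': 2}) evidently follow a different ordering.

```python
def label_codes(values):
    """Map each distinct non-missing category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    return {cat: code for code, cat in enumerate(categories)}


def encode(values, codes):
    """Replace categories by their codes; missing values stay None."""
    return [codes.get(v) for v in values]


embarked = ["S", "C", "Q", None]
codes = label_codes(embarked)
encoded = encode(embarked, codes)

# Inverting the dictionary lets you translate codes back into categories.
decode = {code: cat for cat, code in codes.items()}
print(codes, encoded, [decode.get(c) for c in encoded])
```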

4. Data Mining

Getting strings is easy. Let's say we want to get the Mr / Mrs honorifics.

The splits are applied sequentially, in the order given.



In [54]:

    
maxrows(4)
get(x["Name"])









    Out[54]:





0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
                             ...                        
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object



In [70]:

    
get(x["Name"], split = ", ")









    Out[70]:





0                              [Braund, Mr. Owen Harris]
1      [Cumings, Mrs. John Bradley (Florence Briggs T...
                             ...                        
889                              [Behr, Mr. Karl Howell]
890                                [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object

Use SPLIT1, SPLIT2, etc. when you need more than one split:



In [73]:

    
get(x["Name"], split = ", ", loc = 1, split1 = ". ", loc1 = 0, df = True)









    Out[73]:







  
    
      
       0
0     Mr
1    Mrs
..   ...
889   Mr
890   Mr
    
  

891 rows × 1 columns
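
In plain Python, the chain split = ", ", loc = 1, split1 = ". ", loc1 = 0 corresponds to two ordinary string splits applied one after the other. The helper below is only an illustration of that reading (the name honorific is made up), not the code behind get:

```python
def honorific(name):
    """Take the part after ', ' and then the part before '. '."""
    after_comma = name.split(", ")[1]   # split = ", ",  loc = 1
    return after_comma.split(". ")[0]   # split1 = ". ", loc1 = 0


names = [
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Dooley, Mr. Patrick",
]
print([honorific(n) for n in names])
```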

You can also get word frequencies



In [74]:

    
wordfreq(x)



In [75]:

    
wordfreq(x["Name"], first = 15)



In [76]:

    
wordfreq(x["Name"], first = 5, hist = False)









    Out[76]:







  
    
      
       Word  Count
0        mr    521
1      miss    182
..      ...    ...
3   william     64
4      john     44
    
  

5 rows × 2 columns
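
A word-frequency table like this can be sketched with collections.Counter. The tokenisation here (lowercase, letters only) is an assumption chosen to match the lowercase words in the output above; SciBlox's own rules may differ, and the function name is hypothetical.

```python
import re
from collections import Counter


def word_frequencies(texts, first=5):
    """Return the `first` most common lowercase words with their counts."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(words).most_common(first)


names = [
    "Braund, Mr. Owen Harris",
    "Behr, Mr. Karl Howell",
    "Dooley, Mr. Patrick",
    "Allen, Mr. William Henry",
]
print(word_frequencies(names, first=1))
```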

You can also get new columns from wordfreq



In [77]:

    
getwords(x, first = 5)









    Out[77]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked  Count=mr  Count=male  Count=pc  Count=f  Count=s
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500    NaN         S         1           1         0      NaN      1.0
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833    C85         C         1           1         1      0.0      0.0
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...       ...         ...       ...      ...      ...
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000   C148         C         1           1         0      0.0      0.0
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0     370376    7.7500    NaN         Q         1           1         0      NaN      0.0
    
  

891 rows × 17 columns

You can also discretise columns:



In [79]:

    
discretise(x["Fare"], n = 5)









    Out[79]:





0        (-0.001, 7.854]
1      (39.688, 512.329]
             ...        
889     (21.679, 39.688]
890      (-0.001, 7.854]
Name: Fare, Length: 891, dtype: category
Categories (5, interval[float64]): [(-0.001, 7.854] < (7.854, 10.5] < (10.5, 21.679] < (21.679, 39.688] < (39.688, 512.329]]



In [82]:

    
discretise(x["Fare"], n = 10, codes = True, smooth = False)









    Out[82]:





0      0
1      1
      ..
889    0
890    0
Name: Fare, Length: 891, dtype: int64
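
The interval edges in Out[79] look like quantile cut points, so each bin holds roughly a fifth of the fares. Assuming that is what the default mode does, equal-frequency binning can be sketched with the standard library as below. The function name and toy data are invented for illustration, and nothing is assumed here about what smooth = False changes.

```python
import statistics
from bisect import bisect_left


def quantile_codes(values, n):
    """Assign each value an integer bin code 0..n-1 using quantile cut points."""
    # n - 1 interior cut points; "inclusive" treats the data as the full population.
    cuts = statistics.quantiles(values, n=n, method="inclusive")
    # bisect_left puts a value equal to a cut point into the lower bin,
    # i.e. the bins are closed on the right, like (7.854, 10.5].
    return [bisect_left(cuts, v) for v in values]


fares = list(range(1, 11))
print(quantile_codes(fares, 5))
```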

You can also flatten columns:



In [173]:

    
flatten(x["Name"], lower = False)[0:10]









    Out[173]:





['Braund',
 'Mr',
 'Owen',
 'Harris',
 'Cumings',
 'Mrs',
 'John',
 'Bradley',
 'Florence',
 'Briggs']

5. Data Descriptions

Getting columns and indexes is easy:



In [16]:

    
columns(x)









    Out[16]:





['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']



In [17]:

    
conts(x)









    Out[17]:





['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']



In [18]:

    
strs(x)









    Out[18]:





['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']



In [19]:

    
index(x)[0:5]









    Out[19]:





[0, 1, 2, 3, 4]

Getting uniques is easy:



In [93]:

    
unique(x)["Embarked"]









    Out[93]:





['S', 'C', 'Q', nan]



In [95]:

    
cunique(x)["Embarked"]









    Out[95]:





S    644
C    168
Q     77
Name: Embarked, dtype: int64



In [96]:

    
punique(x)









    Out[96]:





PassengerId    1.000
Survived       0.002
               ...  
Cabin          0.165
Embarked       0.003
Length: 12, dtype: float64



In [134]:

    
nunique(x["Parch"])









    Out[134]:





7

You can sort a dataframe or any datatype:



In [97]:

    
sort(x, by = ["Name"])









    Out[97]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
845          846         0       3                                Abbing, Mr. Anthony    male  42.0      0      0  C.A. 5547      7.55    NaN         S
746          747         0       3                        Abbott, Mr. Rossmore Edward    male  16.0      1      1  C.A. 2673     20.25    NaN         S
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
153          154         0       3                    van Billiard, Mr. Austin Blyler    male  40.5      0      2   A/5. 851     14.50    NaN         S
868          869         0       3                        van Melkebeke, Mr. Philemon    male   NaN      0      0     345777      9.50    NaN         S
    
  

891 rows × 12 columns



In [98]:

    
sort([1,2,3,4,1,2])









    Out[98]:





[1, 1, 2, 2, 3, 4]

You can also sort by frequency then length:



In [99]:

    
fsort(x, by = "Name")









    Out[99]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
692          693         1       3                                       Lam, Mr. Ali    male   NaN      0      0       1601   56.4958    NaN         S
826          827         0       3                                       Lam, Mr. Len    male   NaN      0      0       1601   56.4958    NaN         S
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
427          428         1       2  Phillips, Miss. Kate Florence ("Mrs Kate Louis...  female  19.0      0      0     250655   26.0000    NaN         S
307          308         1       1  Penasco y Castellana, Mrs. Victor de Satode (M...  female  17.0      1      0   PC 17758  108.9000    C65         C
    
  

891 rows × 12 columns

Other methods:



In [103]:

    
tail(x)
head(x)









    Out[103]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171    7.2500    NaN         S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833    C85         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0     113803   53.1000   C123         S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0     373450    8.0500    NaN         S
    
  

5 rows × 12 columns



In [105]:

    
random(x)









    Out[105]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
428          429         0       3                                   Flynn, Mr. James    male   NaN      0      0     364851    7.7500    NaN         Q
285          286         0       3                                Stankovic, Mr. Ivan    male  33.0      0      0     349239    8.6625    NaN         C
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
395          396         0       3                                Johansson, Mr. Erik    male  22.0      0      0     350052    7.7958    NaN         S
882          883         0       3                       Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0       7552   10.5167    NaN         S
    
  

5 rows × 12 columns



In [106]:

    
shape(x)









    Out[106]:





(891, 12)

You can also subset the rows that are NULL / not NULL:



In [109]:

    
isnull(x)
notnull(x, subset = "Fare")









    Out[109]:







  
    
      
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket      Fare  Cabin  Embarked
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599   71.2833    C85         C
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0     113803   53.1000   C123         S
..           ...       ...     ...                                                ...     ...   ...    ...    ...        ...       ...    ...       ...
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0     112053   30.0000    B42         S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0     111369   30.0000   C148         C
    
  

183 rows × 12 columns

Cleaning columns is easy:



In [110]:

    
x["Pclass"] = float(x["Pclass"])



In [111]:

    
x["Pclass"]









    Out[111]:





0      3.0
1      1.0
      ... 
889    1.0
890    3.0
Name: Pclass, Length: 891, dtype: float64



In [114]:

    
clean(x["Pclass"])[0:10]









    Out[114]:





array([3, 1, 3, 1, 3, 3, 1, 3, 3, 2], dtype=int64)

6. Data Wrangling

Including and excluding columns is easy:



In [117]:

    
inc(x, "Name")
exc(x, "Name")









    Out[117]:







  
    
      
      Age  Cabin  Embarked     Fare  Parch  PassengerId  Pclass     Sex  SibSp  Survived     Ticket
0    22.0    NaN         S   7.2500      0            1     3.0    male      1         0  A/5 21171
1    38.0    C85         C  71.2833      0            2     1.0  female      1         1   PC 17599
..    ...    ...       ...      ...    ...          ...     ...     ...    ...       ...        ...
889  26.0   C148         C  30.0000      0          890     1.0    male      0         1     111369
890  32.0    NaN         Q   7.7500      0          891     3.0    male      0         0     370376
    
  

891 rows × 11 columns

Reversing columns, lists, dictionaries and booleans:



In [125]:

    
df = copy(x)
reverse(x["Name"])









    Out[125]:





890                                  Dooley, Mr. Patrick
889                                Behr, Mr. Karl Howell
                             ...                        
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
0                                Braund, Mr. Owen Harris
Name: Name, Length: 891, dtype: object



In [128]:

    
phone = {"Daniel":1234,"Michael":32432}
reverse(phone)









    Out[128]:





{1234: 'Daniel', 32432: 'Michael'}
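
Reversing a dictionary means swapping keys and values. A plain-Python sketch of that step is below. The uniqueness check is an addition made here for safety, since duplicate values would silently overwrite each other; it is not known whether SciBlox's reverse does the same.

```python
def reverse_dict(mapping):
    """Swap keys and values; refuse if two keys share the same value."""
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("values must be unique to reverse a dictionary")
    return {value: key for key, value in mapping.items()}


phone = {"Daniel": 1234, "Michael": 32432}
print(reverse_dict(phone))
```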



In [131]:

    
(x["Survived"] == 0)









    Out[131]:





0       True
1      False
       ...  
889    False
890     True
Name: Survived, Length: 891, dtype: bool



In [132]:

    
reverse(x["Survived"] == 0)









    Out[132]:





0      False
1       True
       ...  
889     True
890    False
Name: Survived, Length: 891, dtype: bool

Horizontal concat and vertical concat:



In [138]:

    
df = x[conts(x)]
hcat(mean(df), median(df), iqr(df), var(df), std(df))









    Out[138]:







  
    
      
                      0         0         0             0           0
PassengerId  446.000000  446.0000  445.0000  66231.000000  257.353842
Survived       0.383838    0.0000    1.0000      0.236772    0.486592
...                 ...       ...       ...           ...         ...
Parch          0.381594    0.0000    0.0000      0.649728    0.806057
Fare          32.204208   14.4542   23.0896   2469.436846   49.693429
    
  

7 rows × 5 columns



In [142]:

    
df = x[strs(x)]
vcat(nunqiue(x),freqratio(x),count(x))









    Out[142]:







  
    
      
                 0
PassengerId  891.0
Survived       2.0
...            ...
Cabin        204.0
Embarked     889.0

36 rows × 1 columns
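
The difference between the two layouts can be shown without any DataFrame at all. This is a rough sketch under stated assumptions: the column values are toy numbers (not the Titanic data), and summarise, hcat_sketch and vcat_sketch are hypothetical helpers, not SciBlox functions. Side-by-side concatenation keeps one row per column name, while stacking gives one row per (statistic, column) pair, which is why Out[142] has 36 rows (12 columns times 3 statistics).

```python
from statistics import mean, median

# Toy columns for illustration only.
columns = {"a": [1.0, 2.0, 3.0, 6.0], "b": [10.0, 20.0, 30.0, 40.0]}


def summarise(cols, func):
    """Apply func to every column, giving {column name: statistic}."""
    return {name: func(values) for name, values in cols.items()}


def hcat_sketch(*summaries):
    """Side by side: one row per column name, one entry per summary."""
    names = summaries[0].keys()
    return {name: [summary[name] for summary in summaries] for name in names}


def vcat_sketch(*summaries):
    """Stacked: every row of the first summary, then every row of the next."""
    return [(name, value) for summary in summaries for name, value in summary.items()]


means = summarise(columns, mean)
medians = summarise(columns, median)
wide = hcat_sketch(means, medians)   # 2 rows x 2 statistics
tall = vcat_sketch(means, medians)   # 4 rows x 1 value
```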

Resetting indexes (the old index is kept as an extra "index" column):



In [143]:

    
reset(x)









    Out[143]:







  
    
      
     index  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
0    0      1            0         3.0     Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171  7.2500   NaN    S
1    1      2            1         1.0     Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599   71.2833  C85    C
...  ...    ...          ...       ...     ...                                                ...     ...   ...    ...    ...        ...      ...    ...
889  889    890          1         1.0     Behr, Mr. Karl Howell                              male    26.0  0      0      111369     30.0000  C148   C
890  890    891          0         3.0     Dooley, Mr. Patrick                                male    32.0  0      0      370376     7.7500   NaN    Q

891 rows × 13 columns
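
Out[143] has 13 columns rather than 12 because the old row labels survive as the new "index" column. The idea is easier to see on rows whose labels are out of order, such as the reversed names from Out[125]. A minimal sketch follows; reset_sketch is a hypothetical helper and the row layout is an assumption made for illustration, not how SciBlox stores data.

```python
# (old index label, row) pairs, e.g. the first two rows after reversing.
reversed_rows = [
    (890, {"Name": "Dooley, Mr. Patrick"}),
    (889, {"Name": "Behr, Mr. Karl Howell"}),
]


def reset_sketch(indexed_rows):
    """Renumber rows from 0 and keep each old label in an 'index' field."""
    return [
        (new_label, {"index": old_label, **row})
        for new_label, (old_label, row) in enumerate(indexed_rows)
    ]


renumbered = reset_sketch(reversed_rows)
```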

7. Mathematics and Statistics

Easy linear algebra:



In [148]:

    
C = array([1,2,3],[1,2,3])
A = matrix([1,2,3], [1,2,4], [5,3,2])
B = matrix("1 2 3\
            7 673 2\
            21321 22 3")
B









    Out[148]:





matrix([[    1,     2,     3],
        [    7,   673,     2],
        [21321,    22,     3]], dtype=int64)



In [149]:

    
T(B)









    Out[149]:





matrix([[    1,     7, 21321],
        [    2,   673,    22],
        [    3,     2,     3]], dtype=int64)



In [157]:

    
tile(C,1,2)









    Out[157]:





array([[1, 2, 3, 1, 2, 3],
       [1, 2, 3, 1, 2, 3]])
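
For readers who want to check Out[149] and Out[157] by hand, the same results can be reproduced with nested lists. This is a sketch only: transpose and tile_sketch are hypothetical names, and reading tile(C,1,2) as "repeat once down, twice across" is an assumption based on the printed output.

```python
B = [[1, 2, 3],
     [7, 673, 2],
     [21321, 22, 3]]
C = [[1, 2, 3],
     [1, 2, 3]]


def transpose(rows):
    """Turn rows into columns."""
    return [list(column) for column in zip(*rows)]


def tile_sketch(rows, down, across):
    """Repeat the whole block `down` times vertically and `across` times horizontally."""
    return [row * across for _ in range(down) for row in rows]


B_transposed = transpose(B)
C_tiled = tile_sketch(C, 1, 2)
```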



In [160]:

    
J(5)*Z(5)*I(5)









    Out[160]:





array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
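
The all-zero result in Out[160] is what you would expect if J(n), Z(n) and I(n) build the n×n all-ones, all-zeros and identity matrices. That reading is an assumption based on the conventional names and the printed output, not on the library source. The sketch below uses an element-wise product (a matrix product with the zero matrix would also give all zeros); ones, zeros, identity and elementwise are hypothetical helpers.

```python
def ones(n):
    return [[1.0] * n for _ in range(n)]


def zeros(n):
    return [[0.0] * n for _ in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def elementwise(a, b):
    """Multiply two equally sized matrices entry by entry."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Anything multiplied entry-wise by the zero matrix is zero.
product = elementwise(elementwise(ones(5), zeros(5)), identity(5))
```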



In [161]:

    
qnorm(95)









    Out[161]:





1.6448536269514722



In [163]:

    
pnorm(1.65)









    Out[163]:





0.50658224895572213
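
A quick cross-check with Python's standard library helps when reading these two outputs. The sketch below uses statistics.NormalDist, which is not part of SciBlox. The quantile at 0.95 agrees with Out[161]. The value in Out[163], however, agrees with the standard normal CDF evaluated at 0.0165 rather than at 1.65, which would be consistent with pnorm rescaling its argument by 100 in the same way qnorm(95) takes a percentage. That is an inference from the printed number, not a statement about the library's code.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Quantile for a 95% lower-tail probability (compare with Out[161]).
z_95 = standard_normal.inv_cdf(0.95)

# Lower-tail probability at z = 1.65 on the usual scale.
p_at_1_65 = standard_normal.cdf(1.65)

# The same CDF evaluated at 1.65 / 100 (compare with Out[163]).
p_at_0_0165 = standard_normal.cdf(1.65 / 100)
```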



In [166]:

    
CI(q = 95, data = x["Fare"])









    Out[166]:





(28.941274632718805, 35.467141304430399)
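
The interval in Out[166] is close to what the normal approximation gives: mean ± z × std / sqrt(n). The sketch below plugs in the Fare mean and standard deviation printed in Out[138] and n = 891. normal_ci is a hypothetical helper, and whether CI() uses exactly this formula is an assumption; the numbers simply agree to about three decimal places.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean, std, n, q=95):
    """Two-sided q% interval for the mean, using the normal approximation."""
    tail = (1 - q / 100) / 2
    z = NormalDist().inv_cdf(1 - tail)
    half_width = z * std / sqrt(n)
    return mean - half_width, mean + half_width


# Mean and standard deviation of Fare as printed in Out[138]; n = 891 rows.
lower, upper = normal_ci(mean=32.204208, std=49.693429, n=891)
```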



In [169]:

    
M(tr(A)*diag(A))









    Out[169]:





matrix([[ 5, 10, 10]])
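
Out[169] can be verified by hand: the trace of A is 1 + 2 + 2 = 5 and its diagonal is (1, 2, 2), so the product is (5, 10, 10). A minimal plain-Python sketch follows; diag and trace here are local helpers written for illustration, not the SciBlox functions.

```python
A = [[1, 2, 3],
     [1, 2, 4],
     [5, 3, 2]]


def diag(rows):
    """Main diagonal of a square matrix."""
    return [rows[i][i] for i in range(len(rows))]


def trace(rows):
    """Sum of the main diagonal."""
    return sum(diag(rows))


# Scalar trace times the diagonal vector.
scaled_diagonal = [trace(A) * entry for entry in diag(A)]
```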
