This notebook tests the following Python scripts:


In [1]:
!ls *.py


file_fragmenter.py file_fuser.py

About These Files

These scripts are built on reusable objects. The object code is contained within the scripts, but it can also be deployed separately on your systems so that updates and enhancements flow into the final scripts you build. If you analyze the code, you will notice that abstract objects have child objects that do the actual work. These objects contain a seemingly useless function that does nothing. It is there as a placeholder.

The code as written will split or fuse files. The placeholder functions can be modified in the child objects so that you can perform operations on the inputs before the outputs are created. The intent is inherently extensible code: the coder focuses on what needs to be done to the files and lets the underlying objects handle splitting / fusing as designed.
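
The placeholder idea can be sketched as follows. This is a minimal illustration and not the code from file_fragmenter.py: the class names, the `transform` method, and the `split` method are all hypothetical names chosen for this example.

```python
class FileSplitterBase:
    """Hypothetical base object that owns the splitting logic."""

    def __init__(self, num_rows=10000):
        self.num_rows = num_rows

    def transform(self, rows):
        # Placeholder hook: returns the rows unchanged by default.
        # Child objects override this to process inputs before output.
        return rows

    def split(self, rows):
        # Apply the hook first, then cut the rows into fixed-size chunks.
        rows = self.transform(list(rows))
        return [rows[i:i + self.num_rows]
                for i in range(0, len(rows), self.num_rows)]


class UppercaseSplitter(FileSplitterBase):
    """Hypothetical child object: only the hook is customized."""

    def transform(self, rows):
        return [[cell.upper() for cell in row] for row in rows]


rows = [["a"], ["b"], ["c"]]
plain_chunks = FileSplitterBase(num_rows=2).split(rows)
upper_chunks = UppercaseSplitter(num_rows=2).split(rows)
print(plain_chunks)
print(upper_chunks)
```

The child object never touches the chunking code. It only states what should happen to the rows, which is the division of labor described above.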


In [2]:
!ls Test*.csv


TestFileLarge.csv             TestFile_SingleCol-NoIndx.csv
TestFileSmall-NoIndx.csv      TestFile_SingleCol-wIndx.csv
TestFileSmall-wIndx.csv

Code Testing

In the cells that follow:

  • Python scripts are tested from the command line on the csv files listed above
  • this provides a demo of the command line output messages the scripts produce
  • and shows how to use the arguments for the scripts
  • code is written and tested on the assumption that input file(s) will be csv, either with no index or with a simple index of one column at column zero (column 0 has unique values and no header in the csv)
  • Note the following about the arguments that tell the code whether input/output will have an index:
    • if index related arguments have any value, the code evaluates them to a boolean value of True
    • if index related arguments are left off, they default to a boolean value of False
    • expected usage for the index related arguments is to pass them True or leave them off (do not set them to False)
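
One plausible explanation for the index argument behavior, offered as an assumption rather than a reading of the actual scripts: if the flags are declared with argparse using `type=bool` and `default=False`, any non-empty string (including the text "False") converts to True. The sketch below reproduces that effect with a stand-in parser.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumption: the real scripts declare their index flags roughly like this.
parser.add_argument("--input_index", "-idx", type=bool, default=False)

# Flag left off: the default applies.
omitted = parser.parse_args([]).input_index

# Flag given a value: bool() of any non-empty string is True.
passed_true = parser.parse_args(["-idx", "True"]).input_index
passed_false = parser.parse_args(["-idx", "False"]).input_index

print(omitted, passed_true, passed_false)
```

Under this assumption, passing `-idx False` would switch the index handling on, which is why the arguments should simply be left off when no index is wanted.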

Getting Help:


In [3]:
!python file_fragmenter.py -h


usage: file_fragmenter.py [-h] [--input INPUT] [--input_index INPUT_INDEX]
                          [--numRows NUMROWS] [--outputStart OUTPUTSTART]
                          [--output_index OUTPUT_INDEX]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        intput file
  --input_index INPUT_INDEX, -idx INPUT_INDEX
                        enter True if inputs have an index / Omit this
                        argument if they don't
  --numRows NUMROWS, -rw NUMROWS
                        number rows in each output file after splitting the
                        intput file
  --outputStart OUTPUTSTART, -o OUTPUTSTART
                        output file name starts with this string
  --output_index OUTPUT_INDEX, -odx OUTPUT_INDEX
                        enter true to output an index / Omit this argument to
                        leave off the index

In [4]:
!python file_fuser.py --help


usage: file_fuser.py [-h] [--inputStart INPUTSTART]
                     [--input_index INPUT_INDEX] [--output OUTPUT]
                     [--output_index OUTPUT_INDEX] [--dir DIR]

optional arguments:
  -h, --help            show this help message and exit
  --inputStart INPUTSTART, -i INPUTSTART
                        intput file name starting pattern (only files that
                        match will be fused)
  --input_index INPUT_INDEX, -idx INPUT_INDEX
                        enter True if input has an index / Omit this argument
                        if it doesn't
  --output OUTPUT, -o OUTPUT
                        Name of final output file
  --output_index OUTPUT_INDEX, -odx OUTPUT_INDEX
                        enter true to output an index / Omit this argument to
                        leave off the index
  --dir DIR, -d DIR     input files directory

Tests Using A Simple Index In The Input

Note: the --numRows (-rw) argument defaults to 10,000 rows if it is left off. In these tests it is set to unusual values just to illustrate how the code works.


In [6]:
!python file_fragmenter.py -i TestFileSmall-wIndx.csv -o TestFSwi_i -rw 13 -idx True -odx True


Input files to split:  TestFileSmall-wIndx.csv
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> True
Your file has  110  rows excluding the header row.
The DF Index will be output giving each data row an index field value.
To exclude the index on output, do not set the --output_index (-odx) argument (recommended if files will be imported to R).
Splitting source file into  9  smaller files.
Please wait ...
Building File  1 :  TestFSwi_i1.csv  with row count of:  13
Building File  2 :  TestFSwi_i2.csv  with row count of:  13
Building File  3 :  TestFSwi_i3.csv  with row count of:  13
Building File  4 :  TestFSwi_i4.csv  with row count of:  13
Building File  5 :  TestFSwi_i5.csv  with row count of:  13
Building File  6 :  TestFSwi_i6.csv  with row count of:  13
Building File  7 :  TestFSwi_i7.csv  with row count of:  13
Building File  8 :  TestFSwi_i8.csv  with row count of:  13
Building File  9 :  TestFSwi_i9.csv  with row count of:  6
Files Ready.
This code was written in Python  3.6.1
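
The file counts and row counts reported in the output above follow from simple division: every file receives numRows rows, and the last file receives whatever remains. The helper below is a hypothetical sketch of that arithmetic, not the fragmenter's own code.

```python
def fragment_sizes(total_rows, num_rows=10000):
    """Return the row count of each output file, in order."""
    full_files, remainder = divmod(total_rows, num_rows)
    sizes = [num_rows] * full_files
    if remainder:
        # A final, shorter file holds the leftover rows.
        sizes.append(remainder)
    return sizes


sizes = fragment_sizes(110, 13)
print(len(sizes), sizes)
```

For 110 rows and `-rw 13` this gives eight files of 13 rows and a ninth file of 6 rows, matching the messages printed by the script.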

In [7]:
!python file_fuser.py -i TestFSwi_i -o TestFSwi_i_Out.csv -idx True -odx True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> True
Adding This File To Output:  TestFSwi_i1.csv
Adding This File To Output:  TestFSwi_i2.csv
Adding This File To Output:  TestFSwi_i3.csv
Adding This File To Output:  TestFSwi_i4.csv
Adding This File To Output:  TestFSwi_i5.csv
Adding This File To Output:  TestFSwi_i6.csv
Adding This File To Output:  TestFSwi_i7.csv
Adding This File To Output:  TestFSwi_i8.csv
Adding This File To Output:  TestFSwi_i9.csv
The DF Index will be output giving each data row an index field value.
Output File Created:   TestFSwi_i_Out.csv
This code was written in Python  3.6.1

In [8]:
!python file_fuser.py --inputStart TestFSwi_i --output TestFSwi_noI_Out.csv --input_index True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> False
Adding This File To Output:  TestFSwi_i1.csv
Adding This File To Output:  TestFSwi_i2.csv
Adding This File To Output:  TestFSwi_i3.csv
Adding This File To Output:  TestFSwi_i4.csv
Adding This File To Output:  TestFSwi_i5.csv
Adding This File To Output:  TestFSwi_i6.csv
Adding This File To Output:  TestFSwi_i7.csv
Adding This File To Output:  TestFSwi_i8.csv
Adding This File To Output:  TestFSwi_i9.csv
Adding This File To Output:  TestFSwi_i_Out.csv
If the input files have an index, this will be dropped during output.
Output File Created:   TestFSwi_noI_Out.csv
This code was written in Python  3.6.1

In the above tests:

  • The first test splits a single file into files with the starting pattern: TestFSwi_i
  • The second test then merges these files back into a single output file named: TestFSwi_i_Out.csv
  • The third test merges the files into an output without an index. It also includes the file "TestFSwi_i_Out.csv" (the output of the second test) in its output, since this file too fits the input file pattern
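
The pattern behavior seen in the third test can be illustrated with a small sketch. It assumes the fuser selects inputs by a plain "starts with" match on the file name; the function names here are hypothetical. The second function shows one stricter alternative that accepts only the pattern followed by a number.

```python
import re


def matching_inputs(file_names, input_start):
    # Assumed behavior: any file whose name starts with the pattern qualifies.
    return sorted(n for n in file_names if n.startswith(input_start))


def fragment_inputs(file_names, input_start):
    # Stricter alternative: pattern, then digits, then the .csv extension.
    pattern = re.compile(re.escape(input_start) + r"\d+\.csv")
    return sorted(n for n in file_names if pattern.fullmatch(n))


names = ["TestFSwi_i1.csv", "TestFSwi_i2.csv",
         "TestFSwi_i_Out.csv", "TestFileSmall-wIndx.csv"]
loose = matching_inputs(names, "TestFSwi_i")
strict = fragment_inputs(names, "TestFSwi_i")
print(loose)
print(strict)
```

With the loose match, an earlier output file that shares the prefix is fused along with the fragments. Choosing an output name that does not begin with the input pattern (as later tests do with names starting "Tst") avoids this without any code change.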

Test Using No Index in The Original Input File

This next batch of tests starts from an input csv that does not have an index on it.


In [9]:
!python file_fragmenter.py -i TestFileSmall-NoIndx.csv -o TestFSnoX_ --numRows 27


Input files to split:  TestFileSmall-NoIndx.csv
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Your file has  108  rows excluding the header row.
Splitting source file into  4  smaller files.
Please wait ...
Building File  1 :  TestFSnoX_1.csv  with row count of:  27
Building File  2 :  TestFSnoX_2.csv  with row count of:  27
Building File  3 :  TestFSnoX_3.csv  with row count of:  27
Building File  4 :  TestFSnoX_4.csv  with row count of:  27
Files Ready.
This code was written in Python  3.6.1

In [10]:
!python file_fuser.py -i TestFSnoX_ -o TstFSnoX_Out.csv


Input files will be located at:  .
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Adding This File To Output:  TestFSnoX_1.csv
Adding This File To Output:  TestFSnoX_2.csv
Adding This File To Output:  TestFSnoX_3.csv
Adding This File To Output:  TestFSnoX_4.csv
If the input files have an index, this will be dropped during output.
Output File Created:   TstFSnoX_Out.csv
This code was written in Python  3.6.1

In [11]:
!python file_fuser.py -i TestFSnoX_ -o TestFSnoX_Out_i.csv --output_index True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> True
Adding This File To Output:  TestFSnoX_1.csv
Adding This File To Output:  TestFSnoX_2.csv
Adding This File To Output:  TestFSnoX_3.csv
Adding This File To Output:  TestFSnoX_4.csv
Creating Unique Index for Output File...
Output File Created:   TestFSnoX_Out_i.csv
This code was written in Python  3.6.1

In the previous batch of tests, the original input file did not have an index. The file is split and then merged back together. Finally, the same split files are merged again, but this time into a file with an index. The index is auto-generated and runs 0, 1, 2, and so on.
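
A rough sketch of how a fresh index can be generated while fusing is shown below. The "DF Index" messages suggest the scripts work with DataFrames, but to stay dependency-free this example swaps in the standard library csv module, and the function name is hypothetical. Fragments are passed as strings so the example runs without touching the disk.

```python
import csv
import io


def fuse_with_index(fragments):
    """Fuse CSV texts that share a header, numbering data rows from 0."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    next_index = 0
    for position, text in enumerate(fragments):
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        if position == 0:
            # The index column sits at column zero and has no header label.
            writer.writerow([""] + header)
        for row in reader:
            writer.writerow([next_index] + row)
            next_index += 1
    return out.getvalue()


fused = fuse_with_index(["name\na\nb\n", "name\nc\n"])
print(fused)
```

The counter keeps running across fragments, so the index in the fused output is unique and continuous rather than restarting in each file.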

Testing Using Single Column File

This next section repeats all of the tests before it using a csv with only one column of data in it. These tests were performed for completeness. Early drafts of the code worked on a single column but not on multiple columns. Once the code was revised to work on multiple columns, these tests were performed again to show it still works on a single column. They also illustrate more permutations of the input parameter syntax.


In [13]:
!python file_fragmenter.py --input TestFile_SingleCol-NoIndx.csv --outputStart TestSCFL_noX_


Input files to split:  TestFile_SingleCol-NoIndx.csv
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Your file has  79066  rows excluding the header row.
Splitting source file into  8  smaller files.
Please wait ...
Building File  1 :  TestSCFL_noX_1.csv  with row count of:  10000
Building File  2 :  TestSCFL_noX_2.csv  with row count of:  10000
Building File  3 :  TestSCFL_noX_3.csv  with row count of:  10000
Building File  4 :  TestSCFL_noX_4.csv  with row count of:  10000
Building File  5 :  TestSCFL_noX_5.csv  with row count of:  10000
Building File  6 :  TestSCFL_noX_6.csv  with row count of:  10000
Building File  7 :  TestSCFL_noX_7.csv  with row count of:  10000
Building File  8 :  TestSCFL_noX_8.csv  with row count of:  9066
Files Ready.
This code was written in Python  3.6.1

In [14]:
!python file_fuser.py -i TestSCFL_noX_ -o TstSCFL_noX_Out.csv


Input files will be located at:  .
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Adding This File To Output:  TestSCFL_noX_1.csv
Adding This File To Output:  TestSCFL_noX_2.csv
Adding This File To Output:  TestSCFL_noX_3.csv
Adding This File To Output:  TestSCFL_noX_4.csv
Adding This File To Output:  TestSCFL_noX_5.csv
Adding This File To Output:  TestSCFL_noX_6.csv
Adding This File To Output:  TestSCFL_noX_7.csv
Adding This File To Output:  TestSCFL_noX_8.csv
If the input files have an index, this will be dropped during output.
Output File Created:   TstSCFL_noX_Out.csv
This code was written in Python  3.6.1

In [15]:
!python file_fuser.py -i TestSCFL_noX_ -o TestSCFL_noX_Out.csv -odx True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> True
Adding This File To Output:  TestSCFL_noX_1.csv
Adding This File To Output:  TestSCFL_noX_2.csv
Adding This File To Output:  TestSCFL_noX_3.csv
Adding This File To Output:  TestSCFL_noX_4.csv
Adding This File To Output:  TestSCFL_noX_5.csv
Adding This File To Output:  TestSCFL_noX_6.csv
Adding This File To Output:  TestSCFL_noX_7.csv
Adding This File To Output:  TestSCFL_noX_8.csv
Creating Unique Index for Output File...
Output File Created:   TestSCFL_noX_Out.csv
This code was written in Python  3.6.1

In [16]:
!python file_fragmenter.py -i TestFile_SingleCol-wIndx.csv -o TestSCFL_idX_ -rw 11500 -idx True -odx True


Input files to split:  TestFile_SingleCol-wIndx.csv
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> True
Your file has  79066  rows excluding the header row.
The DF Index will be output giving each data row an index field value.
To exclude the index on output, do not set the --output_index (-odx) argument (recommended if files will be imported to R).
Splitting source file into  7  smaller files.
Please wait ...
Building File  1 :  TestSCFL_idX_1.csv  with row count of:  11500
Building File  2 :  TestSCFL_idX_2.csv  with row count of:  11500
Building File  3 :  TestSCFL_idX_3.csv  with row count of:  11500
Building File  4 :  TestSCFL_idX_4.csv  with row count of:  11500
Building File  5 :  TestSCFL_idX_5.csv  with row count of:  11500
Building File  6 :  TestSCFL_idX_6.csv  with row count of:  11500
Building File  7 :  TestSCFL_idX_7.csv  with row count of:  10066
Files Ready.
This code was written in Python  3.6.1

In [17]:
!python file_fuser.py -i TestSCFL_idX_ -o TstSCFL_idX_Out.csv -idx True -odx True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> True
Adding This File To Output:  TestSCFL_idX_1.csv
Adding This File To Output:  TestSCFL_idX_2.csv
Adding This File To Output:  TestSCFL_idX_3.csv
Adding This File To Output:  TestSCFL_idX_4.csv
Adding This File To Output:  TestSCFL_idX_5.csv
Adding This File To Output:  TestSCFL_idX_6.csv
Adding This File To Output:  TestSCFL_idX_7.csv
The DF Index will be output giving each data row an index field value.
Output File Created:   TstSCFL_idX_Out.csv
This code was written in Python  3.6.1

In [18]:
!python file_fuser.py --inputStart TestSCFL_idX_ --output TestSCFL_idX_Out.csv --input_index True


Input files will be located at:  .
args.input_index is set to: <class 'bool'> True
args.output_index is set to: <class 'bool'> False
Adding This File To Output:  TestSCFL_idX_1.csv
Adding This File To Output:  TestSCFL_idX_2.csv
Adding This File To Output:  TestSCFL_idX_3.csv
Adding This File To Output:  TestSCFL_idX_4.csv
Adding This File To Output:  TestSCFL_idX_5.csv
Adding This File To Output:  TestSCFL_idX_6.csv
Adding This File To Output:  TestSCFL_idX_7.csv
If the input files have an index, this will be dropped during output.
Output File Created:   TestSCFL_idX_Out.csv
This code was written in Python  3.6.1

Final test involving a large file. Lack of --numRows (or -rw) argument means code will default to 10,000 rows per file.


In [19]:
!python file_fragmenter.py -i TestFileLarge.csv -o TestFileLrgFrag


Input files to split:  TestFileLarge.csv
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Your file has  166106  rows excluding the header row.
Splitting source file into  17  smaller files.
Please wait ...
Building File  1 :  TestFileLrgFrag1.csv  with row count of:  10000
Building File  2 :  TestFileLrgFrag2.csv  with row count of:  10000
Building File  3 :  TestFileLrgFrag3.csv  with row count of:  10000
Building File  4 :  TestFileLrgFrag4.csv  with row count of:  10000
Building File  5 :  TestFileLrgFrag5.csv  with row count of:  10000
Building File  6 :  TestFileLrgFrag6.csv  with row count of:  10000
Building File  7 :  TestFileLrgFrag7.csv  with row count of:  10000
Building File  8 :  TestFileLrgFrag8.csv  with row count of:  10000
Building File  9 :  TestFileLrgFrag9.csv  with row count of:  10000
Building File  10 :  TestFileLrgFrag10.csv  with row count of:  10000
Building File  11 :  TestFileLrgFrag11.csv  with row count of:  10000
Building File  12 :  TestFileLrgFrag12.csv  with row count of:  10000
Building File  13 :  TestFileLrgFrag13.csv  with row count of:  10000
Building File  14 :  TestFileLrgFrag14.csv  with row count of:  10000
Building File  15 :  TestFileLrgFrag15.csv  with row count of:  10000
Building File  16 :  TestFileLrgFrag16.csv  with row count of:  10000
Building File  17 :  TestFileLrgFrag17.csv  with row count of:  6106
Files Ready.
This code was written in Python  3.6.1

In [21]:
!python file_fuser.py -i TestFileLrgFrag -o TestFileLarge_Out.csv


Input files will be located at:  .
args.input_index is set to: <class 'bool'> False
args.output_index is set to: <class 'bool'> False
Adding This File To Output:  TestFileLrgFrag1.csv
Adding This File To Output:  TestFileLrgFrag10.csv
Adding This File To Output:  TestFileLrgFrag11.csv
Adding This File To Output:  TestFileLrgFrag12.csv
Adding This File To Output:  TestFileLrgFrag13.csv
Adding This File To Output:  TestFileLrgFrag14.csv
Adding This File To Output:  TestFileLrgFrag15.csv
Adding This File To Output:  TestFileLrgFrag16.csv
Adding This File To Output:  TestFileLrgFrag17.csv
Adding This File To Output:  TestFileLrgFrag2.csv
Adding This File To Output:  TestFileLrgFrag3.csv
Adding This File To Output:  TestFileLrgFrag4.csv
Adding This File To Output:  TestFileLrgFrag5.csv
Adding This File To Output:  TestFileLrgFrag6.csv
Adding This File To Output:  TestFileLrgFrag7.csv
Adding This File To Output:  TestFileLrgFrag8.csv
Adding This File To Output:  TestFileLrgFrag9.csv
If the input files have an index, this will be dropped during output.
Output File Created:   TestFileLarge_Out.csv
This code was written in Python  3.6.1
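
Note that the output above lists the fragments in lexicographic order (1, 10, 11, ..., 17, 2, 3, ...). If rows are appended in the order listed, the fused file would not preserve the row order of the original. Should the original order matter, a natural sort of the file names is one possible remedy. The sketch below is an illustration with a hypothetical helper, not part of the scripts.

```python
import re


def natural_key(name):
    # Split into text and digit runs so that "10" compares as the number 10.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


names = [f"TestFileLrgFrag{n}.csv" for n in (1, 10, 11, 2, 9)]
lexicographic = sorted(names)
natural = sorted(names, key=natural_key)
print(lexicographic)
print(natural)
```

Plain sorting reproduces the order seen in the output, while the natural key places fragment 2 directly after fragment 1.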

In [ ]:
# The End