I aim to use Luigi, along with pytest and hypothesis, to create a tested data pipeline. I've used pytest and hypothesis before, and both are great and easy to get going with. I found Luigi with a Google search.
The first thing is to understand how Luigi works, so that I can test the various things coming out of it. But the first step of TDDP is to write a failing test.
def test_luigi_ran_task():
    assert False
Getting a task to run turns out to be non-trivial. You can run a command at the command line like this:
luigi --module my_module MyTask --x 123 --y 456 --local-scheduler
or run some Python code like this:
import luigi
luigi.run(['examples.HelloWorldTask', '--workers', '1', '--local-scheduler'])
The command line style fails right off the bat, but the Python call seems to work:
===== Luigi Execution Summary =====
Scheduled 1 tasks of which:
* 1 ran successfully:
- 1 examples.HelloWorldTask()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Let's rewrite the test so that it actually runs the task, at least in theory.
import luigi

def test_luigi_ran_task():
    luigi.run(['simio_inputs.HelloWorldTask', '--workers', '1', '--local-scheduler'])
I then run pytest at the command line, and get back this exception:
TaskClassNotFoundException
One of the examples I have includes the comment "my_module.py, available in your sys.path", so perhaps I need to add the folder containing simio_inputs.py to the path. I tried that, and still no dice. This is presenting too many early challenges. I'm going to look for a pipeline-building solution other than Luigi.
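A quick way to check whether a path change like this took effect, independent of Luigi, is to import the module by name. The sketch below is only an illustration: it creates a throwaway folder with a stand-in module (`demo_tasks` and its empty `HelloWorldTask` class are placeholders, not the real simio_inputs code), puts that folder on `sys.path`, and imports it. If the same check succeeds for the real module, the path is probably not what Luigi is tripping over.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_folder(folder, module_name):
    """Put `folder` on sys.path (if missing) and import `module_name` from it."""
    folder = str(Path(folder).resolve())
    if folder not in sys.path:
        sys.path.insert(0, folder)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo with a throwaway folder and a stand-in module (placeholder names).
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "demo_tasks.py").write_text("class HelloWorldTask:\n    pass\n")

module = import_from_folder(demo_dir, "demo_tasks")
has_task = hasattr(module, "HelloWorldTask")
print(module.__name__, has_task)
```

An `ImportError` here would point at the path; a clean import would point at how the task is being named or looked up.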
I've used scikit-learn's Pipeline before, and it has felt to be more about well-defined statistical or machine learning operations. I really need something to handle the messier state my data usually begins in.
To that end, I found the GitHub repo Awesome-Pipeline. It has a list of many pipeline products, libraries, and tools. I concentrated on the libraries section and pulled out a number of Python libraries. I quickly filtered those on several criteria:
- working websites
- no external dependencies like Docker or a Google API
- recent GitHub activity
- apparent lack of complexity
This narrowed it down to two finalists: Consecution and pydoit. Both are pip installable, look like they might be easy to use, and seem to fit my needs.
After getting into the details of their READMEs, examples, and tutorials, I think doit is the better of the two for my general purposes. It's pip installed and I'm ready to give it a go.
I tried a number of things to get it to work with pytest, but I just got weird errors:
test_scheduled_procedure_input_file.py:13:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
task_creators = <module 'bin.simio_inputs' from 'M:\\HBA\\Consulting\\Projects\\Current\\Royal Columbian Hospital Redevelopment\\Simulation - Interventional Floor\\InterPlatformSimioInput\\bin\\simio_inputs.py'>
    def run(task_creators):
        """run doit using task_creators
        @param task_creators: module or dict containing task creators
        """
>       sys.exit(DoitMain(ModuleTaskLoader(task_creators)).run(sys.argv[1:]))
E       SystemExit: 0
f:\Anaconda3\lib\site-packages\doit\api.py:13: SystemExit
---------------------------- Captured stdout call -----------------------------
. hello
========================== 1 failed in 9.74 seconds ===========================
and there was little help online. There seemed to be even less on how doit could work with pytest.
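Looking at the traceback itself, the reported failure is `SystemExit: 0`, and the captured stdout shows the task printed `hello`. The `run` helper in `doit/api.py` wraps its call in `sys.exit(...)`, so it raises `SystemExit` every time, and pytest counts that as a failed test even when the exit code is 0. It also forwards `sys.argv[1:]`, which under pytest holds pytest's own arguments. One possible workaround, untested against doit itself, is to catch the exit and assert on its code. The sketch below uses a stand-in function (`fake_doit_run`, a hypothetical name) in place of the real call so that it runs without doit installed.

```python
import sys


def fake_doit_run():
    """Stand-in for doit's run helper: does some work, then exits with a status."""
    print("hello")
    sys.exit(0)  # assumption: mirrors the sys.exit(...) seen in the traceback


def exit_code_of(func):
    """Call `func` and return the status it passed to sys.exit."""
    try:
        func()
    except SystemExit as exc:
        # sys.exit() with no argument carries code None, which means success.
        return 0 if exc.code is None else exc.code
    raise AssertionError("function returned without calling sys.exit")


def test_task_exits_cleanly():
    # An exit status of 0 means the run succeeded, so this test passes.
    assert exit_code_of(fake_doit_run) == 0


status = exit_code_of(fake_doit_run)
```

Inside a real test, `pytest.raises(SystemExit)` does the same job, with the exit code available on the captured exception's `value.code`.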
I'm abandoning these pipeline automation tools. Trying to use them adds too much friction to the process. I think testing the data pipeline is more important than "properly" automating it, and I am very fluent with pytest, but I can't find easy ways to use doit or Luigi with pytest. It is just not worth my time to learn how to make these tools play nicely with each other; I'm not interested enough.