How to set up a data science blog

I'm following this article on how to make such a blog using GitHub and Jupyter Notebook, as I'm familiar with both, and it looks easy.
https://www.dataquest.io/blog/how-to-setup-a-data-science-blog/
We'll see if it actually is easy; I'm recording the steps and the problems I encounter along the way (thus the blog name), how I fail to solve them, and how I eventually do solve them (or why I'm giving up). Let's begin...

The sales pitch

The first part of the article is all sales pitch for their idea; they tell you why you should have a DS portfolio, why a blog makes a good portfolio, why Pelican is great, etc. If you're like me, you can skip down to the part where the actual instructions begin. To be like me is pretty simple in this case: you need to have searched for such an article because you were already sold on the whole idea of wanting a data science portfolio, and had read that a blog might make a good portfolio, and you don't yet care about the tooling behind your blog, it just has to work, so Pelican is probably as good as any because... Python! If you're not like me, well, the pitch isn't that long. I'll give you a moment to read it through.
.
.
.
.
Cool. Now that we're all on board with portfolios, blogs and Pelican, let's begin.

One note though. Every H2 header below, starting with Installing Pelican, has the same name as the matching section in the original article, but my sections won't contain the instructions. You'll need to read the article for them.

Installing Pelican

The instructions start with some standard things like making a folder, a .gitignore file, a virtual environment, etc., things I have done once or twice. Creating the virtual environment was particularly easy using conda, which is part of the Anaconda distribution of Python. But I hit my first error installing the requirements.

First error on "pip install -r requirements.txt"

c:\AllProjects\LearningProjects\jupyter-blog>pip install -r requirements.txt
Requirement already satisfied: ipython>=4.0...
  .
  .
  .
Found existing installation: matplotlib 1.5.3
    Uninstalling matplotlib-1.5.3:
Exception:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\shutil.py", line 538, in move
    os.rename(src, real_dst)
PermissionError: [WinError 5] Access is denied: 'c:\\anaconda3\\lib\\site-packages\\matplotlib' -> 'C:\\Users\\user\\AppData\\Local\\Temp\\pip--uninstall\\anaconda3\\lib\\site-packages\\matplotlib'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Anaconda3\li ...
  .
  .
  .
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'c:\\anaconda3\\lib\\site-packages\\matplotlib\\mpl-data\\fonts\\ttf\\Vera.ttf'

Parsing this, it looks like matplotlib failed to install, and the exception was raised because a previous version couldn't be uninstalled. From the exception message, you know

1. I love obscure error messages, because I am running Python on a Windows machine, Win 10 to be exact
2. a matplotlib file, Vera.ttf, was in use by another process

Turns out I had several IPython notebooks open, most of which were using matplotlib, so the notebook kernels had their "hooks" into the matplotlib files pip was trying to uninstall. This seemed odd to me, as I was in a different virtual env from the one the notebooks were running from. Before diving into a detailed investigation of that or of what Vera.ttf is, I tried what seemed to me to be the naive, easy thing: closing all the notebooks and rerunning the command. And that fixed it, so matplotlib installed fine into my blog's virtual environment.

I could stop and start that detailed investigation, but my goal is a working blog, not a full understanding of the errors I get along the way. Vera and the possible problem in my venv will have to wait for later. (Some foreshadowing)

Creating your data science blog

The next section of the article instructed me to run the pelican-quickstart script, which was already on the cmd path when I tried it. It asked a bunch of setup questions, which were mostly meaningless to me at that point, so I went with the defaults as the article instructed unless I knew I didn't want the default. I was impressed with the help provided within the time zone question: when I put in something it didn't recognize as a time zone, it gave me a Wikipedia page to look up the time zone format it was expecting. Very nice.

This question caught my attention:

Do you want to specify a URL prefix? e.g., http://example.com (Y/n) n

The guide said to say no to it, and I did, as you can see. But I have a domain from another project I worked on last year, so I think I might actually want to say yes to this eventually, once I figure out how my domain will work with Pelican and GitHub Pages. More learnings.

The final question was somewhat striking as well:

Do you want to upload your website using GitHub Pages? (y/N)

Why not? Isn't that the point? Aren't we using GitHub for our blog? I'm hoping to figure that out by the end of this tutorial.

(Future me: The article does make it clear how to use GitHub Pages for my site, but I'm still not sure why we're not making a GitHub Pages address here and adding it to the Pelican setup.)

Installing the Jupyter Plugin

The first instruction here is to create a git repo of the folder we've been setting up. I think it is odd the author doesn't have us commit everything we have done so far. I will use the git log to solve any of the bugs I'll inevitably introduce now, and also as a history of what I did to make the website. (This is called tempting Fate)

Now that I've finished committing everything, I'll continue with setting up the notebook rendering plugin. This went without a hitch, but not much was run, so I guess it was likely to go smoothly. The only git command created a submodule from another repo, so I learned how easy it is to set up a submodule:

git submodule add git://github.com/danielfrg/pelican-ipynb.git plugins/ipynb

Awesome! This article is paying off already.

Writing your first post

The article would have me write some dumb "Hello World!!"-type post, which I'm not going to do. This notebook will be my first post, so that I can start my blog off being a little meta.

Generating HTML

This was very smooth as well. In the top folder of the blog, I ran

pelican content

This generated some error messages and warnings.

WARNING: PLUGIN_PATH setting has been replaced by PLUGIN_PATHS, moving it to the new setting name.  
WARNING: Defining PLUGIN_PATHS setting as string has been deprecated (should be a list)  
ERROR: Could not process .ipynb_checkpoints\Making a Data Science blog-checkpoint.ipynb  
  | Exception: Could not find metadata in `.ipynb-meta`, inside `.ipynb` or external `.md` file.  
ERROR: Could not process .ipynb_checkpoints\Untitled-checkpoint.ipynb  
  | Exception: Could not find metadata in `.ipynb-meta`, inside `.ipynb` or external `.md` file.  
Done: Processed 1 article, 0 drafts, 0 pages and 0 hidden pages in 0.32 seconds.

I'm not sure what those mean at this point, nor am I going to dig into them now, as the website displayed fine when I ran

python -m pelican.server

from the output folder. I'll investigate what those mean later, as they do not appear to be an impediment.
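For later reference, here is a minimal sketch of settings that might quiet the messages above. It assumes the plugins live in a `./plugins` folder and that the plugin is loaded as `ipynb.markup`; neither is confirmed against the article's own pelicanconf.py. `PLUGIN_PATHS` and `IGNORE_FILES` are standard Pelican settings.

```python
# pelicanconf.py (fragment): a sketch under assumptions, not the article's exact config.

# The warnings say PLUGIN_PATH became PLUGIN_PATHS and should be a list, not a string.
PLUGIN_PATHS = ['./plugins']   # assumed location of the plugins folder

# Assumed plugin name, matching the submodule checked out at plugins/ipynb.
PLUGINS = ['ipynb.markup']
MARKUP = ('md', 'ipynb')

# Jupyter keeps autosave copies in .ipynb_checkpoints. They have no metadata,
# so ask Pelican to skip that folder instead of trying to render it.
IGNORE_FILES = ['.ipynb_checkpoints']
```

If this works as hoped, the two "Could not process" errors should disappear because the checkpoint copies are never read.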

The HTML revealed that some things weren't translating very well from the notebook to the HTML. For instance,

- the tags list was too long and too big
- too many details from the meta file were being displayed on the page
- not sure I like the style sheet used
- the error message above in the Raw cell is not copying over well at all

None of these are show stoppers though, so I'll figure them out later.

A quick Markdown aside

I learned some Markdown things from trying to get the formatting in the last cell correct. First, ending a line with two or more spaces forces a line break.

Without two spaces after the first Hello:

Hello Hello

With two spaces after the first Hello:

Hello
Hello
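Trailing spaces are invisible in the editor, so a small helper can point out which lines of a cell will force a break. This is only a sketch; `hard_break_lines` is a made-up name, not part of any Markdown library.

```python
def hard_break_lines(markdown_text):
    """Return the 1-based numbers of non-empty lines ending in two or more spaces."""
    numbers = []
    for number, line in enumerate(markdown_text.split("\n"), start=1):
        if line.strip() and line.endswith("  "):
            numbers.append(number)
    return numbers


without_spaces = "Hello\nHello"    # no trailing spaces after the first Hello
with_spaces = "Hello  \nHello"     # two trailing spaces after the first Hello

print(hard_break_lines(without_spaces))
print(hard_break_lines(with_spaces))
```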

The next thing is code snippets in Markdown. They work like this:
```
from that import this
```
turns into

from that import this

And if you add the keyword "python" like this
```python
from that import this
```
you get

from that import this

I also learned that putting a backslash in front of a Markdown character, "\ (markdown character)", displays the character itself instead of applying its formatting. Also, > at the start of a line indents some text in a nice way:

See

This indenting has some annoying interactions with lines which follow it, but Markdown seems like it is full of surprising (read: annoying) interactions.

Creating GitHub Page

I've done all of these tasks with git before, so no trouble here. Running this command

pelican content -s publishconf.py

returned the same errors I had seen from the command

pelican content

I'll definitely need to figure out how to clear those warnings and errors.

Committing your files and deploying to GitHub Pages

And at this point, the other shoe finally drops regarding git. I should not have been committing to the master branch, but to a dev branch. Ack!! Let's see if this can be fixed.

First, I've completed the steps for this part of the guide:

1. create a dev branch
2. push the dev branch to github

Then I'll

1. clear out the master branch of all files
2. push the cleared branch to github
3. switch to dev and run the ghp-import command
4. then push again to master
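The two lists above could be scripted roughly like this. It is a sketch only: the exact git commands, the `origin` remote name and the commit message are my assumptions about how each step would be carried out, and it defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def deploy_plan(output_dir="output", source_branch="dev", pages_branch="master"):
    """Return the assumed commands for the steps above, in order, as argument lists."""
    return [
        ["git", "checkout", "-b", source_branch],           # create a dev branch
        ["git", "push", "origin", source_branch],           # push the dev branch
        ["git", "checkout", pages_branch],
        ["git", "rm", "-r", "--quiet", "."],                # clear master of all files
        ["git", "commit", "-m", "Clear master for the generated site"],
        ["git", "push", "origin", pages_branch],            # push the cleared branch
        ["git", "checkout", source_branch],                 # switch back to dev
        ["pelican", "content", "-s", "publishconf.py"],     # build the site
        ["ghp-import", output_dir, "-b", pages_branch],     # commit output to master
        ["git", "push", "origin", pages_branch],            # push again to master
    ]


def run_plan(plan, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for command in plan:
        print(" ".join(shlex.quote(part) for part in command))
        if not dry_run:
            subprocess.run(command, check=True)


run_plan(deploy_plan())  # dry run: prints the commands without touching the repo
```

Keeping the plan as plain data makes it easy to read over before running anything destructive like the `git rm` step.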

Hopefully, that will be enough to get a functioning website.

Of course, it wasn't.

Error 2 - import error

When running

pelican content -s publishconf.py

after clearing the master branch as above, an import error occurred. I didn't think to record the full error before finding the fix for it.

It turns out that a submodule's files aren't stored in the parent repo, so when I cleared the master branch of all files, I also cleared all the files which were not under version control, including those in the submodule. But there is a command to update the submodules that can be used to restore the lost files:

git submodule update --recursive --remote

Running this put all the files of the submodule back, and then the command above executed fine without any import error.
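A quick check like the following would have caught the empty submodule before Pelican's import error did. It is a sketch; `submodule_is_populated` is a made-up helper, and `plugins/ipynb` is the path used when the submodule was added earlier.

```python
from pathlib import Path


def submodule_is_populated(path):
    """True if the folder exists and holds anything besides a .git entry."""
    folder = Path(path)
    if not folder.is_dir():
        return False
    return any(entry.name != ".git" for entry in folder.iterdir())


if not submodule_is_populated("plugins/ipynb"):
    print("plugins/ipynb looks empty; try: git submodule update --recursive --remote")
```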

Discovery about conda

When I tried to run the ghp-import tool as the guide recommends, it failed because the file could not be found. After digging in a bit, I found that the virtual environment I thought I had created was empty. I had run

conda create -n jup-blog

without any packages or Python version set. This just created an empty environment, without even Python installed.

What was odd was that I had successfully installed the requirements from requirements.txt above without having pip available in the environment. It seems the pip command I had run came from the root Anaconda install, not the jup-blog venv, meaning I had installed the requirements in the root Anaconda, not in the venv as I thought. It also seems that the foreshadowed and unexpected matplotlib PermissionError I noted all the way back near the beginning came about because the root Anaconda install was standing in for the empty jup-blog virtual environment. Finally, ghp-import was not installed where the guide thought it was, so it could not be found and run.
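In hindsight, a few lines of Python would have shown which install was answering my commands. This is a sketch using only the standard library; `environment_report` is a made-up name, and the tool list is just the commands used in this post.

```python
import shutil
import sys


def environment_report(tools=("pip", "pelican", "ghp-import")):
    """Map the running interpreter and each tool to the path it resolves to."""
    report = {
        "python": sys.executable,  # the interpreter actually running this code
        "prefix": sys.prefix,      # the install or environment it belongs to
    }
    for tool in tools:
        report[tool] = shutil.which(tool)  # None when the tool is not on PATH
    return report


for name, location in environment_report().items():
    print(f"{name:12} {location}")
```

If the paths point at the root Anaconda folder rather than the environment's folder, the environment is not the one doing the work.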

I have now installed Python 3.5, pip, and all the Pelican requirements into the jup-blog venv, and I've uninstalled Pelican from the root version of Anaconda. I tried the ghp-import tool again...

Error 3 - ghp-import command not found

Running ghp-import at the command line resulted in another 'command not found' error. After looking around, this looks like a problem noted for ghp-import version 0.4.1 on Windows.
https://github.com/davisp/ghp-import/pull/25
The ghp-import 0.4.1 package does not create an executable which Windows can run, thus the error. A fix is included in 0.4.2, but the latest version of ghp-import on PyPI is 0.4.1. I ran the install of ghp-import from the GitHub repo:

pip install git+git://github.com/davisp/ghp-import.git

The command line tool was created as expected, and success!... sorta.

Error 4 - a TypeError

Running ghp-import on its own worked fine, returning the help page. But when I ran the command that did the work I needed done

ghp-import output -b master

I got this error

Traceback (most recent call last):
  File "C:\Anaconda3\envs\jup-blog\Scripts\ghp-import-script.py", line 11, in <module>
    load_entry_point('ghp-import==0.4.2', 'console_scripts', 'ghp-import')()
  File "C:\Anaconda3\envs\jup-blog\lib\site-packages\ghp_import.py", line 194, in main
    if not try_rebase(opts.remote, opts.branch):
  File "C:\Anaconda3\envs\jup-blog\lib\site-packages\ghp_import.py", line 78, in try_rebase
    if sp.call(cmd) != 0:
  File "C:\Anaconda3\envs\jup-blog\lib\subprocess.py", line 557, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Anaconda3\envs\jup-blog\lib\subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "C:\Anaconda3\envs\jup-blog\lib\subprocess.py", line 1198, in _execute_child
    args = list2cmdline(args)
  File "C:\Anaconda3\envs\jup-blog\lib\subprocess.py", line 751, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: a bytes-like object is required, not 'str'
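The last frame shows `list2cmdline` evaluating `" " in arg`, which fails like this on Python 3 when `arg` is bytes rather than str. Below is a small sketch of the failure and one way around it. The command is made up for illustration, `as_text_args` is a hypothetical helper, and I haven't checked that the fork mentioned below fixes it this way.

```python
def as_text_args(args, encoding="utf-8"):
    """Decode any bytes items so the whole command is made of str."""
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]


# Hypothetical command with one bytes argument mixed in among the strings.
cmd = ["git", "rev-list", "--max-count=1", b"origin/master"]

try:
    # The same kind of membership test list2cmdline applies to each argument.
    needquote = " " in cmd[-1]
except TypeError as error:
    print("TypeError:", error)

print(as_text_args(cmd))  # every item is now str, so the test above would pass
```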

I found this issue page for the Nikola project
https://github.com/getnikola/nikola/issues/2223
which linked to this fork of ghp-import called python-ghp-import
https://github.com/ionelmc/python-ghp-import/pull/1
I tried installing this repo instead and running the ghp-import command, and success!! Then I pushed the master branch to GitHub, and the blog is up, finally.
http://salvor7.github.io
It is ugly though! (Less ugly now, Dec 17)

Next Steps

There are some unresolved problems I've noted along the way, and I'll need to fix those. I also need to get a better theme, and set up comments and... But this is a good place to stop for now.