In [ ]:
print "some words"
# print "Some more words"

We have seen that the hash (or pound) symbol '#' introduces a comment in Python. A comment is intended to remind a human reader (who may be the person who wrote the programme) what a particular part of the programme does or how it works. But comments are not the only place we see documentation, and before we wrap up for the day I will share some useful ideas about how we can best document our programmes and their development. This becomes particularly important if we are developing or sharing programmes with others, and it touches on aspects of version control as well as programming.

An interesting first question is: how much of a programme should be documentation? 10%? 90%?

Documentation for Python

As well as comments, many things in Python come with built-in help. We've seen the file command, used to open a new file. We can use help to find out about all the options:


In [2]:
help(file)


Help on class file in module __builtin__:

class file(object)
 |  file(name[, mode[, buffering]]) -> file object
 |  
 |  Open a file.  The mode can be 'r', 'w' or 'a' for reading (default),
 |  writing or appending.  The file will be created if it doesn't exist
 |  when opened for writing or appending; it will be truncated when
 |  opened for writing.  Add a 'b' to the mode for binary files.
 |  Add a '+' to the mode to allow simultaneous reading and writing.
 |  If the buffering argument is given, 0 means unbuffered, 1 means line
 |  buffered, and larger numbers specify the buffer size.  The preferred way
 |  to open a file is with the builtin open() function.
 |  Add a 'U' to mode to open the file for input with universal newline
 |  support.  Any line ending in the input file will be seen as a '\n'
 |  in Python.  Also, a file so opened gains the attribute 'newlines';
 |  the value for this attribute is one of None (no newline read yet),
 |  '\r', '\n', '\r\n' or a tuple containing all the newline types seen.
 |  
 |  'U' cannot be combined with 'w' or '+' mode.
 |  
 |  Methods defined here:
 |  
 |  __delattr__(...)
 |      x.__delattr__('name') <==> del x.name
 |  
 |  __enter__(...)
 |      __enter__() -> self.
 |  
 |  __exit__(...)
 |      __exit__(*excinfo) -> None.  Closes the file.
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |  
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __setattr__(...)
 |      x.__setattr__('name', value) <==> x.name = value
 |  
 |  close(...)
 |      close() -> None or (perhaps) an integer.  Close the file.
 |      
 |      Sets data attribute .closed to True.  A closed file cannot be used for
 |      further I/O operations.  close() may be called more than once without
 |      error.  Some kinds of file objects (for example, opened by popen())
 |      may return an exit status upon closing.
 |  
 |  fileno(...)
 |      fileno() -> integer "file descriptor".
 |      
 |      This is needed for lower-level file interfaces, such os.read().
 |  
 |  flush(...)
 |      flush() -> None.  Flush the internal I/O buffer.
 |  
 |  isatty(...)
 |      isatty() -> true or false.  True if the file is connected to a tty device.
 |  
 |  next(...)
 |      x.next() -> the next value, or raise StopIteration
 |  
 |  read(...)
 |      read([size]) -> read at most size bytes, returned as a string.
 |      
 |      If the size argument is negative or omitted, read until EOF is reached.
 |      Notice that when in non-blocking mode, less data than what was requested
 |      may be returned, even if no size parameter was given.
 |  
 |  readinto(...)
 |      readinto() -> Undocumented.  Don't use this; it may go away.
 |  
 |  readline(...)
 |      readline([size]) -> next line from the file, as a string.
 |      
 |      Retain newline.  A non-negative size argument limits the maximum
 |      number of bytes to return (an incomplete line may be returned then).
 |      Return an empty string at EOF.
 |  
 |  readlines(...)
 |      readlines([size]) -> list of strings, each a line from the file.
 |      
 |      Call readline() repeatedly and return a list of the lines so read.
 |      The optional size argument, if given, is an approximate bound on the
 |      total number of bytes in the lines returned.
 |  
 |  seek(...)
 |      seek(offset[, whence]) -> None.  Move to new file position.
 |      
 |      Argument offset is a byte count.  Optional argument whence defaults to
 |      0 (offset from start of file, offset should be >= 0); other values are 1
 |      (move relative to current position, positive or negative), and 2 (move
 |      relative to end of file, usually negative, although many platforms allow
 |      seeking beyond the end of a file).  If the file is opened in text mode,
 |      only offsets returned by tell() are legal.  Use of other offsets causes
 |      undefined behavior.
 |      Note that not all file objects are seekable.
 |  
 |  tell(...)
 |      tell() -> current file position, an integer (may be a long integer).
 |  
 |  truncate(...)
 |      truncate([size]) -> None.  Truncate the file to at most size bytes.
 |      
 |      Size defaults to the current file position, as returned by tell().
 |  
 |  write(...)
 |      write(str) -> None.  Write string str to file.
 |      
 |      Note that due to buffering, flush() or close() may be needed before
 |      the file on disk reflects the data written.
 |  
 |  writelines(...)
 |      writelines(sequence_of_strings) -> None.  Write the strings to the file.
 |      
 |      Note that newlines are not added.  The sequence can be any iterable object
 |      producing strings. This is equivalent to calling write() for each string.
 |  
 |  xreadlines(...)
 |      xreadlines() -> returns self.
 |      
 |      For backward compatibility. File objects now include the performance
 |      optimizations previously implemented in the xreadlines module.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  closed
 |      True if the file is closed
 |  
 |  encoding
 |      file encoding
 |  
 |  errors
 |      Unicode error handler
 |  
 |  mode
 |      file mode ('r', 'U', 'w', 'a', possibly with 'b' or '+' added)
 |  
 |  name
 |      file name
 |  
 |  newlines
 |      end-of-line convention used in this file
 |  
 |  softspace
 |      flag indicating that a space needs to be printed; used by print
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T

There is another way to get hold of this information which is used by the help function itself:


In [1]:
print file.__doc__


file(name[, mode[, buffering]]) -> file object

Open a file.  The mode can be 'r', 'w' or 'a' for reading (default),
writing or appending.  The file will be created if it doesn't exist
when opened for writing or appending; it will be truncated when
opened for writing.  Add a 'b' to the mode for binary files.
Add a '+' to the mode to allow simultaneous reading and writing.
If the buffering argument is given, 0 means unbuffered, 1 means line
buffered, and larger numbers specify the buffer size.  The preferred way
to open a file is with the builtin open() function.
Add a 'U' to mode to open the file for input with universal newline
support.  Any line ending in the input file will be seen as a '\n'
in Python.  Also, a file so opened gains the attribute 'newlines';
the value for this attribute is one of None (no newline read yet),
'\r', '\n', '\r\n' or a tuple containing all the newline types seen.

'U' cannot be combined with 'w' or '+' mode.

and we can look at some of those other options:


In [ ]:
print file.softspace.__doc__

These __doc__ things are just strings and help just prints them out! One of the nice things about Python is that we can add __doc__s to our own functions, and Python's help system can use these. Let's see how we can do this by documenting the code below.


In [ ]:
def fahr_to_kelvin(temp):
    return ((temp - 32) * (5.0/9.0)) + 273.15

def kelvin_to_celsius(temp):
    return temp - 273.15

def fahr_to_celsius(temp):
    temp_k = fahr_to_kelvin(temp)
    result = kelvin_to_celsius(temp_k)
    return result

print 'The boiling point of water is', fahr_to_celsius(212), 'C'

Add documentation by inserting a properly indented string as the first statement of each function, e.g.


In [ ]:
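def fahr_to_kelvin(temp):
    """Convert a temperature from Fahrenheit to Kelvin."""
    return ((temp - 32) * (5.0/9.0)) + 273.15

def kelvin_to_celsius(temp):
    """Convert a temperature from Kelvin to Celsius."""
    return temp - 273.15

def fahr_to_celsius(temp):
    """Convert a temperature from Fahrenheit to Celsius."""
    temp_k = fahr_to_kelvin(temp)
    result = kelvin_to_celsius(temp_k)
    return result

print 'The boiling point of water is', fahr_to_celsius(212), 'C'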

In [ ]:
help(fahr_to_celsius)

Comments in this form are known as docstrings, and there are many tools that can work with them (for example, to create webpages describing how your functions work). There are even guidelines describing how to best format docstrings so that these tools give the best possible results. See https://www.python.org/dev/peps/pep-0257/ for the details.
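As a minimal sketch of a fuller docstring, the cell below repeats fahr_to_kelvin with a multi-line docstring laid out in the way PEP 257 suggests: a one-line summary, a blank line, then the details. The wording of the docstring is only an illustration. inspect.getdoc is a standard library helper that returns the same string with the indentation of the later lines removed.

```python
import inspect


def fahr_to_kelvin(temp):
    """Convert a temperature from Fahrenheit to Kelvin.

    temp is the temperature in degrees Fahrenheit (a number).
    Returns the equivalent temperature in Kelvin as a float.
    """
    return ((temp - 32) * (5.0 / 9.0)) + 273.15


# The raw docstring is an ordinary string attribute of the function.
first_line = fahr_to_kelvin.__doc__.splitlines()[0]
print(first_line)

# inspect.getdoc strips the leading indentation from the later lines.
print(inspect.getdoc(fahr_to_kelvin))
```

Tools that build web pages from docstrings typically use the summary line on its own, which is one reason to keep it short and self-contained.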

It is good practice to include docstrings for your functions and make sure that these describe what the function does, what the input parameters are, what the results are, and, at a high level, how it works. It is often a good idea to include references to the literature where the approach is described. One of my better efforts is below.


In [19]:
import numpy as np

def rotT(T, g):
    """Rotate a rank 4 tensor, T, using a rotation matrix, g
       
       Tensor rotation involves a summation over all combinations
       of products of elements of the unrotated tensor and the 
       rotation matrix. Like this for a rank 3 tensor:
       
           T'(ijk) -> Sum g(i,p)*g(j,q)*g(k,r)*T(pqr)
       
       with the summation over p, q and r. The obvious implementation
       involves (2*rank) length 3 loops building up the summation in the
       inner set of loops. This optimized implementation is >100 times faster 
       than the obvious implementation using 8 nested loops. Returns a 
       3*3*3*3 numpy array representing the rotated tensor, Tprime. 
    """
    gg = np.outer(g, g) # Flattens input and returns 9*9 array
                        # of all possible products
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
                        # 81*81 array of double products reshaped
                        # to 3*3*3*3*3*3*3*3 array...
    axes = ((0, 2, 4, 6), (0, 1, 2, 3)) # We only need a subset 
                                        # of gggg in tensordot...
    return np.tensordot(gggg, T, axes)

Documentation for git and GitHub

The commit messages we write when we commit to git repositories are also useful documentation. I often end up copying text out of docstrings to explain what new code does. It is important to include details of what you are changing and why. There are also tools that process commit messages automatically and, for these to work well, it helps to format the message in a standard way. One description is here: http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html.
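One way to picture such a standard format is a short subject line, a blank line, and then a wrapped body. The sketch below checks a message against that layout. check_commit_message is a hypothetical helper written for this lesson, and the limits of 50 and 72 characters are commonly quoted conventions passed in as parameters, not rules that git enforces.

```python
def check_commit_message(message, subject_limit=50, body_limit=72):
    """Return a list of formatting problems found in a commit message.

    The limits are conventions (assumed defaults), not git requirements.
    """
    problems = []
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing subject line"]
    if len(lines[0]) > subject_limit:
        problems.append("subject line longer than %d characters" % subject_limit)
    # The second line should be blank so tools can separate subject and body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line between subject and body")
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_limit:
            problems.append("line %d longer than %d characters" % (number, body_limit))
    return problems


good = ("Add docstrings to temperature functions\n"
        "\n"
        "Describe the input and output units of each conversion so\n"
        "that help() gives useful information.")
bad = "Add docstrings to temperature functions\nDescribe the units."

print(check_commit_message(good))
print(check_commit_message(bad))
```

The first message passes with an empty list of problems; the second is reported because its body starts directly under the subject line.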

Create a new git repository containing a Python file with the temperature conversion functions above.

Things get interesting when we send our new repository to GitHub. Now everybody can see it, but do they know what it does? Can they use our code? We need to add documentation about this too, and GitHub makes this easy.

Create a new GitHub repository, push your code to it, and reload the page in your web browser.

We should add a file called "README" or "README.md". You can do this online using the + button to create a new file.

Create a README file for the repository

We also need to say how other people can use the code. All GitHub insists on is that it can show the software to other people, and that you are responsible for making sure that you don't upload other people's work if you are not permitted to (https://help.github.com/articles/github-terms-of-service/).

If we want other people to use our code we need to give them permission, and tell them what the rules are. Broadly speaking, there are two kinds of open license for software, and half a dozen for data and publications. For software, people can choose between the GNU General Public License (GPL) on the one hand, and licenses like the MIT and BSD licenses on the other. All of these licenses allow unrestricted sharing and modification of programs, but the GPL is infective: anyone who distributes a modified version of the code (or anything that includes GPL'd code) must make their code freely available as well.

Proponents of the GPL argue that this requirement is needed to ensure that people who are benefiting from freely-available code are also contributing back to the community. Opponents counter that many open source projects have had long and successful lives without this condition, and that the GPL makes it more difficult to combine code from different sources. At the end of the day, what matters most is that:

  1. every project have a file in its home directory called something like LICENSE or LICENSE.txt that clearly states what the license is, and
  2. people use existing licenses rather than writing new ones.

The second point is as important as the first: most scientists are not lawyers, so wording that may seem sensible to a layperson may have unintended gaps or consequences. The Open Source Initiative maintains a list of open source licenses, and tl;drLegal explains many of them in plain English. GitHub has also put together a tool to help people choose a license and built this process into their website.

Create a "LICENSE" file for the repository on the GitHub website, using its tool to choose a licence.

It is also worth thinking about text and data. Creative Commons, sometimes known as CC, have put effort into creating licenses that are better suited to text and data than the licences for programmes described above. CC licenses can include a combination of the following limitations on reuse:

  • BY (attribution). Derived works must give the original author credit for their work.
  • ND (no derivs). People may copy the work, but must pass it along unchanged.
  • NC (non commercial). Copies and derived works are allowed, but may not be used for commercial purposes.
  • SA (share alike). Derivative works must license their work under the same terms as the original.

These four restrictions are abbreviated "BY", "ND", "NC", and "SA" respectively, so "CC-BY-ND" means, "People can re-use the work both for free and commercially, but cannot make changes and must cite the original." These short descriptions summarize the six CC licenses in plain language, and include links to their full legal formulations.
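The way the abbreviations combine can be sketched as a small lookup. describe_cc_license and CC_TERMS are hypothetical names made up for this illustration, the descriptions are paraphrases and not legal text, and the function does not handle version numbers such as those in "CC-BY-4.0".

```python
# Plain-language paraphrases of the four Creative Commons restriction codes.
CC_TERMS = {
    "BY": "must give the original author credit",
    "ND": "must pass the work along unchanged",
    "NC": "must not use the work commercially",
    "SA": "must license derivative works under the same terms",
}


def describe_cc_license(name):
    """Return the list of restrictions implied by a name like 'CC-BY-ND'."""
    parts = name.upper().split("-")
    if parts[0] != "CC":
        raise ValueError("not a Creative Commons license name: %r" % name)
    # An unknown code raises KeyError, which flags a mistyped name.
    return [CC_TERMS[code] for code in parts[1:]]


for restriction in describe_cc_license("CC-BY-ND"):
    print(restriction)
```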

Software Carpentry uses CC-BY for its lessons and the MIT License for its code in order to encourage the widest possible re-use.

As scientists, we probably want people to cite our work and this should include our software. We can indicate how we want people to cite our work by including a CITATION file in the root of our repository.

Create a CITATION file outlining what paper should be cited. Finally, pull the changes back to your local repository.
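A quick check that a repository has the three documentation files discussed here might look like the sketch below. missing_project_files is a hypothetical helper, and treating any file extension (such as .md or .txt) as acceptable is an assumption made for this example.

```python
import os


# File names suggested in this lesson; each may carry an extension
# such as .md or .txt (an assumption of this sketch).
EXPECTED_FILES = ("README", "LICENSE", "CITATION")


def missing_project_files(repo_dir):
    """Return the expected documentation files that repo_dir lacks."""
    present = set()
    for entry in os.listdir(repo_dir):
        # Compare names without their extension and ignoring case.
        present.add(os.path.splitext(entry)[0].upper())
    return [name for name in EXPECTED_FILES if name not in present]


# Report what is missing from the current directory.
print(missing_project_files("."))
```

Run from the top level of your local repository after pulling, an empty list means all three files are in place.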

Key Points

  • Use help(thing) to view help for something.
  • Put docstrings in functions to provide help for that function.

  • Open scientific work is more useful and more highly cited than closed.

  • People who incorporate GPL'd software into theirs must make theirs open; most other open licenses do not require this.
  • The Creative Commons family of licenses allow people to mix and match requirements and restrictions on attribution, creation of derivative works, further sharing, and commercialization.
  • People who are not lawyers should not try to write licenses from scratch.
  • Projects can be hosted on university servers, on personal domains, or on public forges.
  • Rules regarding intellectual property and storage of sensitive information apply no matter where code and data are hosted.

Add docstrings to some of the functions you created earlier today.

Find out whether you are allowed to apply an open license to your software. Can you do this unilaterally, or do you need permission from someone in your institution? If so, who?