Lesson 45:

Reading and Editing Word Documents

Python can also be used to create and modify Word documents.

The python-docx module reads and writes Word files in the .docx format. Although the package is installed under the name python-docx (for example, with pip install python-docx), it is imported as docx.


In [6]:
import docx

Word documents have more structure than plaintext files. The module represents that structure with the following objects:

  • The entire document is contained within a Document object.
  • The Document object contains Paragraph objects, one for each paragraph in the document.
  • Each Paragraph object contains one or more Run objects.

We first navigate into the subdirectory containing the Word documents, using relative or absolute paths.


In [3]:
import os

os.chdir('files')

We then use the Document() function to open the Word document as a Document object.

For this example, we will be using 'demo.docx'.


In [17]:
d = docx.Document('demo.docx')

In [18]:
type(d)


Out[18]:
docx.document.Document

We can list the paragraphs within this document object.

A paragraph object is any block of text ended by a paragraph break (pressing 'Enter').


In [19]:
d.paragraphs


Out[19]:
[<docx.text.paragraph.Paragraph at 0x112b18668>,
 <docx.text.paragraph.Paragraph at 0x112b18780>,
 <docx.text.paragraph.Paragraph at 0x112b186d8>,
 <docx.text.paragraph.Paragraph at 0x112b18ba8>,
 <docx.text.paragraph.Paragraph at 0x112b186a0>,
 <docx.text.paragraph.Paragraph at 0x112b187f0>,
 <docx.text.paragraph.Paragraph at 0x112b18748>]

In [20]:
d.paragraphs[0]


Out[20]:
<docx.text.paragraph.Paragraph at 0x112b18a90>

Every paragraph object has a .text attribute containing that paragraph's text as a string.


In [21]:
d.paragraphs[0].text


Out[21]:
'Document Title'

In [22]:
d.paragraphs[1].text


Out[22]:
'A plain paragraph with some bold and some italic'

As you can see, these paragraph objects match the text in the document.

We will save the paragraph object to a holder variable.


In [23]:
p = d.paragraphs[1]

Paragraph objects hold run objects.

A run object is a stretch of text within a paragraph that shares the same formatting; a new run begins wherever the style changes (bold, underline, size, etc.).


In [16]:
p.runs


Out[16]:
[<docx.text.run.Run at 0x112b12ac8>,
 <docx.text.run.Run at 0x112b12cc0>,
 <docx.text.run.Run at 0x112b12198>,
 <docx.text.run.Run at 0x112b12ba8>,
 <docx.text.run.Run at 0x112b12d30>]

In [25]:
p.runs[0].text


Out[25]:
'A plain paragraph with'

In [26]:
p.runs[1].text


Out[26]:
' some '

In [27]:
p.runs[2].text


Out[27]:
'bold'

Run objects have .bold, .italic, and .underline attributes that indicate whether that run has the corresponding style property.


In [28]:
p.runs[2].bold


Out[28]:
True

We can also set style attributes by assigning to them.


In [31]:
p.runs[3].text
p.runs[3].underline = True

We can save our changes by using the .save method on the document object.

It will be saved as 'demo2.docx'.


In [32]:
d.save('demo2.docx')

Word also has document themes and styles such as 'Normal' or 'Heading 1', etc. These are accessible for paragraph objects.


In [34]:
p.style = 'Title' 
d.save('demo2.docx')

This module can also be used to create blank Word documents in memory by calling the Document() function with no arguments.


In [35]:
d = docx.Document()

We can now add paragraphs to this document object.


In [36]:
d.add_paragraph('This is a paragraph.')


Out[36]:
<docx.text.paragraph.Paragraph at 0x112b22f28>

In [37]:
d.add_paragraph('This is another paragraph.')


Out[37]:
<docx.text.paragraph.Paragraph at 0x112b22e48>

We can save this created document to the hard drive using the .save() method again. It will be saved here as demo4.docx.


In [39]:
d.save('demo4.docx')

We can also create runs within these paragraph objects.


In [40]:
# Select the first paragraph we created.
p = d.paragraphs[0]

# Add a new run to that paragraph.
p.add_run('This is a new run')

# Make the newly added run (index 1) bold
p.runs[1].bold = True

# Save changes
d.save('demo4.docx')

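The add_paragraph() and add_run() methods only append to the end of a document or paragraph. (python-docx does offer insert_paragraph_before() on paragraph objects, but there is no equivalent for runs.) A general way to edit a file is to open a new document and mirror the old document into it with the changes applied.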

Even so, the module allows for some efficient batch processing. Here is an example of a function that gets all the text from a Word document.


In [44]:
import docx

# Create a function to get text from a file
def getText(filename):
    # Read the word document
    doc = docx.Document(filename)
    # Create an empty list to store each paragraph's text
    fullText = []
    # For each paragraph in the document's paragraph list
    for para in doc.paragraphs:
        # Append the paragraph's text to the list
        fullText.append(para.text)
    # Join the full text with newlines
    return '\n'.join(fullText)

We will test it on demo.docx and demo4.docx.


In [49]:
print(getText('demo.docx'))

print(getText('demo4.docx'))


Document Title
A plain paragraph with some bold and some italic
Heading, level 1
Intense quote
first item in unordered list
first item in ordered list


This is a paragraph.This is a new run
This is another paragraph.

Recap

  • The python-docx module can read and write .docx Word files; it is imported as docx.
  • It can open a Word file with the Document() function.
  • Access one of the Paragraph objects from the paragraphs member variable, which is a list of Paragraph objects.
  • Paragraph objects have a text member variable containing the text as a string value.
  • Paragraphs are composed of "runs". The runs member variable of a Paragraph object contains a list of Run objects.
  • Run objects also have a text member variable.
  • Run objects have bold, italic, and underline member variables, which can be set to True or False.
  • Paragraph and run objects have a style member variable that can be set to one of Word's built-in styles.
  • Word files can be created by calling the .add_paragraph() and .add_run() methods to append text content.