Python can also be used to create and modify Word documents.
The python-docx module can interact with Word document files of the .docx file type. While the module is installed as python-docx, it is imported as docx.
In [6]:
import docx
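Documents have more structure than plaintext files. In this module they are represented by three kinds of objects: the document object, the paragraph objects it contains, and the run objects inside each paragraph.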
We first navigate into the subdirectory containing the Word documents, using relative or absolute paths.
In [3]:
import os
os.chdir('files')
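As an alternative to changing the working directory, we can build the path to the file and pass that to the module instead. This is a minimal sketch that assumes the same 'files' subdirectory and a file named 'demo.docx' inside it; it only builds the paths and does not open anything.

```python
import os

# Assumed layout: a 'files' subdirectory next to the notebook.
folder = 'files'

# Relative path, built without calling os.chdir().
relative_path = os.path.join(folder, 'demo.docx')

# The matching absolute path, resolved against the current working directory.
absolute_path = os.path.abspath(relative_path)

print(relative_path)
print(absolute_path)
```

Either path can be passed to Document() as a string, and the working directory stays the same for the rest of the notebook.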
We then use the Document() function to open the Word document as a document object.
For this example, we will be using 'demo.docx'.
In [17]:
d = docx.Document('demo.docx')
In [18]:
type(d)
Out[18]:
We can list the paragraphs within this document object.
A paragraph object is any block of text separated by a newline (pressing 'Enter').
In [19]:
d.paragraphs
Out[19]:
In [20]:
d.paragraphs[0]
Out[20]:
Every paragraph object includes a .text attribute containing that paragraph's text as a string.
In [21]:
d.paragraphs[0].text
Out[21]:
In [22]:
d.paragraphs[1].text
Out[22]:
As you can see, these paragraph objects match the text in the document.
We will save the paragraph object to a holder variable.
In [23]:
p = d.paragraphs[1]
Paragraph objects hold run objects.
A run object is any stretch of text within a paragraph that has a different style (bold, underline, size, etc.) from the text next to it.
In [16]:
p.runs
Out[16]:
In [25]:
p.runs[0].text
Out[25]:
In [26]:
p.runs[1].text
Out[26]:
In [27]:
p.runs[2].text
Out[27]:
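Run objects have .bold and .italic attributes, identifying whether that run has that style property. The value is True or False when the property is set explicitly, and None when the run inherits the setting from its style.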
In [28]:
p.runs[2].bold
Out[28]:
We can set style attributes in the same way.
In [31]:
p.runs[3].text
p.runs[3].underline = True
We can save our changes by using the .save() method on the document object.
It will be saved as 'demo2.docx'.
In [32]:
d.save('demo2.docx')
Word also has named styles such as 'Normal' or 'Heading 1'. These can be read and set through a paragraph object's .style attribute.
In [34]:
p.style = 'Title'
d.save('demo2.docx')
This module can also be used to create blank Word documents in memory by calling the Document() function without a filename.
In [35]:
d = docx.Document()
We can now add paragraphs to this document object.
In [36]:
d.add_paragraph('This is a paragraph.')
Out[36]:
In [37]:
d.add_paragraph('This is another paragraph.')
Out[37]:
We can save this created document to the hard drive using the .save() method again. It will be saved here as demo4.docx.
In [39]:
d.save('demo4.docx')
We can also create runs within these paragraph objects.
In [40]:
# Select the first paragraph we created.
p = d.paragraphs[0]
# Add a new run to that paragraph.
p.add_run('This is a new run')
# Make the newly added run (index 1) bold
p.runs[1].bold = True
# Save changes
d.save('demo4.docx')
The .add_paragraph() and .add_run() methods only append to the end of a document or paragraph, but there is a way to edit files by opening a new document and mirroring the old document with changes.
It still allows for some efficient batch management. Here is an example of a function to get all the text from a Word document.
In [44]:
import docx
# Create a function to get the text from a file
def getText(filename):
    # Read the Word document
    doc = docx.Document(filename)
    # Create an empty list to store each paragraph's text
    fullText = []
    # For each paragraph in the document's paragraph list
    for para in doc.paragraphs:
        # Append that paragraph's text to the list
        fullText.append(para.text)
    # Join the full text with newlines
    return '\n'.join(fullText)
We will test it on demo.docx and demo4.docx.
In [49]:
print(getText('demo.docx'))
print(getText('demo4.docx'))
The python-docx module can read and write .docx Word files; it is imported as docx.
The Document() function opens an existing Word document or creates a new blank one.
The .add_paragraph() and .add_run() methods append text content.