Text retrieval with the SQE API works for both authenticated and unauthenticated requests. Authenticated requests carry a JSON Web Token in the header of the request; this token is provided in the response to a successful login to an activated user account. If protected data is requested without proper authentication, an access error is returned.
This document describes access to publicly accessible transcriptions, so authentication is not relevant for this use case.
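Although none of the requests below need it, an authenticated request would differ only in its headers. The following is a minimal sketch, assuming the token is sent with the common "Authorization: Bearer" scheme. The helper name authHeaders and the placeholder token are hypothetical, and the exact header format should be confirmed in the SQE API documentation.

```python
def authHeaders(token=None):
    """Build request headers; an empty dict means an unauthenticated request."""
    if not token:
        return {}
    # Assumption: the API expects the JWT in a standard Bearer authorization header.
    return {"Authorization": f"Bearer {token}"}

print(authHeaders())                     # unauthenticated request
print(authHeaders("example.jwt.token"))  # placeholder token, not a real credential
```

The resulting dictionary could be passed as the headers argument of requests.get.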
In [ ]:
import sys, json, copy
from pprint import pprint

try:
    import requests
except ImportError:
    !conda install --yes --prefix {sys.prefix} requests
    import requests

try:
    from genson import SchemaBuilder
except ImportError:
    !conda install --yes --prefix {sys.prefix} genson
    from genson import SchemaBuilder

api = "https://api.qumranica.org/v1"
The SQE API accepts standard HTTP requests to defined endpoints and always returns a JSON object as a response. I highly recommend exploring the API using our interactive online SQE API documentation. There you can get a bird's-eye view of all the endpoints and read their descriptions, their possible inputs, and their outputs, including full specifications of all the data objects used in the communication.
Try, for instance, downloading a list of scrolls with the GET /editions endpoint.
In [ ]:
r = requests.get(f"{api}/editions")
editions = r.json()['editions']
for edition in editions[0:5]:  ## Let's only sample a couple entries
    print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))
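The request above assumes that the call succeeds. As a minimal sketch of a more defensive pattern, the helper below checks the HTTP status before decoding the JSON body. The name fetchJson is hypothetical; the getter is passed in as an argument so that the same function works with requests.get or with a stand-in during testing.

```python
def fetchJson(get, url):
    """Call get(url), raise on an HTTP error status, and return the decoded JSON body."""
    response = get(url)
    response.raise_for_status()  # requests raises HTTPError for 4xx/5xx responses
    return response.json()

# Intended use in this notebook (requires network access):
# editions = fetchJson(requests.get, f"{api}/editions")["editions"]
```

With this pattern, a failed request surfaces as an exception at the call site rather than as a confusing KeyError later on.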
You can also use the small Python function editionIdByManuscriptName below to find an edition_id in the API response by its canonical manuscript name. The function returns a list, since there may be more than one version of the edition; the first version listed is the parent from which all others were forked.
In [ ]:
def editionIdByManuscriptName(name):
    eid = []
    for edition in editions:
        for version in edition:
            if name == version['name']:
                eid.append(version['id'])
    return eid

manuscriptName = '4Q51'
selectedEdition = editionIdByManuscriptName(manuscriptName)
if len(selectedEdition) > 0:
    selectedEdition = selectedEdition[0]
    print(f"The edition id for primary version of {manuscriptName} is {selectedEdition}.")
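If several manuscripts need to be looked up, scanning the whole list each time is wasteful. The following is a minimal sketch of an alternative that builds a name-to-ids index once. The function name indexEditionsByName and the sample data are hypothetical; the sample only imitates the nested list-of-versions shape that the function above iterates over.

```python
from collections import defaultdict

def indexEditionsByName(editions):
    """Map each manuscript name to the list of edition ids that carry it."""
    index = defaultdict(list)
    for edition in editions:      # each entry is a list of versions
        for version in edition:
            index[version['name']].append(version['id'])
    return dict(index)

# Hypothetical sample data, not real edition ids.
sampleEditions = [
    [{'name': '4Q51', 'id': 101}, {'name': '4Q51', 'id': 205}],
    [{'name': '1QS', 'id': 102}],
]
editionIndex = indexEditionsByName(sampleEditions)
print(editionIndex.get('4Q51', []))
```

The ids for each name keep the order of the response, so the first entry is still the parent version.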
In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}")
edition = r.json()
print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))
Text in the SQE database is divided into sections of (presumably) continuous text called "text fragments". Text fragments are composed of lines, and lines are in turn composed of signs. Each sign can be part of one or more ordering schemes, can have one or more interpretations, and can be linked to one or more words.
The GET editions/{editionId}/text-fragments endpoint returns the list of text fragments for an edition, in the editor's suggested order.
In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments")
textFragments = r.json()["textFragments"]
for textFragment in textFragments[:10]:  ## Let's just look at the first ten
    pprint(textFragment, indent=2)
selectedTextFragment = textFragments[0]["id"]
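Because the list is returned in the editor's suggested order, a text fragment can be picked by its position in that list. This is safer than doing arithmetic on an id, since nothing in the response guarantees that ids are consecutive. A minimal sketch follows; the helper name textFragmentIdAt and the sample ids are hypothetical.

```python
def textFragmentIdAt(textFragments, index):
    """Return the id of the text fragment at `index`, or None if the list is too short."""
    if 0 <= index < len(textFragments):
        return textFragments[index]["id"]
    return None

# Hypothetical ids, chosen to show that ids need not be consecutive.
sampleTextFragments = [{"id": 9423}, {"id": 9424}, {"id": 9431}, {"id": 9440}]
print(textFragmentIdAt(sampleTextFragments, 3))
print(textFragmentIdAt(sampleTextFragments, 10))
```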
There are several different ways to work with transcribed text. After downloading it with the GET editions/{editionId}/text-fragments/{textFragmentId} endpoint, you may want to serialize it into something more human-friendly or better suited to your computational analysis. The transcriptions in the database form a directed acyclic graph (DAG), but this call provides ordered arrays along with the information needed to parse the DAG. The object returned is fairly complex, so I will go through it step by step. The returned object has the following schema, which is explained in detail below.
In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment}")
text = r.json()
builder = SchemaBuilder()
builder.add_object(text)
print(json.dumps(builder.to_schema(), indent=2, sort_keys=False, ensure_ascii=False))
An actual object looks like this.
In [ ]:
print(json.dumps(text, indent=2, sort_keys=True, ensure_ascii=False))
The text object contains several top-level properties. It contains a license with the copyright holder and collaborators, automatically generated from the user information in the database. It provides a list of editors (this serves as a key for all the editorId properties at all levels of the text object). And it provides the edition name and a unique manuscriptId.
In [ ]:
trimmedTextObject = copy.deepcopy(text)
del trimmedTextObject["textFragments"]  ## Leave only the top-level metadata
pprint(trimmedTextObject, depth=3)
The textFragments property contains a list of text fragments. In this case we asked for only one, so there is only one entity in the list. Each text fragment entity has a list of lines, which provides the line name, the line id, and a list of signs in the line (the signs have been removed here to make it more readable).
In [ ]:
pprint(text["textFragments"][0], depth=3)
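To get a quick overview of a text fragment without printing every sign, the lines can be summarized by name and sign count. This is a minimal sketch that relies only on the lineName and signs properties used elsewhere in this document; the function name signsPerLine and the miniature sample fragment are hypothetical.

```python
def signsPerLine(textFragment):
    """Return (lineName, number of signs) pairs for one text fragment entity."""
    return [(line['lineName'], len(line['signs'])) for line in textFragment['lines']]

# Hypothetical miniature fragment; real sign entries carry far more detail.
sampleFragment = {
    'textFragmentName': 'frg. 1',
    'lines': [
        {'lineName': '1', 'signs': [{}, {}, {}]},
        {'lineName': '2', 'signs': [{}]},
    ],
}
for lineName, count in signsPerLine(sampleFragment):
    print(f"line {lineName}: {count} signs")
```

In the notebook, the same function could be applied to text["textFragments"][0].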
The line contains a list of signs, each of which contains a list of interpretations and of possible next interpretations. The next interpretation ids can be used to reconstruct all possible reading orders of the signs. The order of signs in the list is the default ordering, which should match the order of the text on the manuscript itself. Each element has one or more sign interpretations in the "signInterpretations" property. These entities have an id and a "signInterpretation", which may be a character or may be empty if the sign interpretation has to do with formatting (like a space, the start of damage, etc.). The formatting metadata associated with the sign interpretation is in the "attributes" entity. Each attribute has an id, a code, and possibly a numerical value. The codes are:
attribute_value_id | name | string_value | description |
---|---|---|---|
1 | sign_type | LETTER | Type of char |
2 | sign_type | SPACE | Type of char |
3 | sign_type | POSSIBLE_VACAT | Type of char |
4 | sign_type | VACAT | Type of char |
5 | sign_type | DAMAGE | Type of char |
6 | sign_type | BLANK LINE | Type of char |
7 | sign_type | PARAGRAPH_MARKER | Type of char |
8 | sign_type | LACUNA | Type of char |
9 | sign_type | BREAK | Type of char |
10 | break_type | LINE_START | Defines a Metasign as marking of line |
11 | break_type | LINE_END | Defines a Metasign as marking of line |
12 | break_type | COLUMN_START | Defines a Metasign as marking of line |
13 | break_type | COLUMN_END | Defines a Metasign as marking of line |
14 | break_type | MANUSCRIPT_START | Defines a Metasign as marking of line |
15 | break_type | MANUSCRIPT_END | Defines a Metasign as marking of line |
17 | might_be_wider | TRUE | Set to true if the width of the sign might be wider than the given width |
18 | readability | INCOMPLETE_BUT_CLEAR | The trad. DJD marking of readability |
19 | readability | INCOMPLETE_AND_NOT_CLEAR | The trad. DJD marking of readability |
20 | is_reconstructed | TRUE | true if the letter is totally reconstructed (brackets are not part of the sign stream!) |
21 | editorial_flag | CONJECTURE | Opinions of the editor like conjecture |
22 | editorial_flag | SHOULD_BE_ADDED | Opinions of the editor like conjecture |
23 | editorial_flag | SHOULD_BE_DELETED | Opinions of the editor like conjecture |
24 | correction | OVERWRITTEN | Correction marks added by a scribe |
25 | correction | HORIZONTAL_LINE | Correction marks added by a scribe |
26 | correction | DIAGONAL_LEFT_LINE | Correction marks added by a scribe |
27 | correction | DIAGONAL_RIGHT_LINE | Correction marks added by a scribe |
28 | correction | DOT_BELOW | Correction marks added by a scribe |
29 | correction | DOT_ABOVE | Correction marks added by a scribe |
30 | correction | LINE_BELOW | Correction marks added by a scribe |
31 | correction | LINE_ABOVE | Correction marks added by a scribe |
32 | correction | BOXED | Correction marks added by a scribe |
33 | correction | ERASED | Correction marks added by a scribe |
34 | relative_position | ABOVE_LINE | Position relative to line context |
35 | relative_position | BELOW_LINE | Position relative to line context |
36 | relative_position | LEFT_MARGIN | Position relative to line context |
37 | relative_position | RIGHT_MARGIN | Position relative to line context |
38 | relative_position | MARGIN | Position relative to line context |
39 | relative_position | UPPER_MARGIN | Position relative to line context |
40 | relative_position | LOWER_MARGIN | Position relative to line context |
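The numeric codes are easier to work with once they are mapped back to their names. The following is a minimal sketch that copies a subset of the table above into a dictionary and uses it to label the attributes of one sign interpretation. The names ATTRIBUTE_VALUES and describeAttributes and the sample object are hypothetical; the attributeValueId key is the one used by the code later in this document.

```python
# A subset of the table above: attribute_value_id -> (name, string_value)
ATTRIBUTE_VALUES = {
    1: ("sign_type", "LETTER"),
    2: ("sign_type", "SPACE"),
    4: ("sign_type", "VACAT"),
    9: ("sign_type", "BREAK"),
    10: ("break_type", "LINE_START"),
    11: ("break_type", "LINE_END"),
    19: ("readability", "INCOMPLETE_AND_NOT_CLEAR"),
    20: ("is_reconstructed", "TRUE"),
}

def describeAttributes(signInterpretation):
    """Translate the attribute ids of one sign interpretation into readable labels."""
    labels = []
    for attribute in signInterpretation['attributes']:
        valueId = attribute['attributeValueId']
        # Ids missing from the subset above are reported as "unknown".
        name, value = ATTRIBUTE_VALUES.get(valueId, ("unknown", str(valueId)))
        labels.append(f"{name}={value}")
    return labels

# Hypothetical sign interpretation: a fully reconstructed letter.
sample = {
    'signInterpretation': 'א',
    'attributes': [{'attributeValueId': 1}, {'attributeValueId': 20}],
}
print(describeAttributes(sample))
```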
In [ ]:
trimmedSigns = text["textFragments"][0]["lines"][0]["signs"]
for sign in trimmedSigns[0:10]:
    pprint(sign)
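The next interpretation ids turn the signs into a directed acyclic graph, and every path through that graph is one possible reading order. As a minimal sketch of how those paths could be enumerated, the function below works on a plain dictionary that maps each sign interpretation id to the ids that may follow it. Building that dictionary from the API response is left out, because the exact property names for the next interpretation ids are not spelled out in this document. The function name readingOrders and the sample ids are hypothetical.

```python
def readingOrders(nextIds, start):
    """Return every path from `start` to a node with no successors (depth-first)."""
    successors = nextIds.get(start, [])
    if not successors:
        return [[start]]
    paths = []
    for successor in successors:
        for tail in readingOrders(nextIds, successor):
            paths.append([start] + tail)
    return paths

# Hypothetical graph: after sign 1 there are two alternatives (2 or 3),
# and both rejoin at sign 4.
sampleNextIds = {1: [2, 3], 2: [4], 3: [4], 4: []}
for path in readingOrders(sampleNextIds, 1):
    print(path)
```

The number of paths can grow quickly when many alternatives are present, and the recursion depth grows with path length, so for long texts an iterative traversal may be the better choice.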
Perhaps the simplest output type for this data is a string representation, which can be built by iterating over the data. In this example we omit reconstructed text (i.e., text with an attribute having the id 20; see the "20 not in attributes" check in readSigns below).
In [ ]:
def readFragments(text):
    formattedString = ""
    for textFragment in text['textFragments']:
        formattedString += f"\nText fragment {textFragment['textFragmentName']}:\n"
        formattedString = readLines(textFragment, formattedString)
    return formattedString

def readLines(textFragment, formattedString):
    for line in textFragment['lines']:
        formattedString += f"line {line['lineName']}:\n"
        formattedString = readSigns(line, formattedString) + "\n"
    return formattedString

def readSigns(line, formattedString):
    for sign in line['signs']:
        for signInterpretation in sign['signInterpretations']:
            attributes = list(map(lambda x: x['attributeValueId'], signInterpretation['attributes']))  ## Get a list of attribute ids
            if 20 not in attributes:  ## let's omit reconstructions (attribute id 20)
                if 1 in attributes:  ## id 1 marks a letter
                    formattedString += signInterpretation['signInterpretation']
                elif 2 in attributes:  ## id 2 marks a space
                    formattedString += " "
    return formattedString

r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}")  ## Let's grab a bigger text
text = r.json()
print(readFragments(text))
In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}")  ## Let's grab a bigger text
text = r.json()
simplifiedTextObject = {}
for textFragment in text['textFragments']:
    simplifiedTextObject[textFragment["textFragmentName"]] = []
    for line in textFragment['lines']:
        lineObject = {}
        lineObject[line['lineName']] = []
        for sign in line['signs']:
            for signInterpretation in sign['signInterpretations']:
                attributes = list(map(lambda x: x['attributeValueId'], signInterpretation['attributes']))  ## Get a list of attribute ids
                if 20 not in attributes:  ## let's omit reconstructions (attribute id 20)
                    if 1 in attributes:  ## id 1 marks a letter
                        lineObject[line['lineName']].append(signInterpretation['signInterpretation'])
                    elif 2 in attributes:  ## id 2 marks a space
                        lineObject[line['lineName']].append(" ")
        simplifiedTextObject[textFragment["textFragmentName"]].append(lineObject)
pprint(simplifiedTextObject, indent=2)
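The simplified object stores each line as a list of single characters. If whole strings are more convenient, for example when exporting to JSON, the lists can be joined afterwards. A minimal sketch follows; the function name joinLines and the sample object with Latin placeholder characters are hypothetical, and the sample only imitates the shape of simplifiedTextObject.

```python
import json

def joinLines(simplified):
    """Collapse each line's list of characters into a single string."""
    return {
        fragmentName: [
            {lineName: "".join(characters) for lineName, characters in lineObject.items()}
            for lineObject in lines
        ]
        for fragmentName, lines in simplified.items()
    }

# Hypothetical stand-in for simplifiedTextObject.
sampleSimplified = {"frg. 1": [{"1": ["a", "b", " ", "c"]}, {"2": ["d"]}]}
print(json.dumps(joinLines(sampleSimplified), indent=2, ensure_ascii=False))
```

Passing ensure_ascii=False keeps Hebrew characters readable in the JSON output instead of escaping them.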