Retrieving text with the SQE API

Text retrieval with the SQE API works for both authenticated and unauthenticated requests. Authenticated requests carry a JSON Web Token in the header of the request. This token is provided in the response to a successful login to an activated user account. If protected data is requested without proper authentication, an access error is returned.

This document will describe access to publicly accessible transcriptions, so the issue of authentication is not relevant for this use case.
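
For readers who later need protected data, the following is a minimal sketch of how a token might be attached to a request. The `Authorization: Bearer` scheme, the helper name `build_auth_headers`, and the placeholder token are assumptions made for illustration; the SQE API documentation should be checked for the exact header it expects.

```python
def build_auth_headers(token=None):
    """Return request headers, adding a token only when one is given."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the JSON Web Token is sent with the common "Bearer" scheme.
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Public data needs no token; protected data would use the token from login.
public_headers = build_auth_headers()
private_headers = build_auth_headers("example-token")  # placeholder, not a real token
print(public_headers)
print(private_headers)
```

A dictionary like this can be passed to requests through the `headers` argument, as in `requests.get(url, headers=private_headers)`.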

First, pull in the dependencies.


In [ ]:
import sys, json, copy
from pprint import pprint

try:
    import requests
except ImportError:
    !conda install --yes --prefix {sys.prefix} requests
    import requests
    
try:
    from genson import SchemaBuilder
except ImportError:
    !conda install --yes --prefix {sys.prefix} genson
    from genson import SchemaBuilder

api = "https://api.qumranica.org/v1"

Making requests

The SQE API accepts standard HTTP requests to defined endpoints and always returns a JSON object in response. I highly recommend exploring the API using our interactive online SQE API documentation. There you can get a bird's-eye view of all the endpoints and read their descriptions, their possible inputs, and their outputs, including full specifications of all the data objects used in the communication.

Finding all available scrolls

Try, for instance, downloading a list of scrolls with the GET /editions endpoint.


In [ ]:
r = requests.get(f"{api}/editions")
editions = r.json()['editions']
for edition in editions[0:5]: ## Let's only sample a couple entries
    print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))

You can also use the small Python function editionIdByManuscriptName below to find an edition id in the API response by its canonical manuscript name. The function returns a list, since there may be more than one version of the edition; the first version listed is the parent from which all the others were forked.


In [ ]:
def editionIdByManuscriptName(name):
    eid = []
    for edition in editions:
        for version in edition:
            if name == version['name']:
                eid.append(version['id'])
    return eid

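manuscriptName = '4Q51'
matchingEditions = editionIdByManuscriptName(manuscriptName)
if matchingEditions:
    selectedEdition = matchingEditions[0] ## The first version listed is the primary one
    print(f"The edition id for the primary version of {manuscriptName} is {selectedEdition}.")
else:
    selectedEdition = None
    print(f"No edition named {manuscriptName} was found.")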
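
When many manuscripts need to be looked up, scanning the whole list each time becomes wasteful. One option, sketched here, is to build a dictionary index once. The function name `index_editions_by_name` and the sample ids are made up; the sample mirrors the list-of-version-lists shape that editionIdByManuscriptName assumes.

```python
from collections import defaultdict


def index_editions_by_name(editions):
    """Map each manuscript name to the ids of its versions, in listed order."""
    index = defaultdict(list)
    for edition in editions:
        for version in edition:
            index[version["name"]].append(version["id"])
    return dict(index)


# Invented sample data: two versions of one manuscript, one version of another.
sample_editions = [
    [{"id": 1, "name": "4Q51"}, {"id": 7, "name": "4Q51"}],
    [{"id": 2, "name": "4Q52"}],
]
edition_index = index_editions_by_name(sample_editions)
print(edition_index.get("4Q51", []))  # ids of every listed version
print(edition_index.get("1QS", []))   # an empty list when nothing matches
```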

Information about a specific edition

The GET editions/{editionId} endpoint provides detailed information about the requested edition, including its primary version and any derivative versions.


In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}")
edition = r.json()
print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))

Information about the transcribed text

Text in the SQE database is divided into sections of (presumably) continuous text called "text fragments". Text fragments are composed of lines, and lines are in turn composed of signs. Each sign can be part of one or more ordering schemes, can have one or more interpretations, and can be linked to one or more words.
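
The nesting described above can be sketched as a small traversal. The key names follow those used by the serialization code later in this document; the sample object and the function name `iter_sign_interpretations` are invented for illustration and are far smaller than a real response.

```python
def iter_sign_interpretations(text):
    """Yield (fragment name, line name, sign interpretation) in list order."""
    for fragment in text["textFragments"]:
        for line in fragment["lines"]:
            for sign in line["signs"]:
                for interpretation in sign["signInterpretations"]:
                    yield (fragment["textFragmentName"],
                           line["lineName"],
                           interpretation)


# Invented sample: one fragment, one line, a letter followed by a space.
sample_text = {
    "textFragments": [{
        "textFragmentName": "frg. 1",
        "lines": [{
            "lineName": "1",
            "signs": [
                {"signInterpretations": [
                    {"signInterpretation": "א",
                     "attributes": [{"attributeValueId": 1}]}]},
                {"signInterpretations": [
                    {"signInterpretation": "",
                     "attributes": [{"attributeValueId": 2}]}]},
            ],
        }],
    }]
}

rows = list(iter_sign_interpretations(sample_text))
for fragment_name, line_name, interpretation in rows:
    print(fragment_name, line_name, repr(interpretation["signInterpretation"]))
```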

The GET editions/{editionId}/text-fragments endpoint returns the list of text fragments for an edition, in the editor's suggested order.


In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments")
textFragments = r.json()["textFragments"]
for textFragment in textFragments[:10]: ## Let's just look at the first ten
    pprint(textFragment, indent=2)
selectedTextFragment = textFragments[0]["id"]

Transcriptions

There are several different ways to work with transcribed text. After downloading it with the GET editions/{editionId}/text-fragments/{textFragmentId} endpoint, you may want to serialize it into something more human-friendly or better suited to your computational analysis. The transcriptions in the database form a directed acyclic graph (DAG), but this call provides ordered arrays along with the information needed to parse the DAG. The returned object is fairly complex, so I will go through it step by step. It has the following schema, which is explained in detail below.


In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment}")
text = r.json()

builder = SchemaBuilder()
builder.add_object(text)
print(json.dumps(builder.to_schema(), indent=2, sort_keys=False, ensure_ascii=False))

An actual object looks like this.


In [ ]:
print(json.dumps(text, indent=2, sort_keys=True, ensure_ascii=False))

Structure of the text object

The text object contains several top-level properties. It contains a license with the copyright holder and collaborators, automatically generated from the user information in the database. It provides a list of editors, which serves as a key for the editorId properties at all levels of the text object. It also provides the edition name and a unique manuscriptId.


In [ ]:
trimmedTextObject = copy.deepcopy(text)
del trimmedTextObject["textFragments"]

pprint(trimmedTextObject, depth=3)

Nested objects

The textFragments property contains a list of text fragments. In this case we asked for only one, so there is only one entity in the list. Each text fragment entity has a list of lines, which provides the line name, the line id, and a list of signs in the line (the signs have been removed here to make it more readable).


In [ ]:
pprint(text["textFragments"][0], depth=3)

Lines and Sign interpretation metadata

A line contains a list of signs, each of which contains a list of interpretations and of possible next interpretations. The next interpretation ids can be used to reconstruct all possible reading orders of the signs. The order of signs in the list is the default ordering, which should match the order of the text on the manuscript itself. Each sign has one or more sign interpretations in its "signInterpretations" property. Each of these entities has an id and a "signInterpretation", which may be a character or may be empty if the sign interpretation concerns formatting (a space, the start of damage, etc.). The formatting metadata associated with the sign interpretation is in the "attributes" entity. Each attribute has an id, a code, and possibly a numerical value. The codes are:

attribute_value_id  name               string_value              description
1                   sign_type          LETTER                    Type of char
2                   sign_type          SPACE                     Type of char
3                   sign_type          POSSIBLE_VACAT            Type of char
4                   sign_type          VACAT                     Type of char
5                   sign_type          DAMAGE                    Type of char
6                   sign_type          BLANK LINE                Type of char
7                   sign_type          PARAGRAPH_MARKER          Type of char
8                   sign_type          LACUNA                    Type of char
9                   sign_type          BREAK                     Type of char
10                  break_type         LINE_START                Defines a Metasign as marking of line
11                  break_type         LINE_END                  Defines a Metasign as marking of line
12                  break_type         COLUMN_START              Defines a Metasign as marking of line
13                  break_type         COLUMN_END                Defines a Metasign as marking of line
14                  break_type         MANUSCRIPT_START          Defines a Metasign as marking of line
15                  break_type         MANUSCRIPT_END            Defines a Metasign as marking of line
17                  might_be_wider     TRUE                      Set to true if the width of the sign might be wider than the given width
18                  readability        INCOMPLETE_BUT_CLEAR      The trad. DJD marking of readability
19                  readability        INCOMPLETE_AND_NOT_CLEAR  The trad. DJD marking of readability
20                  is_reconstructed   TRUE                      true if the letter is totally reconstructed (brackets are not part of the sign stream!)
21                  editorial_flag     CONJECTURE                Opinions of the editor like conjecture
22                  editorial_flag     SHOULD_BE_ADDED           Opinions of the editor like conjecture
23                  editorial_flag     SHOULD_BE_DELETED         Opinions of the editor like conjecture
24                  correction         OVERWRITTEN               Correction marks added by a scribe
25                  correction         HORIZONTAL_LINE           Correction marks added by a scribe
26                  correction         DIAGONAL_LEFT_LINE        Correction marks added by a scribe
27                  correction         DIAGONAL_RIGHT_LINE       Correction marks added by a scribe
28                  correction         DOT_BELOW                 Correction marks added by a scribe
29                  correction         DOT_ABOVE                 Correction marks added by a scribe
30                  correction         LINE_BELOW                Correction marks added by a scribe
31                  correction         LINE_ABOVE                Correction marks added by a scribe
32                  correction         BOXED                     Correction marks added by a scribe
33                  correction         ERASED                    Correction marks added by a scribe
34                  relative_position  ABOVE_LINE                Position relative to line context
35                  relative_position  BELOW_LINE                Position relative to line context
36                  relative_position  LEFT_MARGIN               Position relative to line context
37                  relative_position  RIGHT_MARGIN              Position relative to line context
38                  relative_position  MARGIN                    Position relative to line context
39                  relative_position  UPPER_MARGIN              Position relative to line context
40                  relative_position  LOWER_MARGIN              Position relative to line context
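
To make the numeric ids easier to read in code, the table can be turned into a lookup. The sketch below copies only an excerpt of the rows above; the names `ATTRIBUTE_VALUES` and `describe_attributes` and the sample interpretation are hypothetical, and the "attributeValueId" key follows the code used later in this document.

```python
# Excerpt of the table above: attribute_value_id -> (name, string_value).
ATTRIBUTE_VALUES = {
    1: ("sign_type", "LETTER"),
    2: ("sign_type", "SPACE"),
    4: ("sign_type", "VACAT"),
    5: ("sign_type", "DAMAGE"),
    9: ("sign_type", "BREAK"),
    10: ("break_type", "LINE_START"),
    11: ("break_type", "LINE_END"),
    18: ("readability", "INCOMPLETE_BUT_CLEAR"),
    19: ("readability", "INCOMPLETE_AND_NOT_CLEAR"),
    20: ("is_reconstructed", "TRUE"),
}


def describe_attributes(sign_interpretation):
    """Group a sign interpretation's attribute values by attribute name."""
    described = {}
    for attribute in sign_interpretation["attributes"]:
        value_id = attribute["attributeValueId"]
        # Ids missing from the excerpt are reported under "unknown".
        name, value = ATTRIBUTE_VALUES.get(value_id, ("unknown", str(value_id)))
        described.setdefault(name, []).append(value)
    return described


# Invented sample: a fully reconstructed letter.
sample_interpretation = {
    "signInterpretation": "ב",
    "attributes": [{"attributeValueId": 1}, {"attributeValueId": 20}],
}
print(describe_attributes(sample_interpretation))
```

Values are collected in lists because one sign interpretation may carry several values of the same attribute, for example more than one correction mark.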

In [ ]:
trimmedSigns = text["textFragments"][0]["lines"][0]["signs"]
for sign in trimmedSigns[0:10]:
    pprint(sign)
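
The idea of reconstructing reading orders from next interpretation ids can be illustrated with a plain path enumeration over a DAG. This document does not show the name of the property that carries the next ids, so the sketch takes a hypothetical dictionary mapping each sign interpretation id to the ids that may follow it; the ids are made up.

```python
def reading_orders(next_ids, start):
    """Return every path from `start` to an id with no successors (DAG assumed)."""
    successors = next_ids.get(start, [])
    if not successors:
        return [[start]]
    paths = []
    for following in successors:
        for tail in reading_orders(next_ids, following):
            paths.append([start] + tail)
    return paths


# Made-up ids: after 1 the text continues with 2 or 3, and both rejoin at 4.
sample_next_ids = {1: [2, 3], 2: [4], 3: [4], 4: []}
for path in reading_orders(sample_next_ids, 1):
    print(path)
# prints:
# [1, 2, 4]
# [1, 3, 4]
```

The number of paths can grow quickly when a text has many alternative readings, so for longer passages it may be preferable to follow a single ordering scheme rather than enumerate them all.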

Serializing the data to a string

Perhaps the simplest output type for this data is a string representation, which can be built by iterating over the data. In this example we omit reconstructed text, i.e., text with an attribute having the id 20 (see the check on attribute id 20 in readSigns below).


In [ ]:
def readFragments(text):
    formattedString = ""
    for textFragment in text['textFragments']:
        formattedString += f"\nText fragment {textFragment['textFragmentName']}:\n"
        formattedString = readLines(textFragment, formattedString)
        
    return formattedString

def readLines(textFragment, formattedString):
    for line in textFragment['lines']:
        formattedString += f"line {line['lineName']}:\n"
        formattedString = readSigns(line, formattedString) + "\n"
        
    return formattedString

def readSigns(line, formattedString):
    for sign in line['signs']:
        for signInterpretation in sign['signInterpretations']:
            attributes = [x['attributeValueId'] for x in signInterpretation['attributes']] ## Get a list of attribute ids
            if 20 not in attributes: ## let's omit reconstructions (attribute id 20)
                if 1 in attributes: ## id 1 marks a letter
                    formattedString += signInterpretation['signInterpretation']
                elif 2 in attributes: ## id 2 marks a space
                    formattedString += " "

    return formattedString
        
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}") ## Let's grab a bigger text
text = r.json()

print(readFragments(text))

Serializing the data to a simpler object

We can also serialize the data to a simpler data structure for computational purposes.


In [ ]:
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}") ## Let's grab a bigger text
text = r.json()

simplifiedTextObject = {}
for textFragment in text['textFragments']:
    simplifiedTextObject[textFragment["textFragmentName"]] = []
    
    for line in textFragment['lines']:
        lineObject = {}
        lineObject[line['lineName']] = []
        
        for sign in line['signs']:
            for signInterpretation in sign['signInterpretations']:
                attributes = [x['attributeValueId'] for x in signInterpretation['attributes']] ## Get a list of attribute ids
                if 20 not in attributes: ## let's omit reconstructions (attribute id 20)
                    if 1 in attributes: ## id 1 marks a letter
                        lineObject[line['lineName']].append(signInterpretation['signInterpretation'])
                    elif 2 in attributes: ## id 2 marks a space
                        lineObject[line['lineName']].append(" ")
                        
        simplifiedTextObject[textFragment["textFragmentName"]].append(lineObject)

pprint(simplifiedTextObject, indent=2)
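
A structure like the one above can be reduced further, for instance to a list of words per line. The sketch below assumes the same shape (fragment name mapped to a list of single-key line objects holding characters) and splits on whitespace; `lines_to_words` and the sample data are hypothetical. Splitting on spaces is only an approximation and does not handle words that continue across a line break.

```python
def lines_to_words(simplified):
    """Turn {fragment: [{line: [chars]}]} into {fragment: {line: [words]}}."""
    result = {}
    for fragment_name, line_objects in simplified.items():
        result[fragment_name] = {}
        for line_object in line_objects:
            for line_name, chars in line_object.items():
                # Join the characters, then split on runs of whitespace.
                result[fragment_name][line_name] = "".join(chars).split()
    return result


# Invented sample with Latin placeholder letters.
sample_simplified = {
    "frg. 1": [
        {"1": ["a", "b", " ", "c"]},
        {"2": ["d", " ", " ", "e", "f"]},
    ]
}
words = lines_to_words(sample_simplified)
print(words)
```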