Retrieving text with the SQE API

Text retrieval with the SQE API works for both authenticated and unauthenticated requests. Authenticated requests carry a JSON Web Token in the header of the request; this token is provided in the response to a successful login to an activated user account. If protected data is requested without proper authentication, an access error is returned.

This document describes access to publicly accessible transcriptions, so authentication is not needed for this use case.
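For requests that do need authentication, the token has to be attached to every request. The following is a minimal sketch only: it assumes the token is sent in the conventional `Authorization: Bearer ...` header, which this document does not confirm (the SQE API documentation is the authority on the exact header), and `build_headers` is a hypothetical helper name.

```python
def build_headers(token=None):
    """Return HTTP headers for an SQE API request.

    Assumption: the JSON Web Token is sent as a standard Bearer token.
    With no token, the request is simply unauthenticated.
    """
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


print(build_headers())          # public data: no Authorization header
print(build_headers("my.jwt"))  # placeholder token, not a real one
```

The resulting dictionary can be passed to requests through its `headers` argument, for example `requests.get(url, headers=build_headers(token))`.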

First, pull in the dependencies.



In [ ]:

    
import sys, json, copy
from pprint import pprint

try:
    import requests
except ImportError:
    !conda install --yes --prefix {sys.prefix} requests
    import requests
    
try:
    from genson import SchemaBuilder
except ImportError:
    !conda install --yes --prefix {sys.prefix} genson
    from genson import SchemaBuilder

api = "https://api.qumranica.org/v1"

Making requests

The SQE API accepts standard HTTP requests to defined endpoints and always returns a JSON object as a response. I highly recommend exploring the API using our interactive online SQE API documentation. There you can get a bird's-eye view of all the endpoints and read descriptions of each endpoint, its possible inputs, and its outputs, including full specifications of all the data objects used in the communication.
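Since every endpoint returns JSON, the request pattern can be wrapped once and reused. This is a sketch under stated assumptions: `get_json` is a hypothetical helper, it assumes that failures such as an access error are signalled through HTTP status codes, and it takes the session as a parameter so that it works with `requests.Session()` or any object exposing a requests-style `get` method.

```python
API = "https://api.qumranica.org/v1"


def get_json(session, path, api=API, timeout=30):
    """GET an endpoint and return the decoded JSON body.

    `session` is anything with a requests-style `get` method,
    for example `requests.Session()`.
    """
    url = f"{api.rstrip('/')}/{path.lstrip('/')}"
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

With requests installed, `get_json(requests.Session(), "editions")` would fetch the edition list used in the next section. Reusing one session also keeps the underlying connection open across calls.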

Finding all available scrolls

Try, for instance, downloading a list of scrolls with the GET /editions endpoint.



In [ ]:

    
r = requests.get(f"{api}/editions")
editions = r.json()['editions']
for edition in editions[0:5]: ## Let's only sample a couple entries
    print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))

You can also use the small Python function editionIdByManuscriptName to find an edition_id in the API response by its canonical manuscript name. The function returns a list, since there may be more than one version of the edition; the first version listed is the parent from which all others were forked.



In [ ]:

    
def editionIdByManuscriptName(name):
    eid = []
    for edition in editions:
        for version in edition:
            if name == version['name']:
                eid.append(version['id'])
    return eid

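manuscriptName = '4Q51'
matchingEditions = editionIdByManuscriptName(manuscriptName)
if not matchingEditions: ## Stop here rather than continue with an empty list as the id
    raise ValueError(f"No edition named {manuscriptName} was found.")
selectedEdition = matchingEditions[0] ## The first match is the primary version
print(f"The edition id for primary version of {manuscriptName} is {selectedEdition}.")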

Information about a specific edition

The GET editions/{editionId} endpoint provides detailed information about the requested edition, including its primary version and any derivative versions.



In [ ]:

    
r = requests.get(f"{api}/editions/{selectedEdition}")
edition = r.json()
print(json.dumps(edition, indent=2, sort_keys=True, ensure_ascii=False))

Information about the transcribed text

Text in the SQE database is divided into sections of (presumably) continuous text called "text fragments". Text fragments are composed of lines, and lines are composed of signs. Each sign can be part of one or more ordering schemes, can have one or more interpretations, and can be linked to one or more words.
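The hierarchy just described can be pictured as a handful of nested record types. The sketch below is purely illustrative: the class and field names are hypothetical and are not the API's own property names, links to words are left out, and Latin letters stand in for real characters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SignInterpretation:
    id: int
    character: str = ""  # empty for formatting signs
    attribute_ids: List[int] = field(default_factory=list)
    next_ids: List[int] = field(default_factory=list)  # interpretations that may follow


@dataclass
class Sign:
    interpretations: List[SignInterpretation] = field(default_factory=list)


@dataclass
class Line:
    name: str
    signs: List[Sign] = field(default_factory=list)


@dataclass
class TextFragment:
    name: str
    lines: List[Line] = field(default_factory=list)


# A tiny hand-made example: one fragment, one line, two letters.
first = SignInterpretation(id=1, character="a", attribute_ids=[1], next_ids=[2])
second = SignInterpretation(id=2, character="b", attribute_ids=[1])
fragment = TextFragment("frg. 1", [Line("1", [Sign([first]), Sign([second])])])

letters = [
    interpretation.character
    for line in fragment.lines
    for sign in line.signs
    for interpretation in sign.interpretations
]
print("".join(letters))
```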

The GET editions/{editionId}/text-fragments endpoint returns the list of text fragments for an edition, in the editor's suggested order.



In [ ]:

    
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments")
textFragments = r.json()["textFragments"]
for textFragment in textFragments[0:10]: ## Let's just look at the first ten (slicing stops safely at the end of a shorter list)
    pprint(textFragment, indent=2)
selectedTextFragment = textFragments[0]["id"]

Transcriptions

There are several different ways to work with transcribed text. After downloading it with the GET editions/{editionId}/text-fragments/{textFragmentId} endpoint, you may want to serialize it into something more human-friendly or better suited to your computational analysis. The transcriptions in the database form a directed acyclic graph (DAG), but this call provides ordered arrays along with the information needed to parse the DAG. The returned object is fairly complex, so I will go through it step by step. Its schema, generated below, is explained in detail in the following sections.



In [ ]:

    
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment}")
text = r.json()

builder = SchemaBuilder()
builder.add_object(text)
print(json.dumps(builder.to_schema(), indent=2, sort_keys=False, ensure_ascii=False))

An actual object looks like this.



In [ ]:

    
print(json.dumps(text, indent=2, sort_keys=True, ensure_ascii=False))

Structure of the text object

The text object contains several top-level properties. It contains a license, with the copyright holder and collaborators automatically generated from the user information in the database. It provides a list of editors, which serves as a key for the editorId properties at all levels of the text object. It also provides the edition name and a unique manuscriptId.



In [ ]:

    
trimmedTextObject = copy.deepcopy(text)
del trimmedTextObject["textFragments"]

pprint(trimmedTextObject, depth=3) ## Print the copy with the text fragments removed

Nested objects

The textFragments property contains a list of text fragments. In this case we asked for only one, so there is only one entity in the list. Each text fragment entity has a list of lines, which provides the line name, the line id, and a list of signs in the line (the signs have been removed here to make it more readable).



In [ ]:

    
pprint(text["textFragments"][0], depth=3)

Lines and Sign interpretation metadata

Each line contains a list of signs, and each sign contains a list of interpretations and of possible next interpretations. The next interpretation ids can be used to reconstruct all possible reading orders of the signs. The order of signs in the list is the default ordering, which should match the order of the text on the manuscript itself. Each sign has one or more sign interpretations in its "signInterpretations" property. Each of these entities has an id and a "signInterpretation", which may be a character or may be empty if the sign interpretation concerns formatting (a space, the start of damage, etc.). The formatting metadata associated with a sign interpretation is in its "attributes" property. Each attribute has an id, a code, and possibly a numerical value. The codes are:

attribute_value_id	name	string_value	description
1	sign_type	LETTER	Type of char
2	sign_type	SPACE	Type of char
3	sign_type	POSSIBLE_VACAT	Type of char
4	sign_type	VACAT	Type of char
5	sign_type	DAMAGE	Type of char
6	sign_type	BLANK LINE	Type of char
7	sign_type	PARAGRAPH_MARKER	Type of char
8	sign_type	LACUNA	Type of char
9	sign_type	BREAK	Type of char
10	break_type	LINE_START	Defines a Metasign as marking of line
11	break_type	LINE_END	Defines a Metasign as marking of line
12	break_type	COLUMN_START	Defines a Metasign as marking of line
13	break_type	COLUMN_END	Defines a Metasign as marking of line
14	break_type	MANUSCRIPT_START	Defines a Metasign as marking of line
15	break_type	MANUSCRIPT_END	Defines a Metasign as marking of line
17	might_be_wider	TRUE	Set to true if the width of the sign might be wider than the given width
18	readability	INCOMPLETE_BUT_CLEAR	The trad. DJD marking of readability
19	readability	INCOMPLETE_AND_NOT_CLEAR	The trad. DJD marking of readability
20	is_reconstructed	TRUE	true if the letter is totally reconstructed (brackets are not part of the sign stream!)
21	editorial_flag	CONJECTURE	Opinions of the editor like conjecture
22	editorial_flag	SHOULD_BE_ADDED	Opinions of the editor like conjecture
23	editorial_flag	SHOULD_BE_DELETED	Opinions of the editor like conjecture
24	correction	OVERWRITTEN	Correction marks added by a scribe
25	correction	HORIZONTAL_LINE	Correction marks added by a scribe
26	correction	DIAGONAL_LEFT_LINE	Correction marks added by a scribe
27	correction	DIAGONAL_RIGHT_LINE	Correction marks added by a scribe
28	correction	DOT_BELOW	Correction marks added by a scribe
29	correction	DOT_ABOVE	Correction marks added by a scribe
30	correction	LINE_BELOW	Correction marks added by a scribe
31	correction	LINE_ABOVE	Correction marks added by a scribe
32	correction	BOXED	Correction marks added by a scribe
33	correction	ERASED	Correction marks added by a scribe
34	relative_position	ABOVE_LINE	Position relative to line context
35	relative_position	BELOW_LINE	Position relative to line context
36	relative_position	LEFT_MARGIN	Position relative to line context
37	relative_position	RIGHT_MARGIN	Position relative to line context
38	relative_position	MARGIN	Position relative to line context
39	relative_position	UPPER_MARGIN	Position relative to line context
40	relative_position	LOWER_MARGIN	Position relative to line context
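When working with the attribute ids in code, a small lookup makes them easier to read. The sketch below copies only a few rows from the table above into a dictionary; `ATTRIBUTE_VALUES` and `describe_attributes` are hypothetical names, and the dictionary would need to be extended to cover the remaining rows.

```python
# A small excerpt of the table above: attribute_value_id -> (name, string_value).
ATTRIBUTE_VALUES = {
    1: ("sign_type", "LETTER"),
    2: ("sign_type", "SPACE"),
    4: ("sign_type", "VACAT"),
    5: ("sign_type", "DAMAGE"),
    18: ("readability", "INCOMPLETE_BUT_CLEAR"),
    19: ("readability", "INCOMPLETE_AND_NOT_CLEAR"),
    20: ("is_reconstructed", "TRUE"),
}


def describe_attributes(attribute_ids):
    """Group the string values of known attribute ids by attribute name."""
    described = {}
    for attribute_id in attribute_ids:
        if attribute_id not in ATTRIBUTE_VALUES:
            continue  # id not in this excerpt; extend the dictionary as needed
        name, value = ATTRIBUTE_VALUES[attribute_id]
        described.setdefault(name, []).append(value)
    return described


print(describe_attributes([1, 19, 20]))
```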



In [ ]:

    
trimmedSigns = text["textFragments"][0]["lines"][0]["signs"]
for sign in trimmedSigns[0:10]:
    pprint(sign)
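The next interpretation ids turn the signs into a graph, and every path through that graph is one possible reading order. The sketch below enumerates those paths with a depth-first search. It works on a plain dictionary that maps each sign interpretation id to the ids that may follow it; building that dictionary from the API response is left out, because it depends on property names not discussed here. The ids in the example are invented.

```python
def reading_orders(next_ids, start, end):
    """Yield every path from `start` to `end` in a directed acyclic graph.

    `next_ids` maps a sign interpretation id to the ids that may follow it.
    Paths that dead-end before reaching `end` are dropped.
    """
    stack = [(start, [start])]
    while stack:
        current, path = stack.pop()
        if current == end:
            yield path
            continue
        # Reversed so that the first listed successor is explored first.
        for following in reversed(next_ids.get(current, [])):
            stack.append((following, path + [following]))


# Hypothetical ids: after 1, one reading continues with 2 and another with 3.
next_ids = {1: [2, 3], 2: [4], 3: [4], 4: []}
for order in reading_orders(next_ids, start=1, end=4):
    print(order)
```

The number of paths can grow quickly when many signs have alternative successors, so for long texts it may be more practical to follow a single chosen ordering than to enumerate them all.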

Serializing the data to a string

Perhaps the simplest output type for this data is a string representation, which can be built by iterating over the data. In this example we omit reconstructed text (i.e., text with an attribute having the id 20; see the check in readSigns below).



In [ ]:

    
def readFragments(text):
    formattedString = ""
    for textFragment in text['textFragments']:
        formattedString += f"\nText fragment {textFragment['textFragmentName']}:\n"
        formattedString = readLines(textFragment, formattedString)
        
    return formattedString

def readLines(textFragment, formattedString):
    for line in textFragment['lines']:
        formattedString += f"line {line['lineName']}:\n"
        formattedString = readSigns(line, formattedString) + "\n"
        
    return formattedString

def readSigns(line, formattedString):
    for sign in line['signs']:
        for signInterpretation in sign['signInterpretations']:
            attributes = [x['attributeValueId'] for x in signInterpretation['attributes']] ## Get a list of attribute ids
            if 20 not in attributes: ## let's omit reconstructions (attribute id 20)
                if 1 in attributes: ## id 1 marks a letter
                    formattedString += signInterpretation['signInterpretation']
                elif 2 in attributes: ## id 2 marks a space
                    formattedString += " "
                
    return formattedString
        
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}") ## Let's grab a bigger text
text = r.json()

print(readFragments(text))
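Instead of omitting reconstructed text, a serializer can keep it and mark it. The following variant is a sketch, not part of the SQE tooling: it wraps runs of reconstructed signs (attribute id 20) in square brackets, collects the pieces in a list, and joins them once at the end. It relies on the same property names as the code above; `line_to_string` and `make_sign` are hypothetical helpers, and the sample line is hand-made with Latin placeholder letters.

```python
LETTER, SPACE, RECONSTRUCTED = 1, 2, 20  # attribute value ids from the table above


def line_to_string(line):
    """Serialize one line, wrapping reconstructed runs in square brackets."""
    parts = []
    in_reconstruction = False
    for sign in line["signs"]:
        for interpretation in sign["signInterpretations"]:
            ids = {a["attributeValueId"] for a in interpretation["attributes"]}
            if LETTER in ids:
                char = interpretation["signInterpretation"]
            elif SPACE in ids:
                char = " "
            else:
                continue  # skip breaks, damage markers, and other metasigns
            reconstructed = RECONSTRUCTED in ids
            if reconstructed != in_reconstruction:
                parts.append("[" if reconstructed else "]")
                in_reconstruction = reconstructed
            parts.append(char)
    if in_reconstruction:
        parts.append("]")  # close a reconstruction that runs to the end of the line
    return "".join(parts)


def make_sign(char, *ids):
    """Build a sign with a single interpretation (hand-made sample data)."""
    attributes = [{"attributeValueId": i} for i in ids]
    return {"signInterpretations": [{"signInterpretation": char, "attributes": attributes}]}


sample_line = {
    "signs": [make_sign("a", 1), make_sign("b", 1, 20), make_sign("", 2, 20), make_sign("c", 1)]
}
print(line_to_string(sample_line))
```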

Serializing the data to a simpler object

We can also serialize the data to a simpler data structure for computational purposes.



In [ ]:

    
r = requests.get(f"{api}/editions/{selectedEdition}/text-fragments/{selectedTextFragment + 3}") ## Let's grab a bigger text
text = r.json()

simplifiedTextObject = {}
for textFragment in text['textFragments']:
    simplifiedTextObject[textFragment["textFragmentName"]] = []
    
    for line in textFragment['lines']:
        lineObject = {}
        lineObject[line['lineName']] = []
        
        for sign in line['signs']:
            for signInterpretation in sign['signInterpretations']:
                attributes = [x['attributeValueId'] for x in signInterpretation['attributes']] ## Get a list of attribute ids
                if 20 not in attributes: ## let's omit reconstructions (attribute id 20)
                    if 1 in attributes: ## id 1 marks a letter
                        lineObject[line['lineName']].append(signInterpretation['signInterpretation'])
                    elif 2 in attributes: ## id 2 marks a space
                        lineObject[line['lineName']].append(" ")
                        
        simplifiedTextObject[textFragment["textFragmentName"]].append(lineObject)

pprint(simplifiedTextObject, indent=2)
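Once the text is in this simpler shape, further processing becomes short. As one hedged example, the sketch below splits each line into words at the spaces. `words_by_line` is a hypothetical helper, and the sample input is hand-made in the same shape as simplifiedTextObject (fragment name, then a list of single-key line dictionaries).

```python
def words_by_line(simplified):
    """Map (fragment name, line name) to the list of words on that line."""
    words = {}
    for fragment_name, lines in simplified.items():
        for line_object in lines:
            for line_name, characters in line_object.items():
                # Join the characters, then split on runs of whitespace.
                words[(fragment_name, line_name)] = "".join(characters).split()
    return words


sample = {"frg. 1": [{"1": ["a", "b", " ", "c"]}, {"2": ["d", " ", " ", "e", "f"]}]}
print(words_by_line(sample))
```

Because reconstructed signs were omitted when the simplified object was built, some of the resulting "words" may be only the preserved part of a word.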