Parsing and Wrapping C with Clang and Julia

Clang is an open-source compiler built on the LLVM framework and targeting C, C++, and Objective-C (LLVM is also the JIT backend for Julia). Due to a highly modular design, Clang has in recent years become the core of a growing number of projects utilizing pieces of the compiler, such as tools for source-to-source translation, static analysis and security evaluation, and editor tools for code completion, formatting, etc.

While LLVM and Clang are written in C++, the Clang project maintains a C-exported interface called "libclang" which provides access to the abstract syntax tree and type representations. Thanks to the ubiquity of support for C calling conventions, a number of languages have utilized libclang as a basis for tooling related to C and C++.

The Clang.jl Julia package wraps libclang, provides a small convenience API for Julia-style programming, and provides a C-to-Julia wrapper generator built on libclang functionality.

In this post I will introduce Clang.jl and explore both C parsing and C/Julia wrapper generation. All code in this example is available for download (example.h contains all of the C code referenced herein).

Setting up

The examples in this post require Julia 0.2 and libclang >= 3.1.

Clang.jl should be installed using the Julia package manager:


In [1]:
Pkg.add("Clang")


INFO: Nothing to be done

Please see the Clang.jl README for dependency information.

Example 1: Printing Struct Fields

To motivate the discussion with a succinct example, consider this struct:

struct ExStruct {
    int    kind;
    char*  name;
    float* data;
};

Parsing and querying the fields of this struct requires just a few lines of code:


In [2]:
using Clang.cindex

top = cindex.parse_header("example.h")

st_cursor = cindex.search(top, "ExStruct")[1]

for c in children(st_cursor)
    println("Cursor: ", c, "\n  Name: ", name(c), "\n  Type: ", cu_type(c))
end


Cursor: CLCursor (FieldDecl) kind
  Name: kind
  Type: CLType (IntType) 
Cursor: CLCursor (FieldDecl) name
  Name: name
  Type: CLType (Pointer) 
Cursor: CLCursor (FieldDecl) data
  Name: data
  Type: CLType (Pointer) 

AST Representation

Let's examine the example above, starting with the variable top:


In [3]:
top


Out[3]:
CLCursor (TranslationUnit) example.h

A TranslationUnit is the entry point to the libclang AST. In the example above, top is the TranslationUnit CLCursor for the parsed file example.h. The libclang AST is represented as a directed acyclic graph of Cursor nodes carrying three pieces of essential information:

Kind: purpose of cursor node
Type: type of the object represented by cursor
Children: list of child nodes

In Clang.jl the cursor type is encapsulated by a Julia type deriving from the abstract type CLCursor. Under the hood, libclang represents each cursor (CXCursor) kind and type (CXType) as an enum value. These enum values are used to automatically map all CXCursor and CXType objects to Julia types. Thus, it is possible to write multiple-dispatch methods against CLCursor or CLType variables.
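The mapping from enum values to Julia types can be illustrated with a small self-contained sketch. Everything in it is hypothetical: the Toy* types, the TOY_KINDS table, and its integer codes are stand-ins invented for illustration, and are neither Clang.jl's definitions nor libclang's enum values. The sketch uses current Julia (1.x) syntax rather than the 0.2 syntax used elsewhere in this post.

```julia
# Hypothetical stand-ins for cursor types; not Clang.jl's real definitions.
abstract type ToyCursor end

struct ToyStructDecl <: ToyCursor
    name::String
end

struct ToyFieldDecl <: ToyCursor
    name::String
end

# Illustrative kind codes only; these are NOT libclang's enum values.
const TOY_KINDS = Dict(1 => ToyStructDecl, 2 => ToyFieldDecl)

# Construct the Julia object whose type corresponds to the integer kind code.
make_cursor(kind::Int, name::String) = TOY_KINDS[kind](name)

# Multiple dispatch: a generic fallback plus one specialized method.
describe(c::ToyCursor)    = "generic cursor " * c.name
describe(c::ToyFieldDecl) = "field " * c.name

println(describe(make_cursor(1, "ExStruct")))
println(describe(make_cursor(2, "kind")))
```

Because each kind becomes its own Julia type, specialized behavior is added by writing another method rather than by branching on the enum value.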

Example 1 demonstrates two different ways of accessing the child nodes of a given cursor. First, the children function returns an iterator over the child nodes of the given cursor:

for c in children(st_cursor)

And here, the search function returns a list of child node(s) matching the given name:

st_cursor = cindex.search(top, "ExStruct")[1]
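The two access patterns can be mimicked on a toy tree to show what they do. This is a sketch under stated assumptions: ToyNode, toy_children, and toy_search are hypothetical names, and the search here only inspects direct children, following the description above; it is not how Clang.jl implements these functions. Current Julia (1.x) syntax is used.

```julia
# Hypothetical node type standing in for a cursor.
struct ToyNode
    kind::Symbol
    name::String
    children::Vector{ToyNode}
end

# Convenience constructor for leaf nodes.
ToyNode(kind::Symbol, name::String) = ToyNode(kind, name, ToyNode[])

# Iterate over direct children.
toy_children(node::ToyNode) = node.children

# Direct children whose name matches.
toy_search(node::ToyNode, name::String) =
    [c for c in toy_children(node) if c.name == name]

# A second method of the same function filters by kind instead.
toy_search(node::ToyNode, kind::Symbol) =
    [c for c in toy_children(node) if c.kind == kind]

fields = [ToyNode(:FieldDecl, "kind"),
          ToyNode(:FieldDecl, "name"),
          ToyNode(:FieldDecl, "data")]
st  = ToyNode(:StructDecl, "ExStruct", fields)
top = ToyNode(:TranslationUnit, "example.h", [st])

st_node = toy_search(top, "ExStruct")[1]
for c in toy_children(st_node)
    println(c.kind, " ", c.name)
end
```

Both search methods return an array, which is why the result is indexed with [1] to pick a single node.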

Type Representation

Example 1 also demonstrates querying of the Type associated with a given cursor using the cu_type helper function. In the output:

Cursor: CLCursor (FieldDecl) kind
  Name: kind
  Type: CLType (IntType) 
Cursor: CLCursor (FieldDecl) name
  Name: name
  Type: CLType (Pointer) 
Cursor: CLCursor (FieldDecl) data
  Name: data
  Type: CLType (Pointer) 

Each FieldDecl cursor has an associated CLType object whose identity reflects the type of the corresponding struct member. Note the difference between the representation of the kind field and that of the name and data fields: kind is represented directly as an IntType object, but name and data are represented as Pointer CLTypes. As explored in the next section, the target type of a Pointer can be queried to recover the full char* and float* types of these members. User-defined types are captured using a similar scheme.

Example 2: Function Arguments and Types

To further explore type representations, consider the following function (included in example.h):

void* ExFunction (int kind, char* name, float* data) {
    struct ExStruct st;
    st.kind = kind;
    st.name = name;
    st.data = data;
}

To find the cursor for this function declaration, we use the overloaded version of cindex.search to retrieve nodes with type FunctionDecl, and select the final one in the list:


In [4]:
fdecl = cindex.search(top, FunctionDecl)[end]
fdecl_children = [c for c in children(fdecl)]


Out[4]:
4-element Array{Any,1}:
 CLCursor (ParmDecl) kind
 CLCursor (ParmDecl) name
 CLCursor (ParmDecl) data
 CLCursor (CompoundStmt) 

The first three children are ParmDecl cursors whose names match the arguments in the function signature. Checking the types of the ParmDecl cursors shows that they mirror the signature as well:


In [5]:
[cu_type(t) for t in fdecl_children[1:3]]


Out[5]:
3-element Array{Any,1}:
 CLType (IntType) 
 CLType (Pointer) 
 CLType (Pointer) 

Finally, retrieving the target type of each Pointer argument confirms that these cursors carry the argument types declared in the function signature (char and float):


In [6]:
[cindex.pointee_type(cu_type(t)) for t in fdecl_children[2:3]]


Out[6]:
2-element Array{Any,1}:
 CLType (Char_S) 
 CLType (Float)  
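Resolving a pointer to its target type is also the step a wrapper generator needs in order to translate C argument types into Julia types. The following is a minimal sketch of that idea using hypothetical Toy* stand-ins rather than Clang.jl's CLType hierarchy; toy_pointee plays the role of cindex.pointee_type, and julia_type is an invented helper. It is written in current Julia (1.x) syntax.

```julia
# Hypothetical stand-ins for a few CLType subtypes.
abstract type ToyType end
struct ToyInt   <: ToyType end
struct ToyChar  <: ToyType end
struct ToyFloat <: ToyType end
struct ToyPointer <: ToyType
    pointee::ToyType
end

# Analogue of querying a pointer's target type.
toy_pointee(t::ToyPointer) = t.pointee

# Map each toy C type to the corresponding Julia C-interop type.
julia_type(::ToyInt)   = Cint
julia_type(::ToyChar)  = Cchar
julia_type(::ToyFloat) = Cfloat
# Pointers recurse into their target, so nested pointers also work.
julia_type(t::ToyPointer) = Ptr{julia_type(toy_pointee(t))}

# Argument types of ExFunction(int, char*, float*).
args = [ToyInt(), ToyPointer(ToyChar()), ToyPointer(ToyFloat())]
println([julia_type(a) for a in args])
```

The recursive method for ToyPointer is the design point: one definition covers char*, float*, and deeper cases such as char**.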

Example 3: Printing Indented Cursor Hierarchy

As a closing example, here is a simple, indented AST printer using CLType- and CLCursor-related functions, and utilizing various aspects of Julia's type system.


In [7]:
printind(ind::Int, st...)              = println(join([repeat(" ", 2*ind), st...]))

printobj(cursor::cindex.CLCursor)      = printobj(0, cursor)
printobj(t::cindex.CLType)             = string(typeof(t), " ", spelling(t))
printobj(t::cindex.IntType)            = t
printobj(t::cindex.Pointer)            = cindex.pointee_type(t)
printobj(ind::Int, t::cindex.CLType)   = printind(ind, printobj(t))

function printobj(ind::Int, cursor::Union(cindex.FieldDecl, cindex.ParmDecl))
    printind(ind+1, typeof(cursor), " ", printobj(cu_type(cursor)), " ", name(cursor))
end

function printobj(ind::Int, node::Union(cindex.CLCursor,
                                        cindex.StructDecl,
                                        cindex.CompoundStmt,
                                        cindex.FunctionDecl,
                                        cindex.BinaryOperator) )
    printind(ind, " ", typeof(node), " ", name(node))
    for c in children(node)
        printobj(ind + 1, c)
    end
end


Out[7]:
printobj (generic function with 7 methods)

In [8]:
printobj(top)


 TranslationUnit example.h
   TypedefDecl __int128_t
   TypedefDecl __uint128_t
   TypedefDecl __builtin_va_list
     TypeRef __va_list_tag
   StructDecl ExStruct
      FieldDecl CLType (IntType)  kind
      FieldDecl CLType (Char_S)  name
      FieldDecl CLType (Float)  data
   FunctionDecl ExFunction(int, char *, float *)
      ParmDecl CLType (IntType)  kind
      ParmDecl CLType (Char_S)  name
      ParmDecl CLType (Float)  data
     CompoundStmt 
       LastStmt 
         VarDecl st
           TypeRef struct ExStruct
       BinaryOperator 
         MemberRefExpr kind
           DeclRefExpr st
         FirstExpr kind
           DeclRefExpr kind
       BinaryOperator 
         MemberRefExpr name
           DeclRefExpr st
         FirstExpr name
           DeclRefExpr name
       BinaryOperator 
         MemberRefExpr data
           DeclRefExpr st
         FirstExpr data
           DeclRefExpr data

Note that a generic printobj function has been defined for the abstract CLType and CLCursor types, and multiple dispatch is used to define the printers for various specific types needing custom behavior. In particular, the following function handles all cursor types for which recursive printing of child nodes is required:

function printobj(ind::Int, node::Union(cindex.CLCursor,
                                        cindex.StructDecl,
                                        cindex.CompoundStmt,
                                        cindex.FunctionDecl,
                                        cindex.BinaryOperator) )

Parsing Summary

As discussed above, there are several key aspects of the Clang.jl/libclang API:

  • tree of Cursor nodes representing the AST; nodes have unique children.
  • each Cursor node has a Julia type identifying the syntactic construct represented by the node.
  • each node also has an associated CLType referencing either intrinsic or user-defined datatypes.

There are a number of details omitted from this post, especially concerning the full variety of CLCursor and CLType representations available via libclang. For further information, please see the Clang.jl documentation and the libclang documentation.

C to Julia Wrapper Generation

The Clang.jl repository also hosts a Julia wrapper generator built on the functionality introduced above. The wrapper generator supports automatic translation of:

  • Functions: Generates Julia stub and corresponding ccall.
  • Function arguments: Translated to Julia types.
  • Constants: Translated to Julia const declarations.
  • Preprocessor constants: Also translated to const declarations.
  • Typedef: Translated to Julia typealias.
  • Structs: Partial support for struct with intrinsically-typed fields (pointers work, but no fixed-size arrays, no nested structs, no unions).
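To make the list above concrete, here is a hand-written sketch of the kind of Julia code each item corresponds to. It is not output from the generator: ExKind and EX_MAX_NAME are hypothetical names, and strlen from the C standard library stands in for a wrapped function so that the example runs without compiling example.h into a shared library. It is written in current Julia (1.x) syntax, where a type alias is a plain const assignment rather than the typealias keyword of Julia 0.2.

```julia
# Typedef -> type alias (hypothetical name).
const ExKind = Cint

# Constant or preprocessor constant -> const declaration (hypothetical value).
const EX_MAX_NAME = 64

# Struct with intrinsically typed fields; pointers map to Ptr{T}.
struct ExStruct
    kind::Cint
    name::Ptr{Cchar}
    data::Ptr{Cfloat}
end

# Function -> Julia stub around a ccall. strlen is only a stand-in here;
# a generated stub would name the wrapped library's own symbol instead.
function c_strlen(s::String)
    Int(ccall(:strlen, Csize_t, (Cstring,), s))
end

println(c_strlen("ExStruct"))
println(fieldnames(ExStruct))
```

Because ExStruct contains only C-compatible field types, isbitstype(ExStruct) is true, which is what allows such a struct to be passed across the C boundary by value.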

The wrapper generator has two public functions: init and wrap_c_headers:

  • init: creates a WrapContext instance capturing all options for a given wrapping.
  • wrap_c_headers: runs the wrapping on a list of header files.

init is the centerpiece of the API, accepting options and flags to be passed to Clang. For successful parsing, the most important options are Clang-related: clang_includes (header search directories - critical!), and clang_args. The output_file argument will set the name of the generated file (defaults to the header name), and the common_file argument sets the name of the file containing typealias, enum, and constant declarations (defaults to common_file; these declarations are printed before function declarations).

To allow fine-grained customization of output, init also accepts three callback functions:

header_library: returns the name of the shared library for a given header filename [mandatory, but often a constant]
header_wrapped: takes a (headerfile, cursorname) pair and returns a Bool indicating whether the pair should be wrapped [default: true]
header_outputfile: returns the output filename for a given header [default: header name]
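As a sketch of what such callbacks might look like, the three functions below are ordinary Julia functions with the argument shapes described above. The filtering policy (wrap only names from jpeglib.h that begin with "jpeg_") and the output naming rule are assumptions chosen for illustration, not defaults of the wrapper generator. Current Julia (1.x) syntax is used.

```julia
# Shared library name for a header; here a constant, as is often the case.
header_library(headerfile::AbstractString) = "libjpeg"

# Hypothetical policy: wrap only "jpeg_"-prefixed names found in jpeglib.h.
header_wrapped(headerfile::AbstractString, cursorname::AbstractString) =
    basename(headerfile) == "jpeglib.h" && startswith(cursorname, "jpeg_")

# Hypothetical naming rule: foo.h -> foo.jl
header_outputfile(headerfile::AbstractString) =
    splitext(basename(headerfile))[1] * ".jl"

hdr = "/usr/include/jpeglib.h"
println(header_library(hdr))
println(header_wrapped(hdr, "jpeg_std_error"))
println(header_wrapped(hdr, "FILE"))
println(header_outputfile(hdr))
```

A filter like header_wrapped is useful for keeping declarations pulled in from system headers out of the generated file.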

To demonstrate this functionality, the following code will create a wrapper for the libjpeg library (typically /usr/include/jpeglib.h on Linux systems):


In [9]:
using Clang.wrap_c

context = wrap_c.init(; output_file="libjpeg.jl", header_library=x->"libjpeg", common_file="libjpeg_h.jl", clang_diagnostics=true)
context.options.wrap_structs = true
wrap_c.wrap_c_headers(context, ["/usr/include/jpeglib.h"])


WRAPPING HEADER: /usr/include/jpeglib.h
/usr/include/jpeglib.h:791:3: error: unknown type name 'size_t'
/usr/include/jpeglib.h:803:3: error: unknown type name 'size_t'
/usr/include/jpeglib.h:835:5: error: unknown type name 'size_t'
/usr/include/jpeglib.h:837:10: error: unknown type name 'size_t'
/usr/include/jpeglib.h:990:24: error: unknown type name 'size_t'
/usr/include/jpeglib.h:992:19: error: unknown type name 'size_t'
/usr/include/jpeglib.h:999:57: error: unknown type name 'FILE'
/usr/include/jpeglib.h:1000:58: error: unknown type name 'FILE'

WARNING: Skipping unnamed StructDecl
WARNING: Skipping unnamed StructDecl
WARNING: Skipping unnamed StructDecl
WARNING: Skipping unnamed StructDecl
WARNING: Skipping empty struct: "jpeg_marker_struct"
WARNING: Skipping empty struct: "jpeg_compress_struct"
WARNING: Skipping empty struct: "jpeg_decompress_struct"
WARNING: Skipping empty struct: "jvirt_sarray_control"
WARNING: Skipping empty struct: "jvirt_barray_control"

While that was quite easy, note that this is a raw wrapping of a C library, and bad data or incorrect arguments will often cause segfaults just like in C. For this reason, such automated wrapper generation is just one step in the creation of a safe and "Julian" API on top of a C library.

Acknowledgement

Eli Bendersky's post Parsing C++ in Python with Clang has been an extremely helpful reference.