This tutorial is designed to help you start using uproot. Unlike the reference documentation, which defines every parameter of every function, this tutorial provides introductory examples to help you learn how to use them.
The original tutorial has been archived—this version was written in June 2019 in response to feedback from a series of tutorials I presented early this year and common questions in the GitHub issues. The new tutorial is executable on Binder and may be read in any order, though it has to be executed from top to bottom because some variables are reused.
Uproot is a Python package; it is pip and conda-installable, and it only depends on other Python packages. Although it is similar in function to root_numpy and root_pandas, it does not compile into ROOT and therefore avoids issues in which the version used in compilation differs from the version encountered at runtime.
In short, you should never see a segmentation fault.
Uproot is strictly concerned with file I/O only—all other functionality is handled by other libraries:
In the past year, uproot has become one of the most widely used Python packages made for particle physics, with users in all four LHC experiments, theory, neutrino experiments, XENON-nT (dark matter direct detection), MAGIC (gamma ray astronomy), and IceCube (neutrino astronomy).
uproot.open is the entry point for reading a single file.
It takes a local filename path or a remote http:// or root:// URL. (HTTP requires the Python requests library and XRootD requires pyxrootd, both of which have to be explicitly pip-installed if you installed uproot with pip, but are automatically installed if you installed uproot with conda.)
In [1]:
import uproot
file = uproot.open("https://scikit-hep.org/uproot/examples/nesteddirs.root")
file
Out[1]:
uproot.open returns a ROOTDirectory, which behaves like a Python dict; it has keys(), values(), and key-value access with square brackets.
In [2]:
file.keys()
Out[2]:
In [3]:
file["one"]
Out[3]:
Subdirectories also have type ROOTDirectory, so they behave like Python dicts, too.
In [4]:
file["one"].keys()
Out[4]:
In [5]:
file["one"].values()
Out[5]:
What's the b before each object name? Python 3 distinguishes between bytestrings and encoded strings. ROOT object names have no encoding, such as Latin-1 or Unicode, so uproot presents them as raw bytestrings. However, if you enter a Python string (no b) and it matches an object name (interpreted as plain ASCII), it will count as a match, as "one" does above.
What's the ;1 after each object name? ROOT objects are versioned with a "cycle number." If multiple objects are written to the ROOT file with the same name, they will have different cycle numbers, with the largest value being last. If you don't specify a cycle number, you'll get the latest one.
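The two naming rules above (bytestring names and cycle numbers) can be illustrated without uproot. The following is a minimal sketch, not uproot's implementation: split_cycle and lookup are hypothetical helpers, and the toy directory is a plain dict with made-up contents.

```python
def split_cycle(key):
    """Split b"name;cycle" into (name, cycle); cycle is None if absent."""
    name, sep, cycle = key.rpartition(b";")
    if sep and cycle.isdigit():
        return name, int(cycle)
    return key, None


def lookup(objects, request):
    """Find an object by name, taking the highest cycle unless one is given."""
    if isinstance(request, str):
        request = request.encode("ascii")  # plain strings are matched as ASCII
    name, cycle = split_cycle(request)
    # Collect every stored cycle of the requested name.
    cycles = {}
    for key, value in objects.items():
        key_name, key_cycle = split_cycle(key)
        if key_name == name:
            cycles[key_cycle] = value
    if not cycles:
        raise KeyError(request)
    if cycle is None:
        cycle = max(cycles)  # no cycle requested: use the latest one
    return cycles[cycle]


# Hypothetical directory contents: "hist" was written twice.
toy_directory = {b"hist;1": "first version", b"hist;2": "second version", b"tree;1": "a tree"}

latest = lookup(toy_directory, "hist")        # string, no cycle
explicit = lookup(toy_directory, b"hist;1")   # bytestring with a cycle
```

Here latest resolves to the object with the largest cycle number, while explicit asks for an older cycle by name.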
This file is deeply nested, so while you could find the TTree with
In [6]:
file["one"]["two"]["tree"]
Out[6]:
you can also find it using a directory path, with slashes.
In [7]:
file["one/two/tree"]
Out[7]:
Here are a few more tricks for finding your way around a file:
The keys(), values(), and items() methods have allkeys(), allvalues(), and allitems() variants that recursively search through all subdirectories.
Here's how you would search the subdirectories to find all TTrees:
In [8]:
file.allkeys(filterclass=lambda cls: issubclass(cls, uproot.tree.TTreeMethods))
Out[8]:
Or get a Python dict of them:
In [9]:
all_ttrees = dict(file.allitems(filterclass=lambda cls: issubclass(cls, uproot.tree.TTreeMethods)))
all_ttrees
Out[9]:
Be careful: Python 3 is not as forgiving about matching key names. all_ttrees is a plain Python dict, so the key must be a bytestring and must include the cycle number.
In [10]:
all_ttrees[b"one/two/tree;1"]
Out[10]:
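To see why the flattened dict needs exact keys, here is a small sketch of a recursive search over nested dicts. It is only an illustration under assumptions: this allitems is a hypothetical stand-in written for this example (not uproot's method), the nested layout is made up, and strings stand in for TTree objects.

```python
def allitems(directory, prefix=b""):
    """Yield (path, object) pairs, descending into nested dicts."""
    for key, value in directory.items():
        yield prefix + key, value
        if isinstance(value, dict):
            # Subdirectory: drop its cycle number and extend the path.
            name = key.rpartition(b";")[0]
            yield from allitems(value, prefix + name + b"/")


# Hypothetical nested layout; strings stand in for TTree objects.
nested = {
    b"one;1": {b"two;1": {b"tree;1": "tree A"}, b"tree;1": "tree B"},
    b"three;1": {b"tree;1": "tree C"},
}

flat = dict(allitems(nested))
trees = {path: obj for path, obj in flat.items() if not isinstance(obj, dict)}
```

Because flat is an ordinary dict, flat[b"one/two/tree;1"] works but the string "one/two/tree" is not a key.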
Objects in ROOT files can be uncompressed, compressed with ZLIB, compressed with LZMA, or compressed with LZ4. Uproot picks the right decompressor and gives you the objects transparently: you don't have to specify anything. However, if an object is compressed with LZ4 and you don't have the lz4 library installed, you'll get an error with installation instructions in the message. (It is automatically installed if you installed uproot with conda.) ZLIB is part of the Python Standard Library, and LZMA is part of the Python 3 Standard Library, so you won't get error messages about these except for LZMA in Python 2 (for which there is backports.lzma, automatically installed if you installed uproot with conda).
The ROOTDirectory class has a compression property that tells you the compression algorithm and level associated with this file,
In [11]:
file.compression
Out[11]:
but any object can be compressed with any algorithm at any level—this is only the default compression for the file. Some ROOT files are written with each TTree branch compressed using a different algorithm and level.
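The idea of choosing a decompressor per object can be sketched with the standard library alone. This is an illustration, not ROOT's on-disk format: the two-byte tags, pack, and unpack are hypothetical names chosen for this example, and LZ4 is left out because it is not in the standard library.

```python
import lzma
import zlib

# Hypothetical two-byte tags chosen for this sketch.
COMPRESSORS = {b"ZL": zlib.compress, b"XZ": lzma.compress}
DECOMPRESSORS = {b"ZL": zlib.decompress, b"XZ": lzma.decompress}


def pack(payload, tag=None):
    """Prefix the payload with a tag; b"--" marks uncompressed data."""
    if tag is None:
        return b"--" + payload
    return tag + COMPRESSORS[tag](payload)


def unpack(blob):
    """Read the tag and pick the matching decompressor."""
    tag, body = blob[:2], blob[2:]
    if tag == b"--":
        return body
    return DECOMPRESSORS[tag](body)


payload = b"physics data " * 100
blobs = [pack(payload), pack(payload, b"ZL"), pack(payload, b"XZ")]
restored = [unpack(blob) for blob in blobs]
```

The caller of unpack never names an algorithm; each blob carries enough information to be decoded on its own, which is why objects in one file can differ.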
TTrees are special objects in ROOT files: they contain most of the physics data. Uproot presents TTrees as subclasses of TTreeMethods.
(Why subclass? Different ROOT files can have different versions of a class, so uproot generates Python classes to fit the data, as needed. All TTrees inherit from TTreeMethods so that they get the same data-reading methods.)
In [12]:
events = uproot.open("https://scikit-hep.org/uproot/examples/Zmumu.root")["events"]
events
Out[12]:
Although TTreeMethods objects behave like Python dicts of TBranchMethods objects, the easiest way to browse a TTree is by calling its show() method, which prints the branches and their interpretations as arrays.
In [13]:
events.keys()
Out[13]:
In [14]:
events.show()
Basic information about the TTree, such as its number of entries, is available as properties.
In [15]:
events.name, events.title, events.numentries
Out[15]:
ROOT files contain objects internally referred to via TKeys (dict-like lookup in uproot). TTree organizes data in TBranches, and uproot interprets one TBranch as one array, either a Numpy array or an awkward array. TBranch data are stored in chunks called TBaskets, though uproot hides this level of granularity unless you dig into the details.
The bulk data in a TTree are not read until requested. There are many ways to do that:
Let's start with the simplest.
In [16]:
a = events.array("E1")
a
Out[16]:
Since array is singular, you specify one branch name and get one array back. This is a Numpy array of 8-byte floating point numbers, the Numpy dtype specified by the "E1" branch's interpretation.
In [17]:
events["E1"].interpretation
Out[17]:
We can use this array in Numpy calculations; see the Numpy documentation for details.
In [18]:
import numpy
numpy.log(a)
Out[18]:
Numpy arrays are also the standard container for entering data into machine learning frameworks; see this Keras introduction, PyTorch introduction, TensorFlow introduction, or Scikit-Learn introduction to see how to put Numpy arrays to work in machine learning.
The TBranchMethods.array method is the same as TTreeMethods.array except that you don't have to specify the TBranch name (naturally). Sometimes one is more convenient, sometimes the other.
In [19]:
events.array("E1"), events["E1"].array()
Out[19]:
The plural arrays method is different. Whereas singular array could only return one array, plural arrays takes a list of names (possibly including wildcards) and returns them all in a Python dict.
In [20]:
events.arrays(["px1", "py1", "pz1"])
Out[20]:
In [21]:
events.arrays(["p[xyz]*"])
Out[21]:
As with all ROOT object names, the TBranch names are bytestrings (prepended by b). If you know the encoding or it doesn't matter ("ascii" and "utf-8" are generic), pass a namedecode to get keys that are strings.
In [22]:
events.arrays(["p[xyz]*"], namedecode="utf-8")
Out[22]:
These array-reading functions have many parameters, but most of them have the same names and meanings across all the functions. Rather than discuss all of them here, they'll be presented in context in sections on special features below.
Every time you ask for arrays, uproot goes to the file and re-reads them. For especially large arrays, this can take a long time.
For quicker access, uproot's array-reading functions have a cache parameter, which is an entry point for you to manage your own cache. The cache only needs to behave like a dict (many third-party Python caches do).
In [23]:
mycache = {}
# first time: reads from file
events.arrays(["p[xyz]*"], cache=mycache);
# any other time: reads from cache
events.arrays(["p[xyz]*"], cache=mycache);
In this example, the cache is a simple Python dict. Uproot has filled it with unique ID → array pairs, and it uses the unique ID to identify an array that it has previously read. You can see that it's full by looking at those keys:
In [24]:
mycache
Out[24]:
though they're not very human-readable.
If you're running out of memory, you could manually clear your cache by simply clearing the dict.
In [25]:
mycache.clear()
mycache
Out[25]:
Now the same line of code reads from the file again.
In [26]:
# not in cache: reads from file
events.arrays(["p[xyz]*"], cache=mycache);
This manual process of clearing the cache when you run out of memory is not very robust. What you want instead is a dict-like object that drops elements on its own when memory is scarce.
Uproot has an ArrayCache class for this purpose, though it's a thin wrapper around the third-party cachetools library. Whereas cachetools drops old data from cache when a maximum number of items is reached, ArrayCache drops old data when the data usage reaches a limit, specified in bytes.
In [27]:
mycache = uproot.ArrayCache("100 kB")
events.arrays("*", cache=mycache);
len(mycache), len(events.keys())
Out[27]:
With a limit of 100 kB, only 6 of the 20 arrays fit into cache; the rest have been evicted.
All data sizes in uproot are specified either as an integer number of bytes or as a string with the appropriate unit (interpreted as powers of 1024, not 1000).
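The difference between limiting a cache by item count and limiting it by bytes can be shown with a short dict-like class. This is a sketch under stated assumptions, not ArrayCache's implementation: ByteLimitedCache is a hypothetical name, eviction is least-recently-used, and the size of a value is taken to be len(value), which is correct for the bytes objects used here.

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class ByteLimitedCache(MutableMapping):
    """Dict-like cache that evicts old items once a byte budget is exceeded."""

    def __init__(self, limitbytes):
        self.limitbytes = limitbytes
        self.usedbytes = 0
        self._data = OrderedDict()

    def __getitem__(self, key):
        value = self._data[key]
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def __setitem__(self, key, value):
        if key in self._data:
            del self[key]                # replace: release the old size first
        self._data[key] = value
        self.usedbytes += len(value)
        # Evict from the least recently used end until within budget.
        while self.usedbytes > self.limitbytes and self._data:
            del self[next(iter(self._data))]

    def __delitem__(self, key):
        self.usedbytes -= len(self._data.pop(key))

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


cache = ByteLimitedCache(100)            # budget: 100 bytes
for i in range(5):
    cache[i] = bytes(40)                 # five 40-byte "arrays"
```

Only the most recent items that fit in the budget remain; anything that behaves like a dict in this way could be passed wherever a cache is accepted.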
The fact that any dict-like object may be a cache opens many possibilities. If you're struggling with a script that takes a long time to load data, then crashes, you may want to try a process-independent cache like memcached. If you have a small, fast disk, you may want to consider diskcache to temporarily hold arrays from ROOT files on the big, slow disk.
All of the array-reading functions have a cache parameter to accept a cache object. This is the high-level cache, which caches data after it has been fully interpreted. These functions also have a basketcache parameter to cache data after reading and decompressing baskets, but before interpretation as high-level arrays. The main purpose of this is to avoid reading TBaskets twice when an iteration step falls in the middle of a basket (see below). There is also a keycache for caching ROOT's TKey objects, which use negligible memory but would be a bottleneck to re-read when TBaskets are provided by a basketcache.
For more on these high and mid-level caching parameters, see reference documentation.
At the lowest level of abstraction, raw bytes are cached by the HTTP and XRootD remote file readers. You can control the remote file memory use with uproot.HTTPSource.defaults["limitbytes"] and uproot.XRootDSource.defaults["limitbytes"], either by globally setting these parameters before opening a file, or by passing them to uproot.open through the limitbytes parameter.
In [28]:
# default remote file caches in MB
uproot.HTTPSource.defaults["limitbytes"] / 1024**2, uproot.XRootDSource.defaults["limitbytes"] / 1024**2
Out[28]:
If you want to limit this cache to less than the default chunkbytes of 1 MB, be sure to make the chunkbytes smaller, so that it's able to load at least one chunk!
In [29]:
uproot.open("https://scikit-hep.org/uproot/examples/Zmumu.root", limitbytes="100 kB", chunkbytes="10 kB")
Out[29]:
By default (unless localsource is overridden), local files are memory-mapped, so the operating system manages its byte-level cache.
If you call TBranchMethods.array, TTreeMethods.array, or TTreeMethods.arrays, uproot reads the file or cache immediately and returns an in-memory array. For exploratory work or to control memory usage, you might want to let the data be read on demand.
The TBranch.lazyarray, TTreeMethods.lazyarray, TTreeMethods.lazyarrays, and uproot.lazyarrays functions take most of the same parameters but return lazy array objects, rather than Numpy arrays.
In [30]:
data = events.lazyarrays("*")
data
Out[30]:
This ChunkedArray represents all the data in the file in chunks specified by ROOT's internal baskets (specifically, the places where the baskets align, called "clusters"). Each chunk contains a VirtualArray, which is read when any element from it is accessed.
In [31]:
data = events.lazyarrays(entrysteps=500) # chunks of 500 events each
dataE1 = data["E1"]
dataE1
Out[31]:
Requesting "E1" through all the chunks and printing it (above) has caused the first and last chunks of the array to be read, because that's all that got written to the screen. (See the ...?)
In [32]:
[chunk.ismaterialized for chunk in dataE1.chunks]
Out[32]:
These arrays can be used with Numpy's universal functions (ufuncs), which are the mathematical functions that perform elementwise mathematics.
In [33]:
numpy.log(dataE1)
Out[33]:
Now all of the chunks have been read, because the values were needed to compute log(E1) for all E1.
In [34]:
[chunk.ismaterialized for chunk in dataE1.chunks]
Out[34]:
(Note: only ufuncs recognize these lazy arrays because Numpy provides a mechanism to override ufuncs but a similar mechanism for high-level functions is still in development. To turn lazy arrays into Numpy arrays, pass them to the Numpy constructor, as shown below. This causes the whole array to be loaded into memory and to be stitched together into a contiguous whole.)
In [35]:
numpy.array(dataE1)
Out[35]:
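The read-on-access behavior described in this section can be mimicked in a few lines of plain Python. This is a conceptual sketch only: LazyChunk and Chunked are hypothetical classes written for this example, not the awkward-array ChunkedArray and VirtualArray, and the "file read" is simulated with range.

```python
class LazyChunk:
    """Holds a function that produces the data; calls it on first access."""

    def __init__(self, generate, length):
        self._generate = generate
        self._length = length
        self._array = None

    def __len__(self):
        return self._length

    @property
    def ismaterialized(self):
        return self._array is not None

    def materialize(self):
        if self._array is None:
            self._array = self._generate()
        return self._array


class Chunked:
    """Presents a list of chunks as one sequence."""

    def __init__(self, chunks):
        self.chunks = chunks

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        for chunk in self.chunks:
            if index < len(chunk):
                return chunk.materialize()[index]
            index -= len(chunk)


def make_reader(start, stop):
    # Stand-in for reading entries [start, stop) from a file.
    return lambda: list(range(start, stop))


data = Chunked([LazyChunk(make_reader(i, i + 4), 4) for i in (0, 4, 8)])
first, last = data[0], data[-1]          # touches only the outer chunks
status = [chunk.ismaterialized for chunk in data.chunks]
```

Accessing the first and last elements loads two of the three chunks; the middle chunk stays unread until something asks for one of its elements.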
There's a lazy version of each of the array-reading functions in TTreeMethods and TBranchMethods, but there's also module-level uproot.lazyarray and uproot.lazyarrays. These functions let you make a lazy array that spans many files.
These functions may be thought of as alternatives to ROOT's TChain: a TChain presents many files as though they were a single TTree, and a file-spanning lazy array presents many files as though they were a single array. See Iteration below as a more explicit TChain alternative.
In [36]:
data = uproot.lazyarray(
# list of files; local files can have wildcards (*)
["samples/sample-%s-zlib.root" % x
for x in ["5.23.02", "5.24.00", "5.25.02", "5.26.00", "5.27.02", "5.28.00",
"5.29.02", "5.30.00", "6.08.04", "6.10.05", "6.14.00"]],
# TTree name in each file
"sample",
# branch(s) in each file for lazyarray(s)
"f8")
data
Out[36]:
This data represents the entire set of files, and the only up-front processing that had to be done was to find out how many entries each TTree contains.
It uses the uproot.numentries shortcut method (which reads less data than normal file-opening):
In [37]:
dict(uproot.numentries(
# list of files; local files can have wildcards (*)
["samples/sample-%s-zlib.root" % x
for x in ["5.23.02", "5.24.00", "5.25.02", "5.26.00", "5.27.02", "5.28.00",
"5.29.02", "5.30.00", "6.08.04", "6.10.05", "6.14.00"]],
# TTree name in each file
"sample",
# total=True adds all values; total=False leaves them as a dict
total=False))
Out[37]:
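Knowing only the number of entries per file is enough to address any entry of the combined dataset. The sketch below shows one way to do that bookkeeping; the filenames and counts are made up, and locate is a hypothetical helper rather than part of uproot.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical per-file entry counts, in file order.
numentries = {"a.root": 30, "b.root": 45, "c.root": 25}

files = list(numentries)
# offsets[i] is the global index of the first entry in files[i].
offsets = [0] + list(accumulate(numentries.values()))


def locate(global_index):
    """Map a global entry index to (filename, local entry index)."""
    if not 0 <= global_index < offsets[-1]:
        raise IndexError(global_index)
    i = bisect_right(offsets, global_index) - 1
    return files[i], global_index - offsets[i]


first_of_second_file = locate(30)
last_entry = locate(99)
```

A lookup like this never opens a file by itself; it only says which file would have to be read if the entry were requested.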
By default, lazy arrays hold onto all data that have been read as long as the lazy array continues to exist. To use a lazy array as a window into a very large dataset, you'll have to limit how much it's allowed to keep in memory at a time.
This is caching, and the caching mechanism is the same as before:
In [38]:
mycache = uproot.cache.ArrayCache(100*1024) # 100 kB
data = events.lazyarrays(entrysteps=500, cache=mycache)
data
Out[38]:
Before performing a calculation, the cache is empty.
In [39]:
len(mycache)
Out[39]:
In [40]:
numpy.sqrt((data["E1"] + data["E2"])**2 - (data["px1"] + data["px2"])**2 -
(data["py1"] + data["py2"])**2 - (data["pz1"] + data["pz2"])**2)
Out[40]:
After performing the calculation, the cache contains only as many chunks as it could hold.
In [41]:
# (chunks in cache, chunks touched to compute (E1 + E2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2)
len(mycache), len(data["E1"].chunks) * 8
Out[41]:
The ChunkedArray and VirtualArray classes are defined in the awkward-array library installed with uproot. These arrays can be saved to files in a way that preserves their virtualness, which allows you to save a "diff" with respect to the original ROOT files.
Below, we load lazy arrays from a ROOT file with persistvirtual=True and add a derived feature:
In [42]:
data = events.lazyarrays(["E*", "p[xyz]*"], persistvirtual=True)
data["mass"] = numpy.sqrt((data["E1"] + data["E2"])**2 - (data["px1"] + data["px2"])**2 -
(data["py1"] + data["py2"])**2 - (data["pz1"] + data["pz2"])**2)
and save the whole thing to an awkward-array file (.awkd).
In [43]:
import awkward
awkward.save("derived-feature.awkd", data, mode="w")
When we read it back, the derived features come from the awkward-array file but the original features are loaded as pointers to the original ROOT files (VirtualArrays whose array-making function knows the original ROOT filenames, so don't move them!).
In [44]:
data2 = awkward.load("derived-feature.awkd")
In [45]:
# reads from derived-feature.awkd
data2["mass"]
Out[45]:
In [46]:
# reads from the original ROOT files
data2["E1"]
Out[46]:
Similarly, a dataset with a cut applied saves the identities of the selected events but only pointers to the original ROOT data. This acts as a lightweight skim.
In [47]:
selected = data[data["mass"] < 80]
selected
Out[47]:
In [48]:
awkward.save("selected-events.awkd", selected, mode="w")
In [49]:
data3 = awkward.load("selected-events.awkd")
data3
Out[49]:
Dask is a framework for delayed and distributed computation with lazy array and dataframe interfaces. To turn uproot's lazy arrays into Dask objects, use the uproot.daskarray and uproot.daskframe functions.
In [50]:
uproot.daskarray("https://scikit-hep.org/uproot/examples/Zmumu.root", "events", "E1")
Out[50]:
In [51]:
uproot.daskframe("https://scikit-hep.org/uproot/examples/Zmumu.root", "events")
Out[51]:
Lazy arrays implicitly step through chunks of data to give you the impression that you have a larger array than memory can hold all at once. The next two methods explicitly step through chunks of data, to give you more control over the process.
TTreeMethods.iterate iterates over chunks of a TTree and uproot.iterate iterates through files.
Like a file-spanning lazy array, a file-spanning iterator erases the difference between files and may be used as a TChain alternative. However, the iteration is over chunks of many events, not single events.
In [52]:
histogram = None
for data in events.iterate(["E*", "p[xyz]*"], namedecode="utf-8"):
    # operate on a batch of data in the loop
    mass = numpy.sqrt((data["E1"] + data["E2"])**2 - (data["px1"] + data["px2"])**2 -
                      (data["py1"] + data["py2"])**2 - (data["pz1"] + data["pz2"])**2)
    # accumulate results
    counts, edges = numpy.histogram(mass, bins=120, range=(0, 120))
    if histogram is None:
        histogram = counts, edges
    else:
        histogram = histogram[0] + counts, edges
In [53]:
%matplotlib inline
import matplotlib.pyplot
counts, edges = histogram
matplotlib.pyplot.step(x=edges, y=numpy.append(counts, 0), where="post");
matplotlib.pyplot.xlim(edges[0], edges[-1]);
matplotlib.pyplot.ylim(0, counts.max() * 1.1);
matplotlib.pyplot.xlabel("mass");
matplotlib.pyplot.ylabel("events per bin");
This differs from the lazy array approach in that you need to explicitly manage the iteration, as in this histogram accumulation. However, since we aren't caching, the previous array batch is deleted as soon as data goes out of scope, so it is easier to control which arrays are in memory and which aren't.
Choose lazy arrays or iteration according to the degree of control you need.
uproot.iterate crosses file boundaries as part of its iteration, and that's information we might need in the loop. If reportpath, reportfile, or reportentries is True, each step in iteration is a tuple containing the arrays and the additional information.
In [54]:
for path, file, start, stop, arrays in uproot.iterate(
        ["https://scikit-hep.org/uproot/examples/sample-%s-zlib.root" % x
         for x in ["5.23.02", "5.24.00", "5.25.02", "5.26.00", "5.27.02", "5.28.00",
                   "5.29.02", "5.30.00", "6.08.04", "6.10.05", "6.14.00"]],
        "sample",
        "f8",
        reportpath=True, reportfile=True, reportentries=True):
    print(path, file, start, stop, len(arrays))
All array-reading functions have the following parameters: entrystart, which defaults to 0, and entrystop, which defaults to numentries.
Setting entrystart and/or entrystop differs from slicing the resulting array in that slicing reads, then discards, but these parameters minimize the data to read.
In [55]:
len(events.array("E1", entrystart=100, entrystop=300))
Out[55]:
As with Python slices, the entrystart and entrystop can be negative to count from the end of the TTree.
In [56]:
events.array("E1", entrystart=-10)
Out[56]:
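Since the text compares these parameters to Python slices, the built-in slice object gives a convenient way to picture how negative or missing values resolve to absolute entry numbers. This is a sketch assuming that the rules match Python's slice semantics exactly; normalize is a hypothetical helper and 2304 is an arbitrary, made-up tree length.

```python
def normalize(entrystart, entrystop, numentries):
    """Resolve None and negative values to absolute entry numbers."""
    start, stop, _ = slice(entrystart, entrystop).indices(numentries)
    return start, max(start, stop)       # never return a negative-length range


numentries = 2304                         # made-up number of entries

whole_tree = normalize(None, None, numentries)
middle = normalize(100, 300, numentries)
last_ten = normalize(-10, None, numentries)
```

The range is half-open, so the number of entries read is always stop minus start.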
Internally, ROOT files are written in chunks and whole chunks must be read, so the best places to set entrystart and entrystop are between basket boundaries.
In [57]:
# This file has small TBaskets
tree = uproot.open("https://scikit-hep.org/uproot/examples/foriter.root")["foriter"]
branch = tree["data"]
[branch.basket_numentries(i) for i in range(branch.numbaskets)]
Out[57]:
In [58]:
# (entrystart, entrystop) pairs where ALL the TBranches' TBaskets align
list(tree.clusters())
Out[58]:
Or simply,
In [59]:
branch.baskets()
Out[59]:
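The notion of a cluster (an entry range on whose edges the baskets of all branches align) can be computed from basket sizes alone. The following sketch uses made-up basket sizes for three hypothetical branches and is not how uproot's clusters() is implemented.

```python
from itertools import accumulate

# Hypothetical number of entries in each basket, per branch.
basket_entries = {
    "x": [6, 6, 6, 6],
    "y": [12, 12],
    "z": [4, 4, 4, 12],
}

# Entry numbers at which each branch ends a basket.
boundaries = [set(accumulate(sizes)) for sizes in basket_entries.values()]

# A cluster ends wherever every branch ends a basket at the same entry.
common = sorted(set.intersection(*boundaries))
clusters = list(zip([0] + common[:-1], common))
```

Reading exactly one cluster at a time means that no basket of any branch has to be read twice or partially discarded.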
In addition to entrystart and entrystop, the lazy array and iteration functions also have entrysteps, which can be a number of entries per chunk or step, numpy.inf to make the chunks/steps as big as possible (limited by file boundaries), a memory size string, or a list of (entrystart, entrystop) pairs to be explicit.
In [60]:
[len(chunk) for chunk in events.lazyarrays(entrysteps=500)["E1"].chunks]
Out[60]:
In [61]:
[len(data[b"E1"]) for data in events.iterate(["E*", "p[xyz]*"], entrysteps=500)]
Out[61]:
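A fixed number of entries per step simply partitions the entry range, with a shorter final step when the total is not a multiple of the step size. Here is a minimal sketch of that partitioning; entry_ranges is a hypothetical helper, and the total of 1200 entries is made up.

```python
def entry_ranges(numentries, entrysteps):
    """Yield (entrystart, entrystop) pairs covering [0, numentries)."""
    if isinstance(entrysteps, int):
        for start in range(0, numentries, entrysteps):
            yield start, min(start + entrysteps, numentries)
    else:
        # Already an explicit list of (entrystart, entrystop) pairs.
        yield from entrysteps


fixed = list(entry_ranges(1200, 500))
explicit = list(entry_ranges(1200, [(0, 100), (100, 1200)]))
lengths = [stop - start for start, stop in fixed]
```

The explicit form is passed through unchanged, which is what makes it useful for aligning steps with boundaries you have computed yourself.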
The TTree lazy array/iteration functions (TTreeMethods.array, TTreeMethods.arrays, TBranch.lazyarray, TTreeMethods.lazyarray, and TTreeMethods.lazyarrays) use basket or cluster sizes as a default entrysteps, while multi-file lazy array/iteration functions (uproot.lazyarrays and uproot.iterate) use the maximum per file: numpy.inf.
In [62]:
# This file has small TBaskets
tree = uproot.open("https://scikit-hep.org/uproot/examples/foriter.root")["foriter"]
branch = tree["data"]
[len(a["data"]) for a in tree.iterate(namedecode="utf-8")]
Out[62]:
In [63]:
# This file has small TBaskets
[len(a["data"]) for a in uproot.iterate(["https://scikit-hep.org/uproot/examples/foriter.root"] * 3,
"foriter", namedecode="utf-8")]
Out[63]:
One particularly useful way to specify the entrysteps is with a memory size string. This string consists of a number followed by a memory unit: B for bytes, kB for kilobytes, MB, GB, and so on (whitespace and case insensitive).
The chunks are not guaranteed to fit the memory size perfectly or even be less than the target size. Uproot picks a fixed number of events that approximates this size on average. The result depends on the number of branches chosen because it is the total size of the set of branches that are chosen for the memory target.
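The dependence of the step size on the number of selected branches follows from simple arithmetic, sketched below. Everything here is an assumption made for illustration: parse_memory_size and entries_per_step are hypothetical helpers, only a few units are handled, every branch is taken to hold one 8-byte value per entry, and rounding to the nearest entry is a choice of this sketch rather than a documented rule.

```python
import re

# Units are powers of 1024, not 1000.
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_memory_size(text):
    """Convert a string such as "50 kB" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*", text.lower())
    if match is None or match.group(2) not in UNITS:
        raise ValueError("unrecognized memory size: %r" % text)
    return int(float(match.group(1)) * UNITS[match.group(2)])


def entries_per_step(target, bytes_per_entry):
    """Fixed number of entries whose total size approximates the target."""
    return max(1, round(parse_memory_size(target) / bytes_per_entry))


# Eight 8-byte branches versus twenty 8-byte branches (made-up layouts).
few_branches = entries_per_step("50 kB", 8 * 8)
many_branches = entries_per_step("50 kB", 20 * 8)
```

With more branches selected, each entry costs more bytes, so the same memory target yields fewer entries per step.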
In [64]:
[len(data[b"E1"]) for data in events.iterate(["E*", "p[xyz]*"], entrysteps="50 kB")]
Out[64]:
In [65]:
[len(data[b"E1"]) for data in events.iterate(entrysteps="50 kB")]
Out[65]:
Since lazy arrays represent all branches but we won't necessarily be reading all branches, memory size chunking is less useful for lazy arrays, but you can do it because all function parameters are treated consistently.
In [66]:
[len(chunk) for chunk in events.lazyarrays(entrysteps="50 kB")["E1"].chunks]
Out[66]:
Since iteration gives you more precise control over which set of events you're processing at a given time, caching with the cache parameter is less useful than it is with lazy arrays. For consistency's sake, the TTreeMethods.iterate and uproot.iterate functions provide a cache parameter and it works the same way that it does in other array-reading functions, but its effect would be to retain the previous step's arrays while working on a new step in the iteration. Presumably, the reason you're iterating is because only the current step fits into memory, so this is not a useful feature.
However, the basketcache is very useful for iteration, more so than it is for lazy arrays. If an iteration step falls in the middle of a TBasket, the whole TBasket must be read in that step, despite the fact that only part of it is incorporated into the output array. The remainder of the TBasket will be used in the next iteration step, so caching it for exactly one iteration step is ideal: it avoids the need to reread it and decompress it again.
It is such a useful feature that it's built into TTreeMethods.iterate and uproot.iterate by default. If you don't set a basketcache, these functions will create one with no memory limit and save TBaskets in it for exactly one iteration step, eliminating that temporary cache at the end of iteration. (The same is true of the keycache; see reference documentation for detail.)
Thus, you probably don't want to set any explicit caches while iterating. Setting an explicit basketcache would introduce an upper limit on how much it can store, but it would lose the property of evicting after exactly one iteration step (because the connection between the cache object and the iterator would be lost). If you're running out of memory during iteration, try reducing the entrysteps.
When we ask for TTreeMethods.arrays (plural), TTreeMethods.iterate, or uproot.iterate, we get a Python dict mapping branch names to arrays. (As a reminder, namedecode="utf-8" makes those branch names Python strings, rather than bytestrings.) Sometimes, we want a different kind of container.
One particularly useful container is tuple, which can be unpacked by a tuple-assignment.
In [67]:
px, py, pz = events.arrays("p[xyz]1", outputtype=tuple)
In [68]:
px
Out[68]:
Using tuple as an outputtype in TTreeMethods.iterate and uproot.iterate lets us unpack the arrays in Python's for statement.
In [69]:
for px, py, pz in events.iterate("p[xyz]1", outputtype=tuple):
    px**2 + py**2 + pz**2
Another useful type is collections.namedtuple, which packs everything into a single object, but the fields are accessible by name.
In [70]:
import collections # from the Python standard library
a = events.arrays("p[xyz]1", outputtype=collections.namedtuple)
In [71]:
a.px1
Out[71]:
You can also use your own classes.
In [72]:
class Stuff:
    def __init__(self, px, py, pz):
        self.p = numpy.sqrt(px**2 + py**2 + pz**2)
    def __repr__(self):
        return "<Stuff %r>" % self.p
events.arrays("p[xyz]1", outputtype=Stuff)
Out[72]:
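One way to picture what these container choices have in common is that the arrays are handed to the container in branch order. The sketch below is a guess at the general pattern, not uproot's code: build_output and Magnitude are hypothetical, and short Python lists stand in for arrays.

```python
import collections
import math


def build_output(arrays, outputtype):
    """Pack a dict of arrays into the requested container (hypothetical)."""
    if outputtype is dict:
        return dict(arrays)
    if outputtype is tuple:
        return tuple(arrays.values())
    if outputtype is collections.namedtuple:
        # Field names are taken from the branch names.
        return collections.namedtuple("Arrays", list(arrays))(*arrays.values())
    return outputtype(*arrays.values())   # any other class: positional arguments


class Magnitude:
    """User-defined container that derives a quantity from its inputs."""

    def __init__(self, px, py, pz):
        self.p = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(px, py, pz)]


# Made-up stand-ins for three branches.
arrays = {"px1": [3.0, 0.0], "py1": [4.0, 0.0], "pz1": [0.0, 2.0]}

px, py, pz = build_output(arrays, tuple)
named = build_output(arrays, collections.namedtuple)
custom = build_output(arrays, Magnitude)
```

The custom class receives one positional argument per selected branch, so its constructor must accept as many arguments as the branch selection matches.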
And perhaps most importantly, you can pass in pandas.DataFrame.
In [73]:
import pandas
events.arrays("p[xyz]1", outputtype=pandas.DataFrame, entrystop=10)
Out[73]:
The previous example filled a pandas.DataFrame by explicitly passing it as an outputtype. Pandas is such an important container type that there are specialized functions for it: TTreeMethods.pandas.df and uproot.pandas.df.
In [74]:
events.pandas.df("p[xyz]1", entrystop=10)
Out[74]:
The entry index in the resulting DataFrame represents the actual entry numbers in the file. For instance, counting from the end:
In [75]:
events.pandas.df("p[xyz]1", entrystart=-10)
Out[75]:
The uproot.pandas.iterate function doesn't have a reportentries parameter because the entry numbers are included in the DataFrame itself.
In [76]:
for df in uproot.pandas.iterate("https://scikit-hep.org/uproot/examples/Zmumu.root", "events", "p[xyz]1", entrysteps=500):
print(df[:3])
Part of the motivation for a special function is that it's the first of potentially many external connectors (Dask is another: see above). The other part is that these functions have more Pandas-friendly default parameters, such as flatten=True.
Flattening turns multiple values per entry (i.e. multiple particles per event) into separate DataFrame rows, maintaining the nested structure in the DataFrame index. Flattening is usually undesirable for arrays—because arrays don't have an index to record that information—but it's usually desirable for DataFrames.
In [77]:
events2 = uproot.open("https://scikit-hep.org/uproot/examples/HZZ.root")["events"] # non-flat data
In [78]:
events2.pandas.df(["MET_p*", "Muon_P*"], entrystop=10, flatten=False) # not the default
Out[78]:
DataFrames like the above are slow (the cell entries are Python lists) and difficult to use in Pandas. Pandas doesn't have specialized functions for manipulating this kind of structure.
However, if we use the default flatten=True:
In [79]:
df = events2.pandas.df(["MET_p*", "Muon_P*"], entrystop=10)
df
Out[79]:
The particles-within-events structure is encoded in the pandas.MultiIndex, and we can use Pandas functions like DataFrame.unstack to manipulate that structure.
In [80]:
df.unstack()
Out[80]:
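Flattening can be pictured as turning each (event, particle) pair into its own row, keyed by a two-level index. This plain-Python sketch uses made-up values and hypothetical column names; in it, an event with no particles contributes no rows, which is a property of this sketch and not a statement about every uproot setting.

```python
# Made-up jagged data: a variable number of muons per event.
muon_px = [[10.0, 20.0], [], [30.0]]
met = [1.5, 2.5, 3.5]              # exactly one value per event

rows = {}
for entry, particles in enumerate(muon_px):
    for subentry, value in enumerate(particles):
        # Per-event values are repeated on each of the event's rows.
        rows[(entry, subentry)] = {"met": met[entry], "muon_px": value}

index = list(rows)
```

The (entry, subentry) keys play the role of the MultiIndex: they are what lets the nested structure be recovered from a flat table.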
There's also a flatten=None that skips all non-flat TBranches, included as a convenience against overzealous branch selection.
In [81]:
events2.pandas.df(["MET_p*", "Muon_P*"], entrystop=10, flatten=None)
Out[81]:
We have already seen that TBranches can be selected as lists of strings and with wildcards. This is the same wildcard pattern that filesystems use to match file lists: * can be replaced with any text (or none), ? can be replaced by one character, and [...] specifies a list of alternate characters.
Wildcard patterns are quick to write, but limited relative to regular expressions. Any branch request between slashes (/ inside the quotation marks) will be interpreted as regular expressions instead (i.e. .* instead of *).
In [82]:
events.arrays("p[xyz]?").keys() # using wildcards
Out[82]:
In [83]:
events.arrays("/p[x-z].?/").keys() # using regular expressions
Out[83]:
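The two matching styles can be tried out on plain strings with the standard library. This is a sketch with assumptions: select is a hypothetical helper, the branch names are made up, and the regular expression is required to match the whole name (re.fullmatch), which may or may not be how uproot anchors its patterns.

```python
import fnmatch
import re

# Made-up branch names.
branch_names = ["E1", "px1", "py1", "pz1", "px2", "Q1"]


def select(names, request):
    """Select names by wildcard, or by regular expression if /slashed/."""
    if len(request) >= 2 and request.startswith("/") and request.endswith("/"):
        pattern = re.compile(request[1:-1])
        return [name for name in names if pattern.fullmatch(name)]
    return [name for name in names if fnmatch.fnmatchcase(name, request)]


by_wildcard = select(branch_names, "p[xyz]?")
by_regex = select(branch_names, "/p[x-z].?/")
```

Both requests pick the same names here; the regular expression form becomes worthwhile when the selection needs alternation, repetition counts, or other features that wildcards lack.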
If, instead of strings, you pass a function from branch objects to True or False, the branches will be selected by evaluating the function as a filter. This is a way of selecting branches based on properties other than their names.
In [84]:
events.arrays(lambda branch: branch.compressionratio() > 3).keys()
Out[84]:
Note that the return values must be strictly True and False, not anything that Python evaluates to true or false. If the function returns anything else, it will be used as a new Interpretation for the branch.
In [85]:
events.show()
Every branch has a default interpretation, such as
In [86]:
events["E1"].interpretation
Out[86]:
meaning big-endian, 8-byte floating point numbers as a Numpy dtype. We could interpret this branch with a different Numpy dtype, but it wouldn't be meaningful.
In [87]:
events["E1"].array(uproot.asdtype(">i8"))
Out[87]:
Instead of reading the values as floating point numbers, we've read them as integers. It's unlikely that you'd ever want to do that, unless the default interpretation is wrong.
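What "reading the same bytes with a different type" means can be demonstrated with the standard struct module, independently of uproot. The values below are made up; the format codes ">d" and ">q" are struct's notation for big-endian 8-byte floats and big-endian 8-byte signed integers.

```python
import struct

values = [1.0, 2.5, -3.0]
raw = struct.pack(">3d", *values)            # big-endian 8-byte floats

as_floats = struct.unpack(">3d", raw)        # the correct interpretation
as_integers = struct.unpack(">3q", raw)      # same bytes read as integers
```

The integers are just the bit patterns of the floating point numbers (1.0 becomes 0x3FF0000000000000), which is why such a reinterpretation is only useful when the default interpretation is wrong.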
One actually useful TBranch reinterpretation is uproot.asarray. It differs from uproot.asdtype only in that the latter creates a new array when reading data while the former fills a user-specified array.
In [88]:
myarray = numpy.zeros(events.numentries, dtype=numpy.float32) # (different size)
reinterpretation = events["E1"].interpretation.toarray(myarray)
reinterpretation
Out[88]:
Passing the new uproot.asarray interpretation to the array-reading function
In [89]:
events["E1"].array(reinterpretation)
Out[89]:
fills and returns that array. When you look at the myarray object, you can see that it is now filled, overwriting whatever might have been in it before.
In [90]:
myarray
Out[90]:
This is useful for speed-critical applications or ones in which the array is managed by an external system. The array could be NUMA-allocated in a supercomputer or CPU/GPU managed by PyTorch, for instance.
As the provider of the array, it is your responsibility to ensure that it has enough elements to hold the (possibly type-converted) output. (Failure to do so only results in an exception, not a segmentation fault or anything.)
In [91]:
events.arrays(lambda branch: isinstance(branch.interpretation, uproot.asdtype) and
str(branch.interpretation.fromdtype) == ">f8").keys()
Out[91]:
This is because a function that returns objects selects branches and sets their interpretations in one pass.
In [92]:
events.arrays(lambda branch: uproot.asdtype(">f8", "<f4") if branch.name.startswith(b"px") else None)
Out[92]:
The above selects TBranch names that start with "px", read-interprets them as big-endian 8-byte floats and writes them as little-endian 4-byte floats. The selector returns None for the TBranches to exclude and an Interpretation for the ones to reinterpret.
The same could have been said in a less functional way with a dict:
In [93]:
events.arrays({"px1": uproot.asdtype(">f8", "<f4"),
"px2": uproot.asdtype(">f8", "<f4")})
Out[93]:
So far, you've seen a lot of examples with one value per event, but multiple values per event are very common. In the simplest case, the value in each event is a vector, matrix, or tensor with a fixed number of dimensions, such as a 3-vector or a set of parton weights from a Monte Carlo.
Here's an artificial example:
In [94]:
tree = uproot.open("https://scikit-hep.org/uproot/examples/nesteddirs.root")["one/two/tree"]
array = tree.array("ArrayInt64", entrystop=20)
array
Out[94]:
The resulting array has a non-trivial Numpy shape, but otherwise, it has the same Numpy array type as the other arrays you've seen (apart from the lazy array types, ChunkedArray and VirtualArray, which are not Numpy objects).
In [95]:
array.shape
Out[95]:
All but the first dimension of the shape parameter (the "length") are known before reading the array: they make up the dtype shape.
In [96]:
tree["ArrayInt64"].interpretation
Out[96]:
In [97]:
tree["ArrayInt64"].interpretation.todtype.shape
Out[97]:
The dtype shape of a TBranch with one value per event (simple, 1-dimensional arrays) is an empty tuple.
In [98]:
tree["Int64"].interpretation.todtype.shape
Out[98]:
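The notion of a dtype shape is a plain Numpy feature and can be explored without a ROOT file. In this small sketch the inner size of ten and the length of twenty are arbitrary choices, not values read from the example file.

```python
import numpy

# A dtype whose items are themselves fixed-size arrays of ten big-endian int64.
fixed = numpy.dtype((">i8", (10,)))
scalar = numpy.dtype(">i8")

fixed_shape = fixed.shape      # shape of one item, known without any data
scalar_shape = scalar.shape    # empty tuple: one value per item

# Allocating N items prepends the length to the dtype shape.
array = numpy.zeros(20, dtype=fixed)
```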
Fixed-width arrays are exploded into one column per element when viewed as a pandas.DataFrame.
In [99]:
tree.pandas.df("ArrayInt64", entrystop=20)
Out[99]:
Another of ROOT's fundamental TBranch types is a "leaf-list," or a TBranch with multiple TLeaves. (Note: in ROOT terminology, "TBranch" is a data structure that usually points to data in TBaskets and "TLeaf" is the data type descriptor. TBranches and TLeaves have no relationship to the interior and endpoints of a tree structure in computer science.)
The Numpy analogue of a leaf-list is a structured array, a dtype with named fields, which is Numpy's view into a C array of structs (with or without padding).
In [100]:
tree = uproot.open("https://scikit-hep.org/uproot/examples/leaflist.root")["tree"]
array = tree.array("leaflist")
array
Out[100]:
This array is presented as an array of tuples, though it's actually a contiguous block of memory with floating point numbers ("x"), integers ("y"), and single characters ("z") adjacent to each other.
In [101]:
array[0]
Out[101]:
In [102]:
array["x"]
Out[102]:
In [103]:
array["y"]
Out[103]:
In [104]:
array["z"]
Out[104]:
In [105]:
array.dtype
Out[105]:
In [106]:
array.dtype.itemsize
Out[106]:
ROOT TBranches may have multiple values per event and a leaf-list structure, and Numpy arrays may have non-trivial shape and dtype fields, so the translation between ROOT and Numpy is one-to-one.
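A structured array can be built by hand from packed bytes, which makes the "C array of structs" picture concrete. The field widths here (4-byte float, 4-byte integer, 1-byte character, no padding) are assumptions chosen for illustration, not read from the example file.

```python
import struct
import numpy

# Two records of (float32, int32, 1-byte string), packed without padding.
raw = struct.pack(">fic", 1.5, 1, b"a") + struct.pack(">fic", 2.5, 2, b"b")

# The matching Numpy structured dtype; the field names are arbitrary here.
record = numpy.dtype([("x", ">f4"), ("y", ">i4"), ("z", "S1")])

# View the bytes as an array of records without copying or parsing.
array = numpy.frombuffer(raw, dtype=record)
```

Selecting array["x"] or array["y"] then returns a strided view of one field across all records.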
Leaf-list TBranches are exploded into one column per field when viewed as a pandas.DataFrame.
In [107]:
tree.pandas.df("leaflist")
Out[107]:
The flatname parameter determines how fixed-width arrays and field names are translated into Pandas names; the default is uproot._connect._pandas.default_flatname (a function from branchname (str), fieldname (str), index (int) to Pandas column name (str)).
In physics data, it is even more common to have an arbitrary number of values per event than a fixed number of values per event. Consider, for instance, particles produced in a collision, tracks in a jet, hits on a track, etc.
Unlike fixed-width arrays and leaf-lists, this type has no analogue in Numpy. It is fundamentally outside of Numpy's scope because Numpy describes rectangular tables of data. As we have seen above, Pandas has some support for this so-called "jagged" (sometimes "ragged") data, but only through manipulation of its index (pandas.MultiIndex), not the data themselves.
For this, uproot fills a new JaggedArray data structure (from the awkward-array library, like ChunkedArray and VirtualArray).
In [108]:
tree = uproot.open("https://scikit-hep.org/uproot/examples/nesteddirs.root")["one/two/tree"]
array = tree.array("SliceInt64", entrystop=20)
array
Out[108]:
These JaggedArrays are made of Numpy arrays and follow the same Numpy slicing rules, including advanced indexing.
Awkward-array generalizes Numpy in many ways—details can be found in its documentation.
In [109]:
array.counts
Out[109]:
In [110]:
array.flatten()
Out[110]:
In [111]:
array[:6]
Out[111]:
In [112]:
array[array.counts > 1, 0]
Out[112]:
Here is an example of JaggedArrays in physics data:
In [113]:
events2 = uproot.open("https://scikit-hep.org/uproot/examples/HZZ.root")["events"]
In [114]:
E, px, py, pz = events2.arrays(["Muon_E", "Muon_P[xyz]"], outputtype=tuple)
E
Out[114]:
In [115]:
pt = numpy.sqrt(px**2 + py**2)
p = numpy.sqrt(px**2 + py**2 + pz**2)
p
Out[115]:
In [116]:
eta = numpy.log((p + pz)/(p - pz))/2
eta
Out[116]:
In [117]:
phi = numpy.arctan2(py, px)
phi
Out[117]:
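The kinematic formulas in the cells above can be checked on a single particle with the standard library. The momentum components below are made-up numbers chosen so that the results are exact; the equivalent closed forms are included only as a cross-check.

```python
import math

px, py, pz = 3.0, 4.0, 12.0

pt = math.sqrt(px**2 + py**2)            # transverse momentum
p = math.sqrt(px**2 + py**2 + pz**2)     # total momentum

# Pseudorapidity from total and longitudinal momentum.
eta = math.log((p + pz) / (p - pz)) / 2

# Equivalent closed forms; eta is undefined when pt is zero (p == |pz|).
eta_atanh = math.atanh(pz / p)
eta_asinh = math.asinh(pz / pt)

# Azimuthal angle in the transverse plane, in the range (-pi, pi].
phi = math.atan2(py, px)
```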
In [118]:
pt.counts
Out[118]:
In [119]:
pt.flatten()
Out[119]:
In [120]:
pt[:6]
Out[120]:
Note that if you want to histogram the inner contents of these arrays (i.e. histogram of particles, ignoring event boundaries), functions like numpy.histogram require non-jagged arrays, so flatten them with a call to .flatten().
To select elements of inner lists (Pandas's DataFrame.xs), first require the list to have at least that many elements.
In [121]:
pt[pt.counts > 1, 0]
Out[121]:
JaggedArrays of booleans select from inner lists (i.e. put a cut on particles):
In [122]:
pt > 50
Out[122]:
In [123]:
eta[pt > 50]
Out[123]:
And Numpy arrays of booleans select from outer lists (i.e. put a cut on events):
In [124]:
eta[pt.max() > 50]
Out[124]:
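The two kinds of cut differ in the shape of the mask, which is easy to see with nested Python lists standing in for JaggedArrays. This is a slow, plain-Python model for illustration only; the numbers are invented, and treating an empty event as failing the event-level cut is an assumption of this sketch.

```python
pt = [[60.0, 20.0], [], [30.0], [55.0, 70.0, 10.0]]
eta = [[0.1, -1.2], [], [2.0], [0.5, -0.3, 1.1]]

# Particle-level cut: one boolean per particle. The number of events is
# unchanged, but each event may lose some of its particles.
particle_cut = [[e for p, e in zip(ps, es) if p > 50]
                for ps, es in zip(pt, eta)]

# Event-level cut: one boolean per event. Whole events are kept or dropped,
# and surviving events keep all of their particles.
event_mask = [len(ps) > 0 and max(ps) > 50 for ps in pt]
event_cut = [es for keep, es in zip(event_mask, eta) if keep]
```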
Reducers like count, sum, min, max, any (boolean), or all (boolean) apply per-event, turning a JaggedArray into a Numpy array.
In [125]:
pt.max()
Out[125]:
You can even do combinatorics, such as a.cross(b) to compute the Cartesian product of a and b per event, or a.choose(n) to form all distinct combinations of n elements per event.
In [126]:
pt.choose(2)
Out[126]:
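What these per-event combinatorics compute can be written out with itertools on nested lists. This sketch models the meaning only; it does not claim to reproduce the ordering or the output type that awkward-array uses.

```python
import itertools

# Per-event lists of values (a plain-Python stand-in for a JaggedArray).
a = [[1, 2, 3], [], [4, 5]]
b = [[10], [20, 30], [40, 50]]

# All pairs of distinct elements within each event, in the spirit of a.choose(2).
pairs = [list(itertools.combinations(event, 2)) for event in a]

# Cartesian product of the two collections within each event,
# in the spirit of a.cross(b).
cross = [list(itertools.product(x, y)) for x, y in zip(a, b)]
```

Events with fewer than two elements produce no pairs, and an event that is empty on either side produces an empty product.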
Some of these functions have "arg" versions that return integers, which can be used in indexing.
In [127]:
abs(eta).argmax()
Out[127]:
In [128]:
pairs = pt.argchoose(2)
pairs
Out[128]:
In [129]:
left = pairs.i0
right = pairs.i1
left, right
Out[129]:
Masses of unique pairs of muons, for events that have them:
In [130]:
masses = numpy.sqrt((E[left] + E[right])**2 - (px[left] + px[right])**2 -
(py[left] + py[right])**2 - (pz[left] + pz[right])**2)
masses
Out[130]:
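For a single pair, the mass expression above reduces to the following scalar calculation. The helper name invariant_mass and the input values are illustrative; natural units (c = 1) are assumed.

```python
import math

def invariant_mass(E1, px1, py1, pz1, E2, px2, py2, pz2):
    """Invariant mass of a two-particle system (natural units, c = 1)."""
    E = E1 + E2
    px, py, pz = px1 + px2, py1 + py2, pz1 + pz2
    return math.sqrt(E**2 - px**2 - py**2 - pz**2)

# Two massless particles, back to back along x, each with energy 45.0:
# the momenta cancel, so the mass equals the total energy.
back_to_back = invariant_mass(45.0, 45.0, 0.0, 0.0,
                              45.0, -45.0, 0.0, 0.0)

# Two massless particles moving in the same direction have zero invariant mass.
collinear = invariant_mass(10.0, 10.0, 0.0, 0.0,
                           20.0, 20.0, 0.0, 0.0)
```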
In [131]:
counts, edges = numpy.histogram(masses.flatten(), bins=120, range=(0, 120))
matplotlib.pyplot.step(x=edges, y=numpy.append(counts, 0), where="post");
matplotlib.pyplot.xlim(edges[0], edges[-1]);
matplotlib.pyplot.ylim(0, counts.max() * 1.1);
matplotlib.pyplot.xlabel("mass");
matplotlib.pyplot.ylabel("events per bin");
JaggedArrays are compact in memory and fast to read. Whereas root_numpy reads data like std::vector<float> per event into a Numpy array of Numpy arrays (Numpy's object "O" dtype), which has data locality issues, JaggedArray consists of two contiguous arrays: one containing content (the floats) and the other representing structure via offsets (random access) or counts.
In [132]:
masses.content
Out[132]:
In [133]:
masses.offsets
Out[133]:
In [134]:
masses.counts
Out[134]:
Fortunately, ROOT files are themselves structured this way, with variable-width data represented by contents and offsets in a TBasket. These arrays do not need to be deserialized individually, but can be merely cast as Numpy arrays in one Python call. The lack of per-event processing is why reading in uproot and processing data with awkward-array can be fast, despite being written in Python.
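The content-plus-offsets layout can be reproduced with two ordinary Numpy arrays. This is a hand-rolled sketch of the idea, not the JaggedArray class itself; the values are invented.

```python
import numpy

# Three events holding 2, 0, and 3 values respectively.
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = numpy.array([0, 2, 2, 5])

# Counts are differences of adjacent offsets.
counts = offsets[1:] - offsets[:-1]

def event(i):
    """Return the values of event i as a view into content (no copy)."""
    return content[offsets[i]:offsets[i + 1]]

# First value of every non-empty event, computed without a Python loop.
starts = offsets[:-1]
first = content[starts[counts > 0]]
```

Because every operation is a slice or an index into a contiguous array, nothing has to be processed event by event.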
Although any C++ type can in principle be read (see below), some are important enough to be given convenience methods for analysis. These are not defined in uproot (which is strictly concerned with I/O), but in uproot-methods. If you need certain classes to have user-friendly methods in Python, you're encouraged to contribute them to uproot-methods.
One of these classes is TLorentzVectorArray, which defines an array of Lorentz vectors.
In [135]:
events3 = uproot.open("https://scikit-hep.org/uproot/examples/HZZ-objects.root")["events"]
In [136]:
muons = events3.array("muonp4")
muons
Out[136]:
In the print-out, these appear to be Python objects, but they're high-performance arrays that are only turned into objects when you look at individual elements.
In [137]:
muon = muons[0, 0]
type(muon), muon
Out[137]:
This object has all the usual kinematics methods,
In [138]:
muon.mass
Out[138]:
In [139]:
muons[0, 0].delta_phi(muons[0, 1])
Out[139]:
But an array of Lorentz vectors also has these methods, and they are computed in bulk (faster than creating each object and calling the method on each).
In [140]:
muons.mass # some mass**2 are slightly negative, hence the Numpy warning about negative square roots
Out[140]:
(Note: if you don't want to see Numpy warnings, use numpy.seterr.)
In [141]:
pairs = muons.choose(2)
lefts = pairs.i0
rights = pairs.i1
lefts.delta_r(rights)
Out[141]:
TBranches with C++ class TLorentzVector are automatically converted into TLorentzVectorArrays. Although they're in wide use, the C++ TLorentzVector class is deprecated in favor of ROOT::Math::LorentzVector. Unlike the old class, the new vectors can be represented with a variety of data types and coordinate systems, and they're split into multiple branches, so uproot sees them as four branches, one per component.
You can still use the TLorentzVectorArray Python class; you just need to use a special constructor to build the object from its branches.
In [142]:
# Suppose you have four component branches...
E, px, py, pz = events2.arrays(["Muon_E", "Muon_P[xyz]"], outputtype=tuple)
In [143]:
import uproot_methods
array = uproot_methods.TLorentzVectorArray.from_cartesian(px, py, pz, E)
array
Out[143]:
There are constructors for different coordinate systems. Internally, TLorentzVectorArray uses the coordinates you give it and only converts to other systems on demand.
In [144]:
[x for x in dir(uproot_methods.TLorentzVectorArray) if x.startswith("from_")]
Out[144]:
In [145]:
branch = uproot.open("https://scikit-hep.org/uproot/examples/sample-6.14.00-zlib.root")["sample"]["str"]
branch.array()
Out[145]:
As with most strings from ROOT, they are unencoded bytestrings (see the b before each quote). Since they're not names, there's no namedecode, but they can be decoded as needed using the usual Python method.
In [146]:
[x.decode("utf-8") for x in branch.array()]
Out[146]:
Uproot does not have a hard-coded deserialization for every C++ class type; it uses the "streamers" that ROOT includes in each file to learn how to deserialize the objects in that file. Even if you defined your own C++ classes, uproot should be able to read them. (Caveat: not all structure types have been implemented, so the coverage of C++ types is a work in progress.)
In some cases, the deserialization is simplified by the fact that ROOT has "split" the objects. Instead of seeing a JaggedArray of objects, you see a JaggedArray of each attribute separately, such as the components of a ROOT::Math::LorentzVector.
In the example below, Track objects under fTracks have been split into fTracks.fUniqueID, fTracks.fBits, fTracks.fPx, fTracks.fPy, fTracks.fPz, etc.
In [150]:
!wget https://scikit-hep.org/uproot/examples/Event.root
In [151]:
tree = uproot.open("Event.root")["T"]
tree.show()
In this view, many of the attributes are not special classes and can be read as arrays of numbers,
In [152]:
tree.array("fTemperature", entrystop=20)
Out[152]:
as arrays of fixed-width matrices,
In [153]:
tree.array("fMatrix[4][4]", entrystop=6)
Out[153]:
as jagged arrays (of ROOT's "Float16_t" encoding),
In [154]:
tree.array("fTracks.fMass2", entrystop=6)
Out[154]:
or as jagged arrays of fixed arrays (of ROOT's "Double32_t" encoding),
In [155]:
tree.array("fTracks.fTArray[3]", entrystop=6)
Out[155]:
However, some types are not fully split by ROOT and have to be deserialized individually (not vectorially). This example includes histograms in the TTree, and histograms are sufficiently complex that they cannot be split.
In [156]:
tree.array("fH", entrystop=6)
Out[156]:
Each of those is a standard histogram object, something that would ordinarily be in a TDirectory, not a TTree. It has histogram convenience methods (see below).
In [157]:
for histogram in tree.array("fH", entrystop=3):
    print(histogram.title)
    print(histogram.values)
print("\n...\n")
for histogram in tree.array("fH", entrystart=-3):
    print(histogram.title)
    print(histogram.values)
The criterion for whether an object can be read vectorially in Numpy (fast) or individually in Python (slow) is whether it has a fixed width (all objects having the same number of bytes) or a variable width. You can see this in the TBranch's interpretation as the distinction between uproot.asobj (fixed width, vector read) and uproot.asgenobj (variable width, read into Python objects).
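The reason fixed width matters can be shown with two toy byte layouts. Both layouts are invented for this sketch and are not ROOT's serialization format: the point is only that fixed-width records can be decoded with one repeated format, while variable-width records force a sequential walk.

```python
import struct

# Fixed width: every record is four big-endian doubles (32 bytes), so the
# position of record i is known in advance and one format decodes them all.
fixed_raw = struct.pack(">8d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
fixed = list(struct.iter_unpack(">4d", fixed_raw))

# Variable width: each record is a 4-byte length followed by that many
# bytes, so the start of record i depends on every record before it.
variable_raw = b"".join(struct.pack(">i", len(s)) + s
                        for s in [b"one", b"", b"three"])

def read_variable(raw):
    """Walk length-prefixed records one at a time."""
    records, position = [], 0
    while position < len(raw):
        (length,) = struct.unpack_from(">i", raw, position)
        position += 4
        records.append(raw[position:position + length])
        position += length
    return records

variable = read_variable(variable_raw)
```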
In [158]:
# TLorentzVectors all have the same number of fixed width components, so they can be read vectorially.
events3["muonp4"].interpretation
Out[158]:
In [159]:
# Histograms contain name strings and variable length lists, so they must be read as Python objects.
tree["fH"].interpretation
Out[159]:
Variable length lists are an exception to the above, up to one level of depth. This is why JaggedArrays, representing types such as std::vector<T> for a fixed-width T, can be read vectorially. Unfortunately, the same does not apply to doubly nested jagged arrays, such as std::vector<std::vector<T>>.
In [160]:
branch = uproot.open("https://scikit-hep.org/uproot/examples/vectorVectorDouble.root")["t"]["x"]
branch.interpretation
Out[160]:
In [161]:
branch._streamer._fTypeName
Out[161]:
In [162]:
array = branch.array()
array
Out[162]:
Although you see something that looks like a JaggedArray, the type is ObjectArray, meaning that you only have some bytes with an auto-generated prescription for turning them into Python objects (from the "streamers," self-describing the ROOT file). You can't apply the usual JaggedArray slicing.
In [163]:
try:
    array[array.counts > 0, 0]
except Exception as err:
    print(type(err), err)
To get JaggedArray semantics, use awkward.fromiter to convert the arbitrary Python objects into awkward-arrays.
In [164]:
jagged = awkward.fromiter(array)
jagged
Out[164]:
In [165]:
jagged[jagged.counts > 0, 0]
Out[165]:
Doubly nested JaggedArrays are a native type in awkward-array: they can be any number of levels deep.
In [166]:
jagged.flatten()
Out[166]:
In [167]:
jagged.flatten().flatten()
Out[167]:
In [168]:
jagged.sum()
Out[168]:
In [169]:
jagged.sum().sum()
Out[169]:
Uproot supports reading, deserialization, and array-building in parallel. All of the array-reading functions have executor and blocking parameters:
executor: an Executor object, such as a concurrent.futures.ThreadPoolExecutor, which schedules and runs tasks in parallel. With executor=None, all of the work is done on the calling thread.
blocking: if True (default), the array-reading function blocks (waits) until the result is ready, then returns it. If False, it immediately returns a zero-argument function that, when called, blocks until the result is ready. This zero-argument function is a simple type of "future."
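The "zero-argument function as a future" pattern is independent of uproot and can be sketched with the standard library. The read_array function below is a hypothetical stand-in that only builds a list; it mimics the calling convention, not uproot's internals.

```python
import concurrent.futures

def read_array(n, executor=None, blocking=True):
    """Hypothetical reader: builds list(range(n)), optionally on an executor."""
    def work():
        return list(range(n))

    if executor is None:
        # No parallelism: the work is done right here, on the calling thread.
        result = work()
        wait = lambda: result
    else:
        future = executor.submit(work)
        wait = future.result      # blocks until the task has finished

    return wait() if blocking else wait

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    pending = read_array(5, executor=executor, blocking=False)
    # ... other work could happen here ...
    array = pending()                           # waits if necessary
    direct = read_array(3, executor=executor)   # blocking by default
```

Returning a plain callable keeps the interface small: the caller needs no knowledge of the executor's own future type.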
In [170]:
import concurrent.futures
# ThreadPoolExecutor divides work among multiple threads.
# Avoid ProcessPoolExecutor because the finalized arrays would have to be reserialized to pass between processes.
executor = concurrent.futures.ThreadPoolExecutor()
result = tree.array("fTracks.fVertex[3]", executor=executor, blocking=False)
result
Out[170]:
We can work on other things while the array is being read.
In [171]:
# and now get the array (waiting, if necessary, for it to complete)
result()
Out[171]:
The executor and blocking parameters are often used together, but they do not have to be. You can collect data in parallel but let the array-reading function block until it is finished:
In [172]:
tree.array("fTracks.fVertex[3]", executor=executor)
Out[172]:
The other case, non-blocking return without parallel processing (executor=None and blocking=False) is not very useful because all the work of creating the array would be done on the main thread (meaning: you have to wait) and then you would be returned a zero-argument function to reveal it.
Although parallel processing has been integrated into uproot's design, it only provides a performance improvement in cases that are dominated by read time in non-Python functions. Python's Global Interpreter Lock (GIL) severely limits parallel scaling of Python calls, but external functions that release the GIL (not all do) are immune.
Thus, if reading is slow because the ROOT file has a lot of small TBaskets, requiring uproot to step through them using Python calls, parallelizing that work in many threads has limited benefit because those threads stop and wait for each other due to Python's GIL. If reading is slow because the ROOT file is heavily compressed—for instance, with LZMA—then parallel reading is beneficial and scales well with the number of threads.
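The compression-dominated case can be imitated with the standard library alone. This sketch assumes, as is the case in CPython to my knowledge, that lzma.decompress releases the GIL while it runs in C; the chunks are synthetic and only loosely analogous to TBaskets, and no timing claims are made.

```python
import concurrent.futures
import lzma

# Several independently compressed chunks, loosely analogous to TBaskets.
chunks = [bytes([i]) * 100000 for i in range(8)]
compressed = [lzma.compress(chunk) for chunk in chunks]

# Each decompression call spends its time in C code, so threads can make
# progress concurrently instead of waiting on each other for the GIL.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    restored = list(executor.map(lzma.decompress, compressed))
```

executor.map preserves the order of its inputs, so the restored chunks line up with the originals regardless of which thread finishes first.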
If, on the other hand, processing time is dominated by your analysis code and not file-reading, then parallelizing the file-reading won't help. Instead, you want to parallelize your whole analysis, and a good way to do that in Python is with multiprocessing from the Python Standard Library.
If you do split your analysis into multiple processes, you probably don't want to also parallelize the array-reading within each process. It's easy to make performance worse by making it too complicated. Particle physics analysis is usually embarrassingly parallel, well suited to splitting the work into independent tasks, each of which is single-threaded.
Another option, of course, is to use a batch system (Condor, Slurm, GRID, etc.). It can be advantageous to parallelize your work across machines with a batch system and across CPU cores with multiprocessing.
TTrees are not the only kinds of objects to analyze in ROOT files; we are also interested in aggregated data in histograms, profiles, and graphs. Uproot uses the ROOT file's "streamers" to learn how to deserialize any object, but an anonymous deserialization often isn't useful:
In [179]:
file = uproot.open("Event.root")
dict(file.classes())
Out[179]:
In [180]:
processid = file["ProcessID0"]
processid
Out[180]:
What is a TProcessID?
In [181]:
processid._members()
Out[181]:
Something with an fName and fTitle...
In [182]:
processid._fName, processid._fTitle # note the underscore; these are private members
Out[182]:
Some C++ classes have Pythonic overloads to make them more useful in Python. Here's a way to find out which ones have been defined so far:
In [183]:
import pkgutil
[modname for importer, modname, ispkg in pkgutil.walk_packages(uproot_methods.classes.__path__)]
Out[183]:
This file contains TH1F objects; TH1F is a subclass of TH1, so the TH1 methods extend it.
In [184]:
file["htime"].edges
Out[184]:
In [185]:
file["htime"].values
Out[185]:
In [186]:
file["htime"].show()
The purpose of most of these methods is to extract data, which includes conversion to common Python formats.
In [187]:
uproot.open("https://scikit-hep.org/uproot/examples/issue33.root")["cutflow"].show()
In [188]:
file["htime"].pandas()
Out[188]:
In [189]:
print(file["htime"].hepdata())
Numpy histograms, used as a common format throughout the scientific Python ecosystem, are just a tuple of counts (bin contents) and edge positions. (There's one more edge than there are bins, to cover both the left and right ends.)
In [190]:
file["htime"].numpy()
Out[190]:
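The counts-and-edges convention is easy to inspect on a tiny made-up dataset, independent of any ROOT file:

```python
import numpy

data = [0.5, 1.5, 1.7, 2.2, 3.9]
counts, edges = numpy.histogram(data, bins=4, range=(0, 4))

# Bin i covers edges[i] <= x < edges[i + 1]; the last bin also includes
# its right edge.
lows, highs = edges[:-1], edges[1:]
```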
In [192]:
uproot.open("samples/hepdata-example.root")["hpxpy"].numpy()
Out[192]:
Uproot has a limited (but growing!) ability to write ROOT files. Two types currently supported are TObjString (for debugging) and histograms.
To write to a ROOT file in uproot, the file must be opened for writing using uproot.create, uproot.recreate, or uproot.update (corresponding to ROOT's "CREATE", "RECREATE", and "UPDATE" file modes). The compression level is given by uproot.ZLIB(n), uproot.LZMA(n), uproot.LZ4(n), or None.
In [193]:
file = uproot.recreate("tmp.root", compression=uproot.ZLIB(4))
Unlike objects created by uproot.open, you can assign to this file. Just as reading behaves like getting an object from a Python dict, writing behaves like putting an object into a Python dict.
Note: this is a fundamental departure from how ROOT uses names. In ROOT, a name is a part of an object that is also used for lookup. With a dict-like interface, the object need not have a name; only the lookup mechanism (e.g. ROOTDirectory) needs to manage names.
When you write objects to the ROOT file, they can be unnamed things like a Python string, but they get "stamped" with the lookup name once they go into the file.
In [194]:
file["name"] = "Some object, like a TObjString."
The object is now in the file. ROOT would be able to open this file and read the data, like this:
root [0] auto file = TFile::Open("tmp.root");
root [1] file->ls();
TFile** tmp.root
TFile* tmp.root
KEY: TObjString name;1 Collectable string class
root [2] TObjString* data;
root [3] file->GetObject("name", data);
root [4] data->GetString()
(const TString &) "Some object, like a TObjString."[31]
We can also read it back in uproot, like this:
In [195]:
file.keys()
Out[195]:
In [196]:
dict(file.classes())
Out[196]:
In [197]:
file["name"]
Out[197]:
(Notice that it lost its encoding—it is now a bytestring.)
In [198]:
histogram = uproot.open("https://scikit-hep.org/uproot/examples/histograms.root")["one"]
histogram.show()
norm = histogram.allvalues.sum()
for i in range(len(histogram)):
    histogram[i] /= norm
histogram.show()
file["normalized"] = histogram
A histogram may also be created entirely in Python.
In [199]:
import types
import uproot_methods.classes.TH1
class MyTH1(uproot_methods.classes.TH1.Methods, list):
    def __init__(self, low, high, values, title=""):
        self._fXaxis = types.SimpleNamespace()
        self._fXaxis._fNbins = len(values)
        self._fXaxis._fXmin = low
        self._fXaxis._fXmax = high
        for x in values:
            self.append(float(x))
        self._fTitle = title
        self._classname = "TH1F"
histogram = MyTH1(-5, 5, [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
file["synthetic"] = histogram
In [200]:
file["synthetic"].show()
But it is particularly useful that uproot recognizes Numpy histograms, which may have come from other libraries.
In [201]:
file["from_numpy"] = numpy.histogram(numpy.random.normal(0, 1, 10000))
In [202]:
file["from_numpy"].show()
In [203]:
file["from_numpy2d"] = numpy.histogram2d(numpy.random.normal(0, 1, 10000), numpy.random.normal(0, 1, 10000))
In [204]:
file["from_numpy2d"].numpy()
Out[204]:
Uproot can now write TTrees (documented on the main README), but the interactive tutorial has not been written.