Examples dump: StackOverflow
This is a selection of StackOverflow question/answer pairs that
- are tagged [awkward-array] OR [uproot]
- refer to modern (Awkward 1) software versions
- have significant code examples in the answers.
It's raw material to mine for documentation-writing.
URL: https://stackoverflow.com/questions/58291817/how-to-use-uproot-to-load-referenced-values-trefarray
Question:
I am trying to use uproot to do some basic selection from a Delphes .root output file. Delphes's C++ example code loops over events and accesses reconstructed BranchElements, which have methods to access the branch elements belonging to various classes.
e.g. the ROOT file contains a <TBranchElement b'Jet' at 0x7fb36526b3c8> that (in C++) Delphes example code can use to get object = jet->Constituents.At(i) in a for loop; then, if object->IsA() == Tower::Class, one calls object->P4() to get the 4-momentum.
So, whereas with uproot one can only get the two values separately, the Delphes examples use the Jet class to access the Tower class (from which the Jet was reconstructed) through a method.
The information I see is:
Jet_size (no streamer) asdtype('>i4')
Jet TStreamerInfo asdtype('>i4')
Jet.fUniqueID TStreamerBasicType asjagged(asdtype('>u4'))
.
.
.
Jet.Constituents TStreamerInfo asgenobj(SimpleArray(TRefArray))
<TBranchElement b'Jet' at 0x7fb3657825f8>
<TBranchElement b'Jet.Constituents' at 0x7fb47840cba8>
For uproot, if one loads the TBranchElement as an array, then there's only access to the array elements in Jet.Constituents[i], which are lists of numbers. How would I be able to load Jet.Constituents in a way that refers to the Tower.PT (or eta, phi, etc.) values that it contains?
Answer:
If you have an array of TRefs, you can use them directly as an integer index on the other collection. (See this tutorial, starting at In[29], for a general introduction to indexing by integer arrays, both in NumPy and Awkward Array.) That is, if you have an array of TRef, as in this example,
import uproot
t = uproot.open("issue324.root")["Delphes"]
refs = t["Track.Particle"].array()
refs.id
# <JaggedArray [
# [752 766 780 ... 1813 1367 1666]
# ...
# [745 762 783 ... 1863 1713 1717]]>
gives you the indexes and
pt = t["Particle.PT"].array()
the array you want to reference, so
pt[refs.id - 1]
# <JaggedArray [
# [0.7637838 1.1044897 5.463864 ... 4.252923 1.9702696 9.213475]
# ...
# [1.2523094 0.37887865 0.7390242 ... 1.0288503 3.4785874 1.804613]]>
selects the pt values of interest (correcting for the fact that these indexes start with 1 and Python indexes start with 0).
If you have an array of TRefArray, as in this example,
t["Tower.Particles"].array()
# <ObjectArray [[[1414, 1418, 1471, 1571], [1447], [1572],
# ...,
# [864, 1325], [992, 1437], [1262, 1501]]]>
it's actually an ObjectArray that generates sub-arrays from the data on demand (because ROOT doesn't store doubly jagged data natively). You can convert them to native JaggedArrays by calling awkward.fromiter on them:
import awkward
a = awkward.fromiter(t["Tower.Particles"].array())
# <JaggedArray [[[1414 1418 1471 1571] [1447] [1572]
# ...
# [864 1325] [992 1437] [1262 1501]]]>
and then use these doubly jagged indexes in any doubly jagged collection (in which all the numbers of elements line up, as they would for the collection you're referencing).
Question:
I am using Python version 3.6.5 and have a jagged TTree with a multi-dimensional structure. This data is spread over more than 1000 files, all with the identical TTree structure.
Suppose I have just two files, though, and I'll call them fname1.root and fname2.root.
The following code has no problem opening either of these by itself:
import uproot as upr
import awkward
import boost_histogram as bh
import math
import matplotlib.pyplot as plt
#
# define a plotting program
# def plotter(h)
#
# preparing the file location for files
pth = '/fullpathName/'
fname1 = 'File755.root'
fname2 = 'File756.root'
fileList = [pth+fname1, pth+fname2]
#
# print out the path and filename that I've made to show the user
for file in fileList:
    print(file)
print('\n')
#
# Let's make a histogram This one has 50 bins, starts at zero and ends at 1000.0
# It will be a histogram of Jet pT's.
jhist = bh.histogram(bh.axis.regular(50,0.0,1000.0))
#
#show what you've just done
print(jhist)
#
# does not work, only fills first file!
for chunk in upr.iterate(fileList,"bTag_AntiKt4EMTopoJets",["jet_pt"]):
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten()*0.001)
#
#
# what does my histogram look like?
ptHist = plt.bar(jhist.axes[0].centers, jhist.view(), width=jhist.axes[0].widths)
plt.show()
As I said, the above code works if I put only ONE file in 'fileList'.
The naive thing to do doesn't work. If I create a 'list' of files using
files = [pth+fname1, pth+fname2]
and re-run that code, I get the following error, which is very much the same error I have been getting all along.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 48, in <module>
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 116, in iterate
for tree, branchesinterp, globalentrystart, thispath, thisfile in _iterate(path, treepath, branches, awkward, localsource, xrootdsource, httpsource, **options):
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 163, in _iterate
file = uproot.rootio.open(path, localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 54, in open
return ROOTDirectory.read(openfcn(path), **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 51, in <lambda>
openfcn = lambda path: MemmapSource(path, **kwargs)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/source/memmap.py", line 21, in __init__
self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
File "/cvmfs/sft.cern.ch/lcg/views/LCG_94python3/x86_64-slc6-gcc8-opt/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory
Answer:
Lazy arrays are just an interface for convenience—i.e. you can transform them with one function call, rather than iterating in an explicit loop over chunks. Internally, lazy arrays contain an implicit loop over chunks, so if you're running out of memory one way, you would be running out of memory the other way, too.
Your problem is not closing files (they're memory-mapped, so "closing" didn't have a clear meaning—they're a view into memory that the operating system is allocating for itself, anyway)—your problem is with deleting arrays. That's the only thing that can use up all the memory on your computer.
There are a few things you can do here: one is to
for chunk in uproot.iterate(files, "bTag_AntiKt4EMTopoJets", ["jet_pt", "jet_eta"]):
# fill with chunk[b"jet_pt"] and chunk[b"jet_eta"], which correspond
# to the same sets of events, one-to-one.
to explicitly loop over the chunks ("explicit" because you see and control the loop here, and because you have to specify which branches you want to load into the dict chunk). You can control the size of the chunks with entrysteps. The other is to
cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(files, "bTag_AntiKt4EMTopoJets", cache=cache)
to keep the loop implicit. The ArrayCache will throw out chunks of arrays, so that they have to be loaded again, if it gets to the 1 GB limit. If you make that limit too small, it won't be able to hold one chunk, but if you make it too large, you'll run out of memory.
By the way, although you're reporting a memory issue, there's another major performance issue with your code: you're looping over events in Python. Instead of
events.jet_pt[i][:2]*0.001
to get the jet pT for event i, do
events.jet_pt[:, :2]*0.001
for the jet pT of all events as a single array. You might then need to .flatten() that array to fit the histogram's fill method.
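Putting the pieces together, a minimal sketch of the lazy-array route with the vectorized fill (reusing fileList and jhist from the question; the cache size is just an example):
import uproot

cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(fileList, "bTag_AntiKt4EMTopoJets", cache=cache)

# Leading two jets of every event, flattened and converted from MeV to GeV,
# filled into the histogram in one call instead of a Python loop over events.
jhist.fill(events.jet_pt[:, :2].flatten() * 0.001)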
How to get uproot.iterate() output identical to the root_numpy root2array() output with jagged arrays
Question:
I know that there is a solution for the similar question: https://stackoverflow.com/questions/58670742/how-to-get-uproot-iterate-output-like-the-root-numpy-root2array-output-fast But as I understand it, that is suitable for flat ROOT TTrees only. I want to have a generalized solution for:
- fixed-size but nested data, like particle momentum (px, py, pz), which is represented in the ROOT TTree as vector<double>
- arbitrary-size data
All my attempts to apply asjagged for it failed. Is it possible to avoid jaggedarray for case (1)?
Answer:
If the data are fixed-size but stored as vector<double>, then they're treated as though they were not fixed-size. Uproot would always read them as jagged arrays, and therefore the asarray method described in the other question is unavailable.
That said, if you have more knowledge than your file's metadata and are willing to try an unsupported hack, you can force the interpretation of your vector<double> to always have three elements. Here's an example—first, we need a suitable ROOT file:
TFile *f = new TFile("tmp.root", "recreate");
TTree *t = new TTree("t", "t");
std::vector<double> x;
t->Branch("x", &x);
x = {101, 102, 103};
t->Fill();
x = {201, 202, 203};
t->Fill();
x = {301, 302, 303};
t->Fill();
x = {401, 402, 403};
t->Fill();
x = {501, 502, 503};
t->Fill();
x = {601, 602, 603};
t->Fill();
t->Write();
f->Close();
Now, the normal way to read this in Uproot is with a JaggedArray:
>>> import uproot
>>> branch = uproot.open("tmp.root")["t"]["x"]
>>> branch.array()
<JaggedArray [[101.0 102.0 103.0] [201.0 202.0 203.0] [301.0 302.0 303.0] [401.0 402.0 403.0] [501.0 502.0 503.0] [601.0 602.0 603.0]] at 0x7f325ab4eb90>
>>> branch.interpretation
asjagged(asdtype('>f8'), 10)
But this allocates new arrays every time you read it and does some manipulation to find the borders, which you know are regular. (And Uproot does not know that.)
The header for an STL vector happens to be 10 bytes long. After that, its contents are serialized in order, from first to last, before moving on to the next STL vector. For three 8-byte floating point numbers, that's 10 + 8 + 8 + 8 = 34 bytes, always the same size. The fact that it always happens to be the same size is critical for the following.
A NumPy structured array can represent arrays of fixed-size structs as a dtype:
>>> array = branch.array(uproot.asdtype([("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")]))
>>> array
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
dtype=[('header', 'S10'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
We don't like looking at the header, so we can strip it out using ordinary NumPy slicing:
>>> array[["x", "y", "z"]]
array([(101., 102., 103.), (201., 202., 203.), (301., 302., 303.),
(401., 402., 403.), (501., 502., 503.), (601., 602., 603.)],
dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[10,18,26], 'itemsize':34})
(or just ask for array["x"] to get a non-structured array). Since we can do that with a plain uproot.asdtype, we can do it with a preallocated uproot.asarray.
>>> import numpy as np
>>> big = np.empty(1000, dtype=[("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")])
>>> branch.array(uproot.asarray(big.dtype, big))
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
dtype=[('header', 'S10'), ('x', '>f8'), ('y', '>f8'), ('z', '>f8')])
Above, big has to be at least as large as the dataset you want to read, and reading it returns a trimmed view of that array. It does not allocate a new array—the data are stored in big:
>>> big[["x", "y", "z"]][:10]
array([( 1.01000000e+002, 1.02000000e+002, 1.03000000e+002),
( 2.01000000e+002, 2.02000000e+002, 2.03000000e+002),
( 3.01000000e+002, 3.02000000e+002, 3.03000000e+002),
( 4.01000000e+002, 4.02000000e+002, 4.03000000e+002),
( 5.01000000e+002, 5.02000000e+002, 5.03000000e+002),
( 6.01000000e+002, 6.02000000e+002, 6.03000000e+002),
( 1.22164945e-309, 5.26335088e-310, 1.22167067e-309),
(-5.70498984e+158, 5.97958175e+158, -5.97958175e+158),
(-4.92033505e+032, -4.92033505e+032, -4.92033505e+032),
( 3.77957352e+011, 3.77957221e+011, 3.77957320e+011)],
dtype={'names':['x','y','z'], 'formats':['>f8','>f8','>f8'], 'offsets':[10,18,26], 'itemsize':34})
Everything beyond the 6 events that were read is uninitialized junk, so I'd recommend using the trimmed output from the branch.array function; this is just to show that big is being filled—you're not getting a new array.
As per your question (2), if the data are not regular (in number of bytes per entry), then there's no way you can do this. Also, remember that the above technique is not officially supported: you have to know that your data are regular and you have to know that STL vectors have a 10-byte header, which are not things we expect most users to know.
URL: https://stackoverflow.com/questions/61273669/copy-ttree-from-file-into-a-new-file-with-uproot
Question:
I'm new to uproot and I am trying to achieve a fairly simple task, but I'm not sure how to do this. Essentially, I have a root file that contains a bunch of histograms and one TTree that is made up of 8 branches for roughly 4 million entries.
What I need to do is make a new root file and copy 80% of the TTree from the original file into a TTree (called training) and the remaining 20% into a second TTree in the same new file (called test).
What I have tried is making a directory in python into which I read all of the data from the original file branch by branch. I then used this directory to write the data into the two new TTrees.
This is kind of working—I am getting a file with the structure that I wanted—but I'm not entirely satisfied, for two reasons:
- Surely there has to be a more direct way? First reading the data into python and then writing it into a file seems extremely cumbersome and memory intensive.
- I am honestly not very experienced with root, but from the way I understand it, in my original file, I have a tree that contains my 4 million events. Each event has a value for each branch, so when I say, 'get me entry 555!', I get 8 values (1 for each branch). If I just copy the branches the way I am doing, do I lose this structure, or does the index of all the arrays in my directory replace the entry number? So, grabbing the values from all arrays at index 555 would be the same as returning entry 555 before?
Any help would be welcome. Thanks!
Answer:
This task would always involve reading into memory and writing back out, whether those arrays are in the user's control or hidden.
There's one possible exception, if you want to read TBaskets from one file and write them to another without decompressing them—then they're still in memory but not decompressed and that can be a performance boost. ROOT can do this as a "fast copy," but Uproot doesn't have an equivalent. Such a copy would require that you don't want to modify the data in the TBaskets in any way, including slicing at arbitrary event boundaries, which can be an issue if the TBaskets for the 8 TBranches you're interested in don't line up at common event boundaries. (Such a feature could be added to Uproot—there's no technical limitation, but this feature is only useful in certain cases.)
So the process of reading arrays out of one file and writing them into another is about as good as it gets, with the above caveat.
I am unsure what you mean by a "directory in Python."
To answer your second question, the arrays that are read out of a TTree are aligned in the sense that entry 555 of one TBranch belongs to the same event as entry 555 of another TBranch. This is a common way of working with sets of arrays in NumPy, though it's an uncommon way of working with ROOT data; in ROOT, an event is an object, or at least you don't see more than one object at a time.
If you have memory issues (probably not with 8 TBranches × 4 million events, not jagged, = 244 MB of RAM if double precision), then you can consider iterating:
numtraining = int(0.8*ttree.numentries)
numtest = ttree.numentries - numtraining
for chunk in ttree.iterate("*", entrysteps="1 GB", entrystop=numtraining):
    training.extend(chunk)
for chunk in ttree.iterate("*", entrysteps="1 GB", entrystart=numtraining):
    test.extend(chunk)
This gives you control over the size of your output TBaskets, since each TBranch gets one TBasket per call to extend. The above example ensures that a set of TBranches that all have to be used together collectively consist of at most 1 GB.
Unlike a "fast copy" (see above), what you're doing is not just copying, it's also repartitioning the data, which can improve performance when you read those output files. Generally, larger chunks (larger TBaskets) are faster to read, but too large and they can require too much memory.
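For completeness, a rough end-to-end sketch of the read-iterate-write pattern (the file names, tree name, and branch dtypes are placeholders to adapt; this assumes Uproot 3, which has TTree-writing support):
import uproot3
import numpy as np

infile = uproot3.open("original.root")
ttree = infile["mytree"]                       # placeholder input tree name

# Placeholder flat branches; substitute your 8 branch names and dtypes.
branch_types = {"px": np.float64, "py": np.float64}

outfile = uproot3.recreate("split.root")
outfile["training"] = uproot3.newtree(branch_types)
outfile["test"] = uproot3.newtree(branch_types)

numtraining = int(0.8 * ttree.numentries)

# One TBasket per TBranch per extend call, at most 1 GB of branches at a time.
for chunk in ttree.iterate(list(branch_types), entrysteps="1 GB",
                           entrystop=numtraining, namedecode="utf-8"):
    outfile["training"].extend(chunk)

for chunk in ttree.iterate(list(branch_types), entrysteps="1 GB",
                           entrystart=numtraining, namedecode="utf-8"):
    outfile["test"].extend(chunk)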
URL: https://stackoverflow.com/questions/64050783/plotting-with-different-length-of-jagged-arrays
Question:
I have a problem when trying to plot a 2D histogram or graph from jagged arrays of different lengths.
Here is a simple example. Suppose there are 7 events of gen-level pT and its Et.
pT = [ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ]
Et = [ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ]
Here, some events (2nd, 5th) have two y values corresponding to one x value. I want to make a graph or 2D histogram putting x = pT and y = Et, and include both points, i.e. (31.7, 25.5) and (31.7, 20).
How can I align these values for plotting?
Answer:
What you want to do is "broadcast" the two arrays:
Awkward broadcasting is a generalization of NumPy broadcasting to include variable-length lists.
Broadcasting usually happens automatically when you're performing a mathematical calculation:
>>> import awkward1 as ak
>>> ak.Array([[1, 2, 3], [], [4, 5]]) + ak.Array([100, 200, 300])
<Array [[101, 102, 103], [], [304, 305]] type='3 * var * int64'>
but you can also do it manually:
>>> ak.broadcast_arrays(ak.Array([[1, 2, 3], [], [4, 5]]),
... ak.Array([100, 200, 300]))
[<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>,
<Array [[100, 100, 100], [], [300, 300]] type='3 * var * int64'>]
When the two arrays have different depths (different "dimensions" in NumPy terminology), scalars from one are replicated to align with all elements of lists in the other.
You have two lists of the same depth:
>>> pT = ak.Array([ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ])
>>> Et = ak.Array([ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ])
To manually broadcast them, you could reduce the depth of pT by taking the first element from each list.
>>> pT[:, 0]
<Array [46.8, 31.7, 21, ... 13.9, 41.2, 15.7] type='7 * float64'>
Then you can broadcast each scalar of pT into each list of Et.
>>> ak.broadcast_arrays(pT[:, 0], Et)
[<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
<Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]
This will be clearer if I print them in their entirety by turning them into Python lists:
>>> pT_broadcasted, Et = ak.broadcast_arrays(pT[:, 0], Et)
>>> pT_broadcasted.tolist()
[[46.8], [31.7, 31.7], [21.0], [29.9], [13.9, 13.9], [41.2], [15.7]]
>>> Et.tolist()
[[41.4], [25.5, 20.0], [19.6], [27.4], [12.0, 3.47], [37.8], [10.0]]
Now you see that the 31.7 has been duplicated to align with each value in [25.5, 20.0].
In NumPy, you'll often see examples of broadcasting a dimension of length 1, rather than creating a dimension, like this:
>>> import numpy as np
>>> np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + np.array([[100], [200], [300]])
array([[101, 102, 103],
[204, 205, 206],
[307, 308, 309]])
Awkward Array follows this rule, but only if the dimension has length "exactly 1," not "a bunch of variable-length lists that happen to each have length 1." The way I've written pT, it has the latter:
>>> ak.type(pT) # 7 lists with variable length
7 * var * float64
>>> ak.num(pT) # they happen to each have length 1... this time...
<Array [1, 1, 1, 1, 1, 1, 1] type='7 * int64'>
Since these lists are in-principle variable, they don't broadcast the way that length-1 NumPy arrays would.
>>> ak.broadcast_arrays(pT, Et)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jpivarski/irishep/awkward-1.0/awkward1/operations/structure.py", line 699, in broadcast_arrays
out = awkward1._util.broadcast_and_apply(inputs, getfunction, behavior)
File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 972, in broadcast_and_apply
out = apply(broadcast_pack(inputs, isscalar), 0)
File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 745, in apply
outcontent = apply(nextinputs, depth + 1)
File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 786, in apply
nextinputs.append(x.broadcast_tooffsets64(offsets).content)
ValueError: in ListOffsetArray64, cannot broadcast nested list
(https://github.com/scikit-hep/awkward-1.0/blob/0.3.2/src/cpu-kernels/operations.cpp#L778)
If you explicitly cast the array as NumPy, it will have regular types. (Note to self: it would be nice to have a way to turn a variable-length dimension regular or vice-versa without converting the whole array to NumPy.)
>>> ak.type(pT)
7 * var * float64
>>> ak.type(ak.to_numpy(pT))
7 * 1 * float64
So an alternate way to get the same broadcasting is to convert pT to NumPy, instead of picking out the first element of each list with pT[:, 0].
>>> ak.broadcast_arrays(ak.to_numpy(pT), Et)
[<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
<Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]
Either way, an assumption is being made that pT consists of lists of length 1. The pT[:, 0] expression assumes this because it requires something to have index 0 in each list (so the length is at least 1) and it ignores whatever else might be there. The ak.to_numpy(pT) expression will raise an exception if the pT array doesn't happen to be regular, a shape that can be expressed in NumPy.
Now that you have pT_broadcasted and Et aligned with the same structure, you'll have to flatten them both to pass them to a plotting routine (which expects non-jagged data).
>>> ak.flatten(pT_broadcasted), ak.flatten(Et)
(<Array [46.8, 31.7, 31.7, ... 13.9, 41.2, 15.7] type='9 * float64'>,
<Array [41.4, 25.5, 20, ... 3.47, 37.8, 10] type='9 * float64'>)
The plotting routine will probably try np.asarray on each of these, which is identical to ak.to_numpy and will work because these flattened arrays are regular. If you have doubly jagged data or something more complex, you'd have to flatten more.
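For instance, a minimal plotting sketch (assuming matplotlib and the flattened arrays above; the binning is arbitrary):
import numpy as np
import matplotlib.pyplot as plt

# The flattened, aligned arrays are regular, so np.asarray works on them.
plt.hist2d(np.asarray(ak.flatten(pT_broadcasted)),
           np.asarray(ak.flatten(Et)),
           bins=(10, 10))
plt.xlabel("pT")
plt.ylabel("Et")
plt.show()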
URL: https://stackoverflow.com/questions/64308530/removing-none-from-an-awkward-array
Question:
I have an awkward array (Awkward 1) which I obtained from post-processing. The array looks like:
>>> ak.Array([96., 99., 67., 13., 3., None, 1., 1.,None])
I want to remove the None elements from this array. I could remove them using a loop, but I want to avoid that to save some computing time. Or is writing a function and compiling it with Numba the only option?
Thanks.
Answer:
I just realised that ak.is_none exists and works like a charm:
a[~ak.is_none(a)]
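A minimal self-contained sketch of the same thing (assuming Awkward 1 imported as ak):
import awkward1 as ak

a = ak.Array([96., 99., 67., 13., 3., None, 1., 1., None])
a[~ak.is_none(a)]   # the None entries are filtered out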
URL: https://stackoverflow.com/questions/64813971/reshape-array-in-array-in-array
Question:
I have a root file that I open with 2000 entries and a variable number of subentries, and each column is a different variable. Let's say I am only interested in 5 of those. I want to put them in an array with np.shape(array) = (2000, 250, 5). The 250 is plenty to contain all subentries per entry.
The root file is converted into a dictionary by uproot: DATA = [variablename: [array of entries [array of subentries]]]
I create an array np.zeros((2000, 250, 5)) and fill it with the data I want, but it takes about 500 ms and I need a solution that scales, as I aim for 1 million entries later on. I found multiple solutions, but my best was about 300 ms:
lim_i=len(N_DATA["nTrack"])
i=0
INPUT_ARRAY=np.zeros((lim_i,500,5))
for l in range(len(INPUT_ARRAY)):
    while i < lim_i:
        EVENT=np.zeros((500,5))
        k=0
        lim_k=len(TRACK_DATA["Track_pt"][i])
        while k<lim_k:
            EVENT[k][0]=TRACK_DATA["Track_pt"][i][k]
            EVENT[k][1]=TRACK_DATA["Track_phi"][i][k]
            EVENT[k][2]=TRACK_DATA["Track_eta"][i][k]
            EVENT[k][3]=TRACK_DATA["Track_dxy"][i][k]
            EVENT[k][4]=TRACK_DATA["Track_charge"][i][k]
            k+=1
        INPUT_ARRAY[i]=EVENT
        i+=1
INPUT_ARRAY
Answer:
Observation 1: we can assign directly to the appropriate sub-arrays of INPUT_ARRAY[i], instead of creating EVENT as a proxy for INPUT_ARRAY[i] and then copying that in. (I will also set your variable names in lowercase, to follow normal conventions.)
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_data["Track_pt"][i][k]
            input_array[i][k][1] = track_data["Track_phi"][i][k]
            input_array[i][k][2] = track_data["Track_eta"][i][k]
            input_array[i][k][3] = track_data["Track_dxy"][i][k]
            input_array[i][k][4] = track_data["Track_charge"][i][k]
            k += 1
        i += 1
Observation 2: the assignments we make in the innermost loop have the same basic structure. It would be nice if we could take the various entries of the TRACK_DATA dict (which are 2-dimensional data) and stack them together. Numpy has a convenient (and efficient) built-in for stacking 2-dimensional data along the third dimension: np.dstack. Having prepared that 3-dimensional array, we can just copy in from it mechanically:
track_array = np.dstack((
track_data['Track_pt'],
track_data['Track_phi'],
track_data['Track_eta'],
track_data['Track_dxy'],
track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_array[i][k][0]
            input_array[i][k][1] = track_array[i][k][1]
            input_array[i][k][2] = track_array[i][k][2]
            input_array[i][k][3] = track_array[i][k][3]
            input_array[i][k][4] = track_array[i][k][4]
            k += 1
        i += 1
Observation 3: but now, the purpose of our innermost loop is simply to copy an entire chunk of track_array along the last dimension. We could just do that directly:
track_array = np.dstack((
track_data['Track_pt'],
track_data['Track_phi'],
track_data['Track_eta'],
track_data['Track_dxy'],
track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k] = track_array[i][k]
            k += 1
        i += 1
Observation 4: But actually, the same reasoning applies to the other two dimensions of the array. Clearly, our intent is to copy the entire array produced by the dstack; and that is already a new array, so we could just use it directly.
input_array = np.dstack((
track_data['Track_pt'],
track_data['Track_phi'],
track_data['Track_eta'],
track_data['Track_dxy'],
track_data['Track_charge']
))
Another Answer:
Taking note of Karl Knechtel's second comment, "You should avoid explicitly iterating over Numpy arrays yourself (there is practically guaranteed to be a built-in Numpy thing that just does what you want, and probably much faster than native Python can)," there is a way to do this with array-at-a-time programming, but not in NumPy. The reason Uproot returns Awkward Arrays is that you need a way to deal with variable-length data efficiently.
I don't have your file, but I'll start with a similar one:
>>> import uproot4
>>> import skhep_testdata
>>> events = uproot4.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
The branches that start with "Muon_" in this file have the same variable-length structure as in your tracks. (The C++ typename is a dynamically sized array, interpreted in Python "as jagged.")
>>> events.show(filter_name="Muon_*")
name | typename | interpretation
---------------------+--------------------------+-------------------------------
Muon_Px | float[] | AsJagged(AsDtype('>f4'))
Muon_Py | float[] | AsJagged(AsDtype('>f4'))
Muon_Pz | float[] | AsJagged(AsDtype('>f4'))
Muon_E | float[] | AsJagged(AsDtype('>f4'))
Muon_Charge | int32_t[] | AsJagged(AsDtype('>i4'))
Muon_Iso | float[] | AsJagged(AsDtype('>f4'))
If you just ask for these arrays, you get them as an Awkward Array.
>>> muons = events.arrays(filter_name="Muon_*")
>>> muons
<Array [{Muon_Px: [-52.9, 37.7, ... 0]}] type='2421 * {"Muon_Px": var * float32,...'>
To put them to better use, let's import Awkward Array and start by asking for its type.
>>> import awkward1 as ak
>>> ak.type(muons)
2421 * {"Muon_Px": var * float32, "Muon_Py": var * float32, "Muon_Pz": var * float32, "Muon_E": var * float32, "Muon_Charge": var * int32, "Muon_Iso": var * float32}
What does this mean? It means you have 2421 records with fields named "Muon_Px", etc., that each contain variable-length lists of float32 or int32, depending on the field. We can look at one of them by converting it to Python lists and dicts.
>>> muons[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582],
'Muon_Py': [-11.654671669006348, 0.6934735774993896],
'Muon_Pz': [-8.16079330444336, -11.307581901550293],
'Muon_E': [54.77949905395508, 39.401695251464844],
'Muon_Charge': [1, -1],
'Muon_Iso': [4.200153350830078, 2.1510612964630127]}
(You could have made these lists of records, rather than records of lists, by passing how="zip" to TTree.arrays or using ak.unzip and ak.zip in Awkward Array, but that's tangential to the padding that you want to do.)
The problem is that the lists have different lengths. NumPy doesn't have any functions that will help us here because it deals entirely in rectilinear arrays. Therefore, we need a function that's specific to Awkward Array, ak.num.
>>> ak.num(muons)
<Array [{Muon_Px: 2, ... Muon_Iso: 1}] type='2421 * {"Muon_Px": int64, "Muon_Py"...'>
This is telling us the number of elements in each list, per field. For clarity, look at the first one:
>>> ak.num(muons)[0].tolist()
{'Muon_Px': 2, 'Muon_Py': 2, 'Muon_Pz': 2, 'Muon_E': 2, 'Muon_Charge': 2, 'Muon_Iso': 2}
You want to turn these irregular lists into regular lists that all have the same size. That's called "padding." Again, there's a function for that, but we first need to get the maximum number of elements, so that we know how much to pad it by.
>>> ak.max(ak.num(muons))
4
So let's make them all length 4.
>>> ak.pad_none(muons, ak.max(ak.num(muons)))
<Array [{Muon_Px: [-52.9, 37.7, ... None]}] type='2421 * {"Muon_Px": var * ?floa...'>
Again, let's look at the first one to understand what we have.
{'Muon_Px': [-52.89945602416992, 37.7377815246582, None, None],
'Muon_Py': [-11.654671669006348, 0.6934735774993896, None, None],
'Muon_Pz': [-8.16079330444336, -11.307581901550293, None, None],
'Muon_E': [54.77949905395508, 39.401695251464844, None, None],
'Muon_Charge': [1, -1, None, None],
'Muon_Iso': [4.200153350830078, 2.1510612964630127, None, None]}
You wanted to pad them with zeros, not None, so we convert the missing values into zeros.
>>> ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0)[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, 0.0, 0.0],
'Muon_Py': [-11.654671669006348, 0.6934735774993896, 0.0, 0.0],
'Muon_Pz': [-8.16079330444336, -11.307581901550293, 0.0, 0.0],
'Muon_E': [54.77949905395508, 39.401695251464844, 0.0, 0.0],
'Muon_Charge': [1, -1, 0, 0],
'Muon_Iso': [4.200153350830078, 2.1510612964630127, 0.0, 0.0]}
Finally, NumPy doesn't have records (other than the structured array, which also implies that the columns are contiguous in memory; Awkward Array's "records" are abstract). So let's unzip what we have into six separate arrays.
>>> arrays = ak.unzip(ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0))
>>> arrays
(<Array [[-52.9, 37.7, 0, 0, ... 23.9, 0, 0, 0]] type='2421 * var * float64'>,
<Array [[-11.7, 0.693, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
<Array [[-8.16, -11.3, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
<Array [[54.8, 39.4, 0, 0], ... 69.6, 0, 0, 0]] type='2421 * var * float64'>,
<Array [[1, -1, 0, 0], ... [-1, 0, 0, 0]] type='2421 * var * int64'>,
<Array [[4.2, 2.15, 0, 0], ... [0, 0, 0, 0]] type='2421 * var * float64'>)
Note that this one line does everything from the initial data-pull from Uproot (muons). I'm not going to profile it now, but you'll find that this one line is considerably faster than explicit looping.
Now what we have is semantically equivalent to six NumPy arrays, so we'll just cast them as NumPy. (Attempts to do so with irregular data would fail. You have to explicitly pad the data.)
>>> numpy_arrays = [ak.to_numpy(x) for x in arrays]
>>> numpy_arrays
[array([[-52.89945602, 37.73778152, 0. , 0. ],
[ -0.81645936, 0. , 0. , 0. ],
[ 48.98783112, 0.82756668, 0. , 0. ],
...,
[-29.75678635, 0. , 0. , 0. ],
[ 1.14186978, 0. , 0. , 0. ],
[ 23.9132061 , 0. , 0. , 0. ]]),
array([[-11.65467167, 0.69347358, 0. , 0. ],
[-24.40425873, 0. , 0. , 0. ],
[-21.72313881, 29.8005085 , 0. , 0. ],
...,
[-15.30385876, 0. , 0. , 0. ],
[ 63.60956955, 0. , 0. , 0. ],
[-35.66507721, 0. , 0. , 0. ]]),
array([[ -8.1607933 , -11.3075819 , 0. , 0. ],
[ 20.19996834, 0. , 0. , 0. ],
[ 11.16828537, 36.96519089, 0. , 0. ],
...,
[-52.66374969, 0. , 0. , 0. ],
[162.17631531, 0. , 0. , 0. ],
[ 54.71943665, 0. , 0. , 0. ]]),
array([[ 54.77949905, 39.40169525, 0. , 0. ],
[ 31.69044495, 0. , 0. , 0. ],
[ 54.73978806, 47.48885727, 0. , 0. ],
...,
[ 62.39516068, 0. , 0. , 0. ],
[174.20863342, 0. , 0. , 0. ],
[ 69.55621338, 0. , 0. , 0. ]]),
array([[ 1, -1, 0, 0],
[ 1, 0, 0, 0],
[ 1, -1, 0, 0],
...,
[-1, 0, 0, 0],
[-1, 0, 0, 0],
[-1, 0, 0, 0]]),
array([[4.20015335, 2.1510613 , 0. , 0. ],
[2.18804741, 0. , 0. , 0. ],
[1.41282165, 3.38350415, 0. , 0. ],
...,
[3.76294518, 0. , 0. , 0. ],
[0.55081069, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]])]
And now NumPy's dstack is appropriate. (This is making them contiguous in memory, so you could use NumPy's structured arrays if you want to. I would find that easier for keeping track of which index means which variable, but that's up to you. Actually, Xarray is particularly good at tracking metadata of rectilinear arrays.)
>>> import numpy as np
>>> np.dstack(numpy_arrays)
array([[[-52.89945602, -11.65467167, -8.1607933 , 54.77949905,
1. , 4.20015335],
[ 37.73778152, 0.69347358, -11.3075819 , 39.40169525,
-1. , 2.1510613 ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ -0.81645936, -24.40425873, 20.19996834, 31.69044495,
1. , 2.18804741],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 48.98783112, -21.72313881, 11.16828537, 54.73978806,
1. , 1.41282165],
[ 0.82756668, 29.8005085 , 36.96519089, 47.48885727,
-1. , 3.38350415],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
...,
[[-29.75678635, -15.30385876, -52.66374969, 62.39516068,
-1. , 3.76294518],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 1.14186978, 63.60956955, 162.17631531, 174.20863342,
-1. , 0.55081069],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 23.9132061 , -35.66507721, 54.71943665, 69.55621338,
-1. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]]])
URL: https://stackoverflow.com/questions/64949194/implementing-numpy-linalg-norm-for-awkward-arrays
Question:
At the moment, no implementation exists for numpy.linalg.norm with awkward arrays.
In awkward1 I can use behavior to overwrite (or add?) implementations as shown in the tutorial. In my use case I have successfully implemented versions of np.absolute and np.subtract for TVector2 and TVector3 using the tutorial example. However, numpy.linalg.norm fails.
How can I define a behavior for awkward arrays? E.g. in the example below I tried this with a simple int64 array:
import awkward1 as ak
import numpy as np
def norm(data, *args, **kwargs):
    return np.absolute(data)
ak.behavior[np.linalg.norm, "int64"] = norm
sample = ak.Array(np.arange(10))
print(ak.type(sample)) # 10 * int64
print(np.absolute(sample)) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(np.linalg.norm(np.arange(10))) # 16.881943016134134
print(np.linalg.norm(sample)) # raises TypeError
Answer:
If you haven't seen https://awkward-array.readthedocs.io/en/latest/ak.behavior.html , it's the place to start.
If np.linalg.norm is a ufunc, the following should work:
ak.behavior[np.linalg.norm, "vector2"] = lambda a: np.sqrt(a.x**2 +a.y**2)
where the data are in records with fields x and y that are named "vector2". You can make such a thing with
ak.zip({"x": x_array, "y": y_array}, with_name="vector2")
(Caveat: I haven't tested this—I wrote it on my phone. I'll test it later and edit this answer, or you might be able to edit it.)
If it's not a ufunc, then there isn't a publicly accessible (i.e. no underscores) way to define it, and perhaps there should be.
Also, be aware of the https://github.com/scikit-hep/vector project, which is defining operations like this.
Edit: Looking at the np.linalg.norm documentation, you're right, it's not a ufunc. (isinstance(np.linalg.norm, np.ufunc) is False.)
Also, this function makes some strong assumptions about its input. Two-dimensional arrays are interpreted as matrices, rather than lists of vectors. In Awkward Array, such distinctions would be labeled by metadata. I don't think we want to implement a function like this, since it could wreak havoc on jagged arrays that are supposed to be lists of different length vectors.
I think what you want is to add vector behaviors, like this:
>>> class Vector2(ak.Record):
...     @property
...     def norm(self):
...         return np.sqrt(self.x**2 + self.y**2)
...
>>> class ArrayOfVector2(ak.Array):
...     @property
...     def norm(self):
...         return np.sqrt(self.x**2 + self.y**2)
...
>>> ak.behavior["vector2"] = Vector2
>>> ak.behavior["*", "vector2"] = ArrayOfVector2
This just says that if you have a record that is named "vector2", then it should be presented as the Python class Vector2, and if you have an array (any depth) of "vector2" records, they should collectively be presented as an ArrayOfVector2 class.
>>> x_array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> y_array = ak.Array([[1.0, 2.0, 3.0], [], [4.0, 5.0]])
>>> vectors = ak.zip({"x": x_array, "y": y_array}, with_name="vector2")
>>> vectors[0, 0]
<Vector2 {x: 1.1, y: 1} type='vector2["x": float64, "y": float64]'>
>>> vectors
<ArrayOfVector2 [[{x: 1.1, y: 1}, ... y: 5}]] type='3 * var * vector2["x...'>
That lets you use the methods and properties you've defined:
>>> vectors[0, 0].norm
1.4866068747318506
>>> vectors.norm
<Array [[1.49, 2.97, 4.46], ... [5.95, 7.43]] type='3 * var * float64'>
But also, this is exactly what the https://github.com/scikit-hep/vector project is doing. Keep an eye on it!
URL: https://stackoverflow.com/questions/65616956/error-when-using-an-awkward-array-with-an-index-array
Question:
I currently have a list of values and an awkward array of integer values. I want an awkward array with the same structure, but where the entries are the elements of the "values" array at the indices given by the integer values of the awkward array. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!
Answer:
If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
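If building that jagged array without the per-element Python loop matters, one possible sketch (using ak.unflatten on a single concatenated NumPy array; the variable names are made up):
import numpy as np
import awkward1 as ak

part1 = np.random.randint(0, 100, 33)
part2 = np.random.randint(0, 100, 125)

# One flat NumPy array plus the list lengths; no per-element Python iteration.
arr = ak.unflatten(np.concatenate([part1, part2]), [len(part1), len(part2)])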
On your main question, the reason that arr doesn't slice values is that arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).
Question:
I have a dictionary with integer keys and float values. I also have a 2D awkward array with integer entries (I'm using awkward1). I want to replace these integers with the corresponding floats according to the dictionary, keeping the awkward array format.
Assuming the keys run from 0 to 999, my solution so far is something like this:
resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,1000):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)
Is there a faster way to do this?
Update
Minimal reproducible example of my working code:
import awkward as ak # Awkward 1
myArray = ak.from_iter([[0, 1], [2, 1, 0]]) # Creating example array
myDict = {0: 19.5, 1: 34.1, 2: 10.9}
resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,3):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)
myArray:
<Array [[0, 1], [2, 1, 0]] type='2 * var * int64'>
resultArray:
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
Answer:
When I mentioned in a comment that np.searchsorted is where you should be looking, I hadn't noticed that myDict includes every consecutive integer as a key. Having a dense lookup table like this would allow faster algorithms, which also happen to be simpler in Awkward Array.
So, assuming that there's a key in myDict for each integer from 0 up to some value, you can equally well represent the lookup table as
>>> lookup = ak.Array([myDict[i] for i in range(len(myDict))])
>>> lookup
<Array [19.5, 34.1, 10.9] type='3 * float64'>
The problem of picking values at 0, 1, and 2 becomes just an array-slice. (This array-slice is an O(n) algorithm for array length n, unlike np.searchsorted, which would be O(n log n). That's the cost of having sparse lookup keys.)
The problem, however, is that myArray is nested and lookup is not. We can give lookup the same depth as myArray by slicing it up:
>>> multilookup = lookup[np.newaxis][np.zeros(len(myArray), np.int64)]
>>> multilookup
<Array [[19.5, 34.1, 10.9, ... 34.1, 10.9]] type='2 * 3 * float64'>
>>> multilookup.tolist()
[[19.5, 34.1, 10.9], [19.5, 34.1, 10.9]]
And then multilookup[myArray] is exactly what you want:
>>> multilookup[myArray]
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
The lookup had to be duplicated because each list within myArray uses global indexes in the whole lookup. If the memory involved in creating multilookup is prohibitive, you could instead break myArray down to match it:
>>> flattened, num = ak.flatten(myArray), ak.num(myArray)
>>> flattened
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> num
<Array [2, 3] type='2 * int64'>
>>> lookup[flattened]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(lookup[flattened], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
If your keys are not dense from 0 up to some integer, then you'll have to use np.searchsorted:
>>> keys = ak.Array(myDict.keys())
>>> values = ak.Array([myDict[key] for key in keys])
>>> keys
<Array [0, 1, 2] type='3 * int64'>
>>> values
<Array [19.5, 34.1, 10.9] type='3 * float64'>
In this case, the keys are trivial because it is dense. When using np.searchsorted, you have to explicitly cast the flat Awkward Arrays as NumPy (for now; we're looking to fix that).
>>> lookup_index = np.searchsorted(np.asarray(keys), np.asarray(flattened), side="left")
>>> lookup_index
array([0, 1, 2, 1, 0])
Then we pass it through the trivial keys (which doesn't change it, in this case) before passing it to the values.
>>> keys[lookup_index]
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> values[keys[lookup_index]]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(values[keys[lookup_index]], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
But the thing I was waffling about in yesterday's comment was that you have to do this on the flattened form of myArray (flattened) and reintroduce the structure later with ak.unflatten, as above. But perhaps we should wrap np.searchsorted as ak.searchsorted to recognize a fully structured Awkward Array in the second argument, at least. (It has to be unstructured to be in the first argument.)
URL: https://stackoverflow.com/questions/64282188/while-loop-equivalent-function-snippet-in-awkward-array
Question:
I have a function which I would like to convert so that I can use it with Awkward Array 1.
The following function used to work for floats, but not for awkward arrays, for the known reasons.
def Phi_mpi_pi(x):
    kPI=(3.14159265)
    kTWOPI = 2 * kPI
    #while ((x.any() >= kPI).any()): x = x - kTWOPI;
    #while ((x.any() < -kPI).any()): x = x + kTWOPI;
    while ((x >= kPI)): x = x - kTWOPI;
    while ((x < -kPI)): x = x + kTWOPI;
    return x;
I tried to convert it into a numpy/awkward-compatible form, and the new function looks like
def Phi_mpi_pi(x):
    kPI=numpy.array(3.14159265)
    kPI = kPI.repeat(len(x))
    kTWOPI = 2 * kPI
    while ((x >= kPI)): x = x - kTWOPI;
    while ((x < -kPI)): x = x + kTWOPI;
    return x;
This function remains stuck in the while loop forever; I couldn't find a way to debug it.
The task of the function is to keep the values in an awkward array between ±kPI, but this logic does not give the desired results.
e.g.
x=ak.Array([[0.7999999999999998, 1.0, -1.3], [], [-1.4], [-1.8000000000000003, -6.1000000000000005, -1.6000000000000005], [-4.6]])
However, (x <= -kPI) does give the desired output:
>>> ak.to_list(x <= -kPI)
[[False, False, False], [], [False], [False, True, False], [True]]
but the function does not. The desired output should be between ±kPI, based on the logic of the while loop. Is there something straightforward, or a suggestion, that I can use?
Answer:
Okay, got it. You want to adjust every single scalar value in x (which are all angles) to be between -π and π.
You can do it like this:
def Phi_mpi_pi(x):
    y = numpy.add(x, numpy.pi)
    y = numpy.mod(y, 2*numpy.pi)
    y = numpy.subtract(y, numpy.pi)
    return y
Or, more terse and much less readable:
def Phi_mpi_pi(x):
    return numpy.subtract(numpy.mod(numpy.add(x, numpy.pi), 2*numpy.pi), numpy.pi)
What it does is this:
- Add π to all angles (so they point in the opposite direction).
- Take modulo 2π on all angles so they are all from 0 to 2π (2π not included).
- Subtract π from all the angles again (so they point in the right direction again). Now they are all from -π to +π (+π not included).
Test:
x = ak.Array([[0.3, 3.1, numpy.pi, -numpy.pi, -4 * numpy.pi,
200 * numpy.pi, 2 * numpy.pi, -400 * numpy.pi],
[], [-1.4], [-1.8, -6, -1.6], [-4.6]])
y = Phi_mpi_pi(x)
print("Type of result:", type(y))
print("Result =", y)
# Check that the resulting range of each value is correct.
range_is_correct = (ak.all(y >= -numpy.pi) and ak.all(y < numpy.pi))
# Calculate the factors of the 2π adjustments.
factors = (x - y) / (2 * numpy.pi)
print("2π factors=", factors)
# Test that all factors of the 2π adjustments are approximate integers.
adjustments_are_correct = ak.all(numpy.abs(numpy.mod(factors, 1)) < 1e-15)
# Test that all values are correct, given the test input.
print("Result is correct:", range_is_correct and adjustments_are_correct)
gives this output:
Type of result: <class 'awkward1.highlevel.Array'>
Result = [[0.3, 3.1, -3.14, -3.14, 0, 1.78e-14, 0, 4.62e-14, ... [-1.8, 0.283, -1.6], [1.68]]
2π factors= [[2.65e-17, 7.07e-17, 1, 0, -2, 100, 1, -200], [], [0], [0, -1, 0], [-1]]
Result is correct: True
which proves that the operation was correctly performed with the specifically used test data.
Was that what you wanted?
URL: https://stackoverflow.com/questions/64392308/cartesian-cross-products-and-np-unique
Question:
Depending on your view of my approach, this is either a question about using np.unique() on awkward1 arrays or a call for a better approach:
Let a and b be two awkward1 arrays of the same outer length (number of events) but different inner lengths. For example:
a = [[1, 2], [3] , [] , [4, 5, 6]]
b = [[7] , [3, 5], [6], [8, 9]]
Let f: (x, y) -> z be a function that acts on two numbers x and y and results in the number z. For example:
f(x, y):= y - x
The idea is to compare every element in a with every element in b via f for each event, and filter out the matches of a and b pairs that survive some cut applied to f. For example:
f(x, y) < 4
My approach for this is:
a = ak.from_iter(a)
b = ak.from_iter(b)
c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
i = ak.argcartesian({'x':a, 'y':b})
#i= [[{'x': 0, 'y': 0}, {'x': 1, 'y': 0}], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}, {'x': 1, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]
diff = c['y'] - c['x']
#diff= [[6, 5], [0, 2], [], [4, 5, 3, 4, 2, 3]]
cut = diff < 4
#cut= [[False, False], [True, True], [], [False, False, True, False, True, True]]
new = c[cut]
#new= [[], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
new_i = i[cut]
#new_i= [[], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 1, 'y': 0}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]
It is possible that pairs with the same element from a but different elements from b survive the cut (e.g. {'x': 3, 'y': 3} and {'x': 3, 'y': 5}).
My goal is to group those pairs with the same element from a together and therefore reshape the new array into:
new = [[], [{'x': 3, 'y': [3, 5]}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': [8, 9]}]]
My only idea how to achieve this is to create a list of the indexes from a that are still present after the cut by using new_i:
i = new_i['x']
#i= [[], [0, 0], [], [1, 2, 2]]
However, I need a unique version of this list, in which every index appears only once. This could be achieved with np.unique() in NumPy, but it doesn't work in awkward1:
np.unique(i)
<__array_function__ internals> in unique(*args, **kwargs)
TypeError: no implementation found for 'numpy.unique' on types that implement __array_function__: [<class 'awkward1.highlevel.Array'>]
My question:
Is there a np.unique() equivalent in awkward1, and/or would you recommend a different approach to my problem?
Answer:
Okay, I still don't know how to use np.unique() on my arrays, but I found a solution to my own problem:
In my previous approach I used the following code to pair up both arrays.
c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
However, with the nested = True parameter of ak.cartesian() I get a list grouped by the elements of a:
c = ak.cartesian({'x':a, 'y':b}, axis = 1, nested = True)
#c= [[[{'x': 1, 'y': 7}], [{'x': 2, 'y': 7}]], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[{'x': 4, 'y': 8}, {'x': 4, 'y': 9}], [{'x': 5, 'y': 8}, {'x': 5, 'y': 9}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]
After the cut I end up with:
new = c[cut]
#new= [[[], []], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[], [{'x': 5, 'y': 8}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]
I extract the y values and reduce the innermost layer of the nested lists of new to only one element:
y = new['y']
#y= [[[], []], [[3, 5]], [], [[], [8], [8, 9]]]
new = ak.firsts(new, axis = 2)
#new= [[None, None], [{'x': 3, 'y': 3}], [], [None, {'x': 5, 'y': 8}, {'x': 6, 'y': 8}]]
(I tried to use ak.firsts() with axis = -1, but it seems to be not implemented yet.)
Now every innermost entry in new belongs to exactly one element from a. By replacing the current y of new with the previously extracted y, I end up with my desired result:
new['y'] = y
#new= [[None, None], [{'x': 3, 'y': [3, 5]}], [], [None, {'x': 5, 'y': [8]}, {'x': 6, 'y': [8, 9]}]]
Anyway, should you know a better solution, I'd be pleased to hear it.
URL: https://stackoverflow.com/questions/64975326/awkward-array-add-attributes-in-intervals
Question:
I want to extract data out of a root file and get it into shape to end up with a numpy array/tensor to feed into a neural network. I am already able to get the track data I want into shape, with padding, to convert it into a numpy array, but I want to extend my array with the data of the jet they originated from. So I have the information of all the tracks, of each jet, and the interval of tracks they correspond to. My first instinct was to construct an array in the shape of the tracks and use something like np.dstack to merge those two.
import uproot4 as uproot
import numpy as np
import awkward1 as ak
def ak_into_np(ak_array):
    data=np.dstack([ak.to_numpy(x) for x in ak_array])
    return data

def get_data(filename,padding_size):
    f=uproot.open(filename)
    events= f["btagana/ttree;1"]
    track_data=events.arrays(filter_name=["Track_pt","Track_phi","Track_eta","Track_dxy","Track_dz","Track_charge"])
    jet_interval=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack"])
    jet_interval=jet_interval["Jet_nLastTrack"]-jet_interval["Jet_nFirstTrack"]
    jet_data=events.arrays(filter_name=["Jet_pt","Jet_phi","Jet_eta"])
    arrays_track=ak.unzip(ak.fill_none(ak.pad_none(track_data, padding_size), 0))
    arrays_interval=ak.unzip(ak.fill_none(ak.pad_none(jet_interval,padding_size),0))
    arrays_jet=ak.unzip(ak.fill_none(ak.pad_none(jet_data,padding_size),0))
    track=ak_into_np(arrays_track)
    jet=ak_into_np(arrays_jet)
    interval=ak_into_np(arrays_interval)
    return track,jet,interval
This is where I am so far. For efficiency reasons, I hope to be able to achieve this in awkward before going into numpy. I tried it in numpy with the following:
def extend(track,jet,interval):
    events,tracks,varstrack=(np.shape(track))
    events,jets,varsjet=np.shape(jet)
    jet_into_track_data=[]
    for i in range(events):
        dataloop=[]
        for k in range(jets):
            if interval[i][k][0]!=0 :
                dataloop.append(np.broadcast_to(jet[i][k],(interval[i][k][0],varsjet)))
            else:
                jet_into_track_data.append(dataloop)
    return jet_into_track_data
but it already takes about 3 seconds for only 2000 events, without even achieving my goal. The aim is basically [track_variables] -> [track_variables, jet_variables if track is in interval]
and it shall be stored as [(event1)[[track_1],...,[track_padding_size]], ..., (eventn)[[track_1],...,[track_padding_size]]]
Answer:
I don't get to see the structure of your original data and I don't have a clear notion of what the desired final state is, but I can give an example inspired by the above that you can adapt. Also, I'm going to ignore the padding because that only complicates things. You'll probably want to put off padding until you've finished the combinatorics.
The tracks
and jets
below come from the same set of events (i.e. the arrays have the same lengths and they're jagged, with a different number of tracks in each list and a different number of jets in each list). Since jets
are somehow derived from tracks
, there are strictly fewer jets than tracks, and I take it that in your problem, the link between them is such that each jet corresponds to a contiguous, non-overlapping set of tracks
(the "intervals").
Real tracks and jets would have a lot more properties than these—this is a bare minimum example. I've given the tracks an "id"
so we can tell one from another, but these jets only have the inclusive "start"
index and exclusive "stop"
index.
If your tracks don't come with identifiers, you can add them with ak.local_index.
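For example (a small sketch with made-up pt values, using the same awkward1 import as the code below):
>>> pts = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> ak.local_index(pts, axis=1)
<Array [[0, 1, 2], [], [0, 1]] type='3 * var * int64'>
>>> tracks = ak.zip({"pt": pts, "id": ak.local_index(pts, axis=1)})
>>> tracks.tolist()
[[{'pt': 1.1, 'id': 0}, {'pt': 2.2, 'id': 1}, {'pt': 3.3, 'id': 2}],
 [],
 [{'pt': 4.4, 'id': 0}, {'pt': 5.5, 'id': 1}]]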
>>> import awkward1 as ak
>>> tracks = ak.Array([[{"id": 0}, {"id": 1}, {"id": 2}],
... [],
... [{"id": 0}, {"id": 1}]])
>>> jets = ak.Array([[{"start": 0, "stop": 2}, {"start": 2, "stop": 3}],
... [],
... [{"start": 1, "stop": 2}, {"start": 0, "stop": 1}]])
If you had all combinations between tracks
and jets
in each event, then you could use a slice to pick the ones that match. This is particularly useful when you have imprecise matches (that you have to match with ΔR or something). The ak.cartesian function produces lists of combinations, and nested=True
groups the results by the first argument:
>>> all_combinations = ak.cartesian([tracks, jets], nested=True)
>>> all_combinations.tolist()
[[[({'id': 0}, {'start': 0, 'stop': 2}),
({'id': 0}, {'start': 2, 'stop': 3})],
[({'id': 1}, {'start': 0, 'stop': 2}),
({'id': 1}, {'start': 2, 'stop': 3})],
[({'id': 2}, {'start': 0, 'stop': 2}),
({'id': 2}, {'start': 2, 'stop': 3})]],
[[]],
[[({'id': 0}, {'start': 1, 'stop': 2}),
({'id': 0}, {'start': 0, 'stop': 1})],
[({'id': 1}, {'start': 1, 'stop': 2}),
({'id': 1}, {'start': 0, 'stop': 1})]]]
We can go from there, selecting "id"
values that are between the "start"
and "stop"
values. I started writing up a solution, but the slicing gets rather complicated: generating all Cartesian combinations is more computationally expensive than is strictly needed for this problem (though nowhere near as expensive as writing a for loop!), and the general view of the Cartesian product is more useful for approximate matches than for the exact indexes you have.
Instead, let's write a for loop in Numba, a just-in-time compiler for Python. Numba is limited in the Python that it can compile (all types must be known at compile-time), but it can recognize read-only Awkward Arrays and append-only ak.ArrayBuilder.
Here's a loop that considers only tracks
and jets
in the same event, loops over tracks, and puts the first jet
that matches each track
into an output ArrayBuilder.
>>> import numba as nb
>>> @nb.njit
... def match(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for track in tracks:
...             for jet in jets:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(jet)  # at most one
...                     break
...         output.end_list()
...     return output
...
>>> builder = match(tracks, jets, ak.ArrayBuilder())
>>> builder.snapshot().tolist()
[[{'start': 0, 'stop': 2}, {'start': 0, 'stop': 2}, {'start': 2, 'stop': 3}],
[],
[{'start': 2, 'stop': 3}, {'start': 0, 'stop': 2}]]
Notice that these jet objects are duplicated to match the appropriate track. (This "duplication" is actually just a pointer, not really a copy.) To attach this to the tracks, you can assign it:
>>> tracks["jet"] = builder.snapshot()
>>> tracks.tolist()
[[{'id': 0, 'jet': {'start': 0, 'stop': 2}},
{'id': 1, 'jet': {'start': 0, 'stop': 2}},
{'id': 2, 'jet': {'start': 2, 'stop': 3}}],
[],
[{'id': 0, 'jet': {'start': 2, 'stop': 3}},
{'id': 1, 'jet': {'start': 0, 'stop': 2}}]]
Here, I've assumed that you want to attach a jet to each track—perhaps you wanted to attach the set of all associated tracks to each jet:
>>> @nb.njit
... def match2(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for jet in jets:
...             output.begin_list()
...             for track in tracks:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(track)  # all tracks
...             output.end_list()
...         output.end_list()
...     return output
...
>>> jets["tracks"] = match2(tracks, jets, ak.ArrayBuilder()).snapshot()
>>> jets.tolist()
[[{'start': 0, 'stop': 2, 'tracks': [
{'id': 0, 'jet': {'start': 0, 'stop': 2}},
{'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
{'start': 2, 'stop': 3, 'tracks': [
{'id': 2, 'jet': {'start': 2, 'stop': 3}}]}],
[],
[{'start': 1, 'stop': 2, 'tracks': [
{'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
{'start': 0, 'stop': 1, 'tracks': [
{'id': 0, 'jet': {'start': 0, 'stop': 2}}]}]]
(Since I did both, now there are links both ways.) Attaching jets to tracks, rather than tracks to jets, has the added complication that some tracks might not be associated with any jet, in which case you'd have to account for possibly-missing data. (Hint: make them lists of zero or one jet for "no match" and "yes match," then use ak.firsts to convert empty lists to Nones.)
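To make that hint concrete, here is one possible variant (a sketch along the same lines as the functions above, reusing the tracks, jets, nb, and ak already defined; it is not taken from the original answer): each track gets a list of zero or one jets, and ak.firsts turns those lists into a jet-or-None.
>>> @nb.njit
... def match_or_none(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for track in tracks:
...             output.begin_list()          # zero-or-one-element list per track
...             for jet in jets:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(jet)
...                     break
...             output.end_list()
...         output.end_list()
...     return output
...
>>> maybe_jets = match_or_none(tracks, jets, ak.ArrayBuilder()).snapshot()
>>> tracks["maybe_jet"] = ak.firsts(maybe_jets, axis=2)   # empty lists become None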
Or you could make the output of the Numba-compiled function be plain NumPy arrays. Numba knows a lot of NumPy functions. A hint for building up a Numba-compiled function: start with a minimal loop over the data that doesn't do anything, and test-run it as you gradually add output. Numba complains with a lot of output when it can't recognize the types of something (which is allowed in non-compiled Python), so it's good to know which change caused it to complain so much.
Hopefully, these examples can get you started!
Question:
I have a large nested dictionary with jagged list 'c' :
x = {'first_block':
{'unit1': {'a': (3,5,4), 'b': 23, 'c': [10]},
'unit2': {'a': (5,8,7), 'b': 15, 'c': [20,10]},
'unit10k': {'a': (2,4,9), 'b': 10, 'c': [6,10,20,5]}},
'second_block':
{'unit1' : {'a': (8,20,14), 'b': 10, 'c': [17,12,9]},
'unit2' : {'a': (9,25,50), 'b': 15, 'c': [17,15,9,4,12]},
'unit12k': {'a': (12,24,9), 'b': 23, 'c': [12,22,15,4]}},
'millionth_block':
{'unit1': {'a': (35,64,85), 'b': 64, 'c': [50]},
'unit2': {'a': (56,23,34), 'b': 55, 'c': [89,59,77]},
'unit5k': {'a': (90,28,12), 'b': 85, 'c': [48,90,27,59]}}}
The elements of 'c' are point labels.
For every unique point label in 'c' I want to produce a filtered list of the corresponding value in 'b',
so for example 'first_block' has unique elements in 'c' of: 5, 6, 10, 20
and I want to obtain/extract the following lists for each 'block', listing each value of 'b' associated with a specific value in 'c', e.g.
first_block:
5: [10]
6: [10]
10: [10,15,23]
20: [10,15]
second_block:
4: [15,23]
9: [10,15]
12: [10,15,23]
15: [15,23]
17: [10,15]
22: [23]
etc.
Any thoughts on how to create this outcome given that 'c' is jagged?
I have been trying to do this by converting to Awkward Arrays, but the documentation is currently sparse and I don't really understand how to do this in Awkward.
I am also open to Pythonic suggestions which don't involve Awkward.
Answer:
I'll attempt to give an Awkward Array solution to this.
First of all, we need to know what in your dataset is really scaling up. I understand that you could have millions of blocks, but that means that you shouldn't represent them with a dict that has unique string-valued keys. The cost of creating strings, comparing strings, and looking things up by strings (though that's helped quite a bit by the dict's hash table) is not a good way to scale, especially if the only information in those strings is an ordering.
It also seems to me that you have a variable number of units (judging by the fact that the last ones are named "unit10k", "unit12k", and "unit5k").
Finally, it took me several re-reads of your problem statement, but it seems that the field named "a" doesn't enter into the problem. I'll ignore it.
Therefore, I'd say your data structure would be better as:
as_python = [
# first block
[
# unit 1
{"b": 23, "c": [10]},
# unit 2
{"b": 15, "c": [20, 10]},
# ...
# unit 10k
{"b": 10, "c": [6, 10, 20, 5]},
],
# second block
[
# unit 1
{"b": 10, "c": [17, 12, 9]},
# unit 2
{"b": 15, "c": [17, 15, 9, 4, 12]},
# ...
# unit 12k
{"b": 23, "c": [12, 22, 15, 4]},
],
# ...
# millionth block
[
# unit 1
{"b": 64, "c": [50]},
# unit 2
{"b": 55, "c": [89, 59, 77]},
# ...
# unit 15k
{"b": 85, "c": [48, 90, 27, 59]},
],
]
A Pythonic solution to this would be
results = []
for block in as_python:
    block_result = {}
    for unit in block:
        b = unit["b"]  # only look up b once
        for c in unit["c"]:
            if c not in block_result:
                block_result[c] = []
            block_result[c].append(b)
    results.append(block_result)
which results in
[
{10: [23, 15, 10], 20: [15, 10], 6: [10], 5: [10]},
{17: [10, 15], 12: [10, 15, 23], 9: [10, 15], 15: [15, 23], 4: [15, 23], 22: [23]},
{50: [64], 89: [55], 59: [55, 85], 77: [55], 48: [85], 90: [85], 27: [85]},
]
which should be pretty fast. (Short loops using only builtin Python types are surprisingly fast for an uncompiled, dynamically typed virtual machine. In my experience, using only builtin Python types is key.)
As for Awkward Array, I can only get halfway there using vectorized (i.e. "NumPy-like") operations before invoking Numba (JIT-compiled for
loops), which reveals that the library needs a "groupby" function.
Of course, you can do the whole thing in Numba, which tends to be the fastest solution overall, but this problem depends crucially on grouping and uniqueness, which in Python is most naturally done with dicts or sets, but it's hard to get Numba to deduce the type of anything beyond pure numbers and arrays. (In order to JIT compile the code and make it fast, Numba must know their types, but Python type annotations are recent enough that they're still being incorporated into Numba.)
So let's start by making the above data structure an Awkward Array. Depending on your data source, there are different ways to do this (this part of the documentation is complete), but I'll just let ak.Array
iterate over the above Python data structure.
>>> import awkward as ak
>>> as_awkward = ak.Array(as_python)
Ordinarily, if we have a "all combinations of X and Y" problem, we want to use ak.cartesian. We can't use it right away here because the "b" data and "c" data have different depths:
>>> as_awkward = ak.Array(as_python)
>>> as_awkward.b
<Array [[23, 15, 10], ... 23], [64, 55, 85]] type='3 * var * int64'>
>>> as_awkward.c
<Array [[[10], [20, 10], ... [48, 90, 27, 59]]] type='3 * var * var * int64'>
>>> as_awkward.type
3 * var * {"b": int64, "c": var * int64}
>>> as_awkward.b.type
3 * var * int64
>>> as_awkward.c.type
3 * var * var * int64
To make them match, we can create a new axis using np.newaxis
(a.k.a. None
, but I like to be explicit):
>>> import numpy as np
>>> bprime = as_awkward.b[:, :, np.newaxis]
>>> bprime
<Array [[[23], [15], [10, ... 64], [55], [85]]] type='3 * var * 1 * int64'>
>>> bprime.type
3 * var * 1 * int64
>>> as_awkward.c.type
3 * var * var * int64
Now we can take the Cartesian products within each unit to make lists of b-c pairs (axis=2
).
>>> combinations = ak.cartesian({"c": as_awkward.c, "b": bprime}, axis=2)
>>> combinations
<Array [[[{c: 10, b: 23}], ... c: 59, b: 85}]]] type='3 * var * var * {"c": int6...'>
>>> combinations.type
3 * var * var * {"c": int64, "b": int64}
>>> combinations.tolist()
[
[
[
[{"c": 10, "b": 23}]
],
[
[{"c": 20, "b": 15}],
[{"c": 10, "b": 15}]],
[
[{"c": 6, "b": 10}],
[{"c": 10, "b": 10}],
[{"c": 20, "b": 10}],
[{"c": 5, "b": 10}],
],
],
[
[
[{"c": 17, "b": 10}],
[{"c": 12, "b": 10}],
[{"c": 9, "b": 10}]
],
[
[{"c": 17, "b": 15}],
[{"c": 15, "b": 15}],
[{"c": 9, "b": 15}],
[{"c": 4, "b": 15}],
[{"c": 12, "b": 15}],
],
[
[{"c": 12, "b": 23}],
[{"c": 22, "b": 23}],
[{"c": 15, "b": 23}],
[{"c": 4, "b": 23}],
],
],
[
[
[{"c": 50, "b": 64}]
],
[
[{"c": 89, "b": 55}],
[{"c": 59, "b": 55}],
[{"c": 77, "b": 55}]
],
[
[{"c": 48, "b": 85}],
[{"c": 90, "b": 85}],
[{"c": 27, "b": 85}],
[{"c": 59, "b": 85}],
],
],
]
Now we have too much structure: as long as the "b" and "c" values are properly connected, you want to mix all units in a block because you're interested in uniqueness of "c" within each block.
>>> flattened = ak.flatten(combinations, axis=-1)
>>> flattened
<Array [[{c: 10, b: 23}, ... c: 59, b: 85}]] type='3 * var * {"c": int64, "b": i...'>
>>> flattened.type
3 * var * {"c": int64, "b": int64}
>>> flattened.tolist()
[
[
{"c": 10, "b": 23},
{"c": 20, "b": 15},
{"c": 10, "b": 15},
{"c": 6, "b": 10},
{"c": 10, "b": 10},
{"c": 20, "b": 10},
{"c": 5, "b": 10},
],
[
{"c": 17, "b": 10},
{"c": 12, "b": 10},
{"c": 9, "b": 10},
{"c": 17, "b": 15},
{"c": 15, "b": 15},
{"c": 9, "b": 15},
{"c": 4, "b": 15},
{"c": 12, "b": 15},
{"c": 12, "b": 23},
{"c": 22, "b": 23},
{"c": 15, "b": 23},
{"c": 4, "b": 23},
],
[
{"c": 50, "b": 64},
{"c": 89, "b": 55},
{"c": 59, "b": 55},
{"c": 77, "b": 55},
{"c": 48, "b": 85},
{"c": 90, "b": 85},
{"c": 27, "b": 85},
{"c": 59, "b": 85},
],
]
Sorting and uniqueness are at the bleeding edge of Awkward Array's capabilities; there are some functions that have been written internally that just haven't been exposed in Python, let alone documented. Fortunately, ak.sort and ak.argsort are documented. We ultimately want to group these b-c pairs by c; for now, we can at least sort them:
>>> sorted = flattened[ak.argsort(flattened.c, axis=-1)]
>>> sorted
<Array [[{c: 5, b: 10}, ... {c: 90, b: 85}]] type='3 * var * {"c": int64, "b": i...'>
>>> sorted.type
3 * var * {"c": int64, "b": int64}
>>> sorted.tolist()
[
[
{"c": 5, "b": 10},
{"c": 6, "b": 10},
{"c": 10, "b": 23},
{"c": 10, "b": 15},
{"c": 10, "b": 10},
{"c": 20, "b": 15},
{"c": 20, "b": 10},
],
[
{"c": 4, "b": 15},
{"c": 4, "b": 23},
{"c": 9, "b": 10},
{"c": 9, "b": 15},
{"c": 12, "b": 10},
{"c": 12, "b": 15},
{"c": 12, "b": 23},
{"c": 15, "b": 15},
{"c": 15, "b": 23},
{"c": 17, "b": 10},
{"c": 17, "b": 15},
{"c": 22, "b": 23},
],
[
{"c": 27, "b": 85},
{"c": 48, "b": 85},
{"c": 50, "b": 64},
{"c": 59, "b": 55},
{"c": 59, "b": 85},
{"c": 77, "b": 55},
{"c": 89, "b": 55},
{"c": 90, "b": 85},
],
]
And now we really, really wish we had a "groupby" function, but there just isn't one. It would be a natural addition and perhaps should be a feature request.
So at this point, we switch to Numba. The problem is much simpler than the original problem because the data to group are already sorted: we just need to see when a value changes to insert boundaries. Awkward Arrays can be passed as arguments into Numba, but they're immutable. To make new arrays, use an ak.ArrayBuilder. Note: when developing a function like this, do it in small steps and remove @nb.njit
to try it without JIT-compilation (to be sure you're making the right thing before trying to solve type errors).
import numba as nb

@nb.njit
def groupby(input, output):
    for block in input:
        output.begin_list()
        output.begin_list()
        last = block[0].c  # note: assumes that len(block) >= 1
        for unit in block:
            if unit.c != last:
                output.end_list()
                output.begin_list()
            output.append(unit)
            last = unit.c
        output.end_list()
        output.end_list()
    return output
The input
is the array we just made (sorted
), and the output
is an ArrayBuilder that turns into a real array with snapshot
(outside of Numba).
>>> grouped = groupby(sorted, ak.ArrayBuilder()).snapshot()
>>> grouped
<Array [[[{c: 5, b: 10}], ... {c: 90, b: 85}]]] type='3 * var * var * {"c": int6...'>
>>> grouped.type
3 * var * var * {"c": int64, "b": int64}
>>> grouped.tolist()
[
[
[{"c": 5, "b": 10}],
[{"c": 6, "b": 10}],
[{"c": 10, "b": 23}, {"c": 10, "b": 15}, {"c": 10, "b": 10}],
[{"c": 20, "b": 15}, {"c": 20, "b": 10}],
],
[
[{"c": 4, "b": 15}, {"c": 4, "b": 23}],
[{"c": 9, "b": 10}, {"c": 9, "b": 15}],
[{"c": 12, "b": 10}, {"c": 12, "b": 15}, {"c": 12, "b": 23}],
[{"c": 15, "b": 15}, {"c": 15, "b": 23}],
[{"c": 17, "b": 10}, {"c": 17, "b": 15}],
[{"c": 22, "b": 23}],
],
[
[{"c": 27, "b": 85}],
[{"c": 48, "b": 85}],
[{"c": 50, "b": 64}],
[{"c": 59, "b": 55}, {"c": 59, "b": 85}],
[{"c": 77, "b": 55}],
[{"c": 89, "b": 55}],
[{"c": 90, "b": 85}],
],
]
Then you can fiddle with that output to get a desired structure. If you want to have a scalar "c" for each list "b", you'll want to use a depth_limit
to keep ak.zip from broadcasting it down to the deepest level.
>>> ak.zip({"c": grouped.c[:, :, 0], "b": grouped.b}, depth_limit=2).tolist()
[
[
{"c": 5, "b": [10]},
{"c": 6, "b": [10]},
{"c": 10, "b": [23, 15, 10]},
{"c": 20, "b": [15, 10]},
],
[
{"c": 4, "b": [15, 23]},
{"c": 9, "b": [10, 15]},
{"c": 12, "b": [10, 15, 23]},
{"c": 15, "b": [15, 23]},
{"c": 17, "b": [10, 15]},
{"c": 22, "b": [23]},
],
[
{"c": 27, "b": [85]},
{"c": 48, "b": [85]},
{"c": 50, "b": [64]},
{"c": 59, "b": [55, 85]},
{"c": 77, "b": [55]},
{"c": 89, "b": [55]},
{"c": 90, "b": [85]},
],
]
If you want to build a report of that type in Numba directly, you can do it. (This is just to illustrate what the "begin_list" and such are doing—it's like "printing out" a structure, imagining output.begin_list()
to print a "["
and output.begin_record()
to print a "{"
, etc.)
@nb.njit
def groupby(input, output):
    for block in input:
        output.begin_list()
        output.begin_record()
        output.field("c").integer(block[0].c)
        output.field("b")
        output.begin_list()
        last = block[0].c
        for unit in block:
            if unit.c != last:
                output.end_list()
                output.end_record()
                output.begin_record()
                output.field("c").integer(unit.c)
                output.field("b")
                output.begin_list()
            output.integer(unit.b)
            last = unit.c
        output.end_list()
        output.end_record()
        output.end_list()
    return output
and
>>> grouped = groupby(sorted, ak.ArrayBuilder()).snapshot()
>>> grouped
<Array [[{c: 5, b: [10]}, ... c: 90, b: [85]}]] type='3 * var * {"c": int64, "b"...'>
>>> grouped.type
3 * var * {"c": int64, "b": var * int64}
>>> grouped.tolist()
# it's the same
As I said above, the absolutely fastest solution is likely to be doing everything in Numba. The Cartesian products, flattening, and sorting all create partially new arrays (reusing as much from the inputs as possible, which is why the Awkward Arrays have to be immutable), which involves allocations and multiple passes over the data. But it's hard to express problems involving dicts, lists, and sets in Numba because it needs typed dicts, lists, and sets. I tried using Numba's typed dict, but it complained about the values of the dict being lists, which are not hashable. (Dict values do not need to be hashable, so I wonder what's going on there.) Just as uniqueness and sorting are the bleeding edge of Awkward Array, typing dicts is the bleeding edge of Numba since the concept of typing in Python is itself rather new.
I didn't performance-test any of these proposed solutions.
URL: https://stackoverflow.com/questions/67436879/awkward1-how-to-set-array-dimension-as-variable
Question:
So the aim is to save some arrays as Parquet files. I can use a Python debugger to reach the point in my code where they are ready for saving. Inside my complicated mess of code they look like this:
ipdb> ak.__version__
'1.2.2'
ipdb> array1
<Array [... 0., 1.]], [[50.], [4., 47.]]]] type='1 * 3 * var * var * float64'>
ipdb> array2
<Array [[False, True, True]] type='1 * 3 * bool'>
If I try to save them it doesn't work, the error I get is
ipdb> group = ak.zip({'a': array1, 'b': array2}, depth_limit=1)
ipdb> ak.to_parquet(group, 'test.parquet')
*** ValueError: could not broadcast input array from shape (3) into shape (1)
So I start messing around in the terminal to try and recreate the problem and debug it, but I actually cannot replicate it. Here is what happens;
In [1]: import awkward as ak
In [2]: ak.__version__
'1.2.2'
In [3]: cat = ak.from_iter([[True, False, True]])
In [4]: dog = ak.from_iter([[[], [[50.0], [0.2, 0.1, 0., 0., 0.1]], [[50.0], [21., 0.1, 0.]]]])
In [5]: pets = ak.zip({'dog':dog, 'cat':cat}, depth_limit=1)
In [6]: ak.to_parquet(pets, "test.parquet")
In [7]: # no problems
In [8]: cat
<Array [[False, True, True]] type='1 * var * bool'>
Notice that the dimensions have changed from 1 * 3 * bool
to 1 * var * bool
. That seems to be the only difference - but I cant seem to work out how to control this?
Having managed to isolate the issue, I found it wasn't what I thought it was.
The problem comes when using np.newaxis
to make a new axis in a boolean array, then trying to save it.
dog = ak.from_iter([1, 2, 3])[np.newaxis]
pets = {"dog": dog}
zipped = ak.zip(pets, depth_limit=1)
ak.to_parquet(zipped, "test.parquet")
# works fine
dog = ak.from_iter([True, False, True])[np.newaxis]
pets = {"dog": dog}
zipped = ak.zip(pets, depth_limit=1)
ak.to_parquet(zipped, "test.parquet")
# Gives
ValueError: could not broadcast input array from shape (3) into shape (1)
I should really know better than to post a question without isolating the problem first. Apologies for wasting your time.
Answer:
In ak.zip, depth_limit=1
means that the arrays are not deeply matched ("zipped") together: the only constraint is that len(array1) == len(array2)
. Is this not satisfied?
In your pets
example, len(cat) == 1
and len(dog) == 1
. Since you're asking for depth_limit=1
, it doesn't matter that len(cat[0]) == len(dog[0])
, though in this case it does (they're both 3
). Thus, it would be possible to zip these at depth_limit=2
, even though that's not what you're asking for.
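A toy example (not the asker's data) of the difference between depth_limit=1 and the default, fully zipped behavior:
>>> import awkward as ak
>>> a = ak.Array([[1.1, 2.2], [3.3]])
>>> b = ak.Array([[True, False], [True]])
>>> ak.zip({"a": a, "b": b}, depth_limit=1).type   # only the outer lengths must match
2 * {"a": var * float64, "b": var * bool}
>>> ak.zip({"a": a, "b": b}).type                  # fully zipped: the inner lists must match, too
2 * var * {"a": float64, "b": bool}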
Since the error message is saying that the mismatching lengths of array1
and array2
are 3
and 1
, that should be easy to inspect in the debugger:
array1[0]
array1[1]
array1[2] # should be the last one
array2[0] # should be the only one
I hope this sheds some light on your problem!
Looking more closely, I see that you're telling me that you know the lengths of array1
and array2
. They're both length 1
. There should be no trouble zipping them at depth_limit=1
.
You can make your pets
example have exactly the right types by calling ak.to_regular on that axis:
>>> cat = ak.to_regular(ak.from_iter([[True, False, True]]), axis=1)
>>> dog = ak.to_regular(ak.from_iter([[[], [[50.0], [0.2, 0.1, 0., 0., 0.1]], [[50.0], [21., 0.1, 0.]]]]), axis=1)
>>> cat
<Array [[True, False, True]] type='1 * 3 * bool'>
>>> dog
<Array [... 0, 0.1]], [[50], [21, 0.1, 0]]]] type='1 * 3 * var * var * float64'>
So the types are exactly 1 * 3 * bool
and 1 * 3 * var * var * float64
. Zipping works:
>>> pets = ak.zip({'dog':dog, 'cat':cat}, depth_limit=1)
>>> pets
<Array [... 0]]], cat: [True, False, True]}] type='1 * {"dog": var * var * var *...'>
>>> pets.type
1 * {"dog": var * var * var * float64, "cat": var * bool}
Maybe the array1
and array2
you think you're working with are not what you're really working with?
Question:
I have a multiindex pandas dataframe that looks like this (called p_z):
p_z
entry subentry
0 0 0.338738
1 0.636035
2 -0.307365
3 -0.167779
4 0.243284
... ...
26692 891 -0.459227
892 0.055993
893 -0.469857
894 0.192554
895 0.155738
[11742280 rows x 1 columns]
I want to be able to select certain rows based on another dataframe (or numpy array) which is multidimensional. It would look like this as a pandas dataframe (called tofpid):
tofpid
entry subentry
0 0 0
1 2
2 4
3 5
4 7
... ...
26692 193 649
194 670
195 690
196 725
197 737
[2006548 rows x 1 columns]
I also have it as an awkward array, where it's a (26692, ) array (each of the entries has a non-standard number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.
I can't find a way to get this done in pandas. I'm new to pandas, and even newer to MultiIndex, but I feel there ought to be a way to do this. If it can be broadcast, even better, as I'll be doing this over ~1500 dataframes of similar size. If it helps, these dataframes are from a *.root file imported using uproot (if there's another way to do this without pandas, I'll take it, but I would love to use pandas to keep things organised).
Edit: Here's a reproducible example (courtesy of Jim Pivarski's answer; thanks!).
import awkward as ak
import pandas as pd
>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,
0.338738, 0.636035],
[-0.459227, 0.055993, -0.469857, 0.192554, 0.155738,
-0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)
Both of these dataframes are produced natively in uproot, but this will reproduce the same dataframes that uproot would (using the awkward library).
Answer:
Here's a reproducible example with enough structure to represent the problem (using the awkward library):
>>> import awkward as ak
>>>
>>> p_z = ak.Array([
... [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
... [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227],
... ])
>>> p_z
<Array [[0.339, 0.636, ... 0.156, -0.459]] type='2 * var * float64'>
>>>
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid
<Array [[0, 2, 4, 5], [1, 2, 4]] type='2 * var * int64'>
In Pandas form, this is:
>>> df_p_z = ak.to_pandas(p_z)
>>> df_p_z
values
entry subentry
0 0 0.338738
1 0.636035
2 -0.307365
3 -0.167779
4 0.243284
5 0.338738
6 0.636035
1 0 -0.459227
1 0.055993
2 -0.469857
3 0.192554
4 0.155738
5 -0.459227
>>> df_tofpid = ak.to_pandas(tofpid)
>>> df_tofpid
values
entry subentry
0 0 0
1 2
2 4
3 5
1 0 1
1 2
2 4
As an Awkward Array, what you want to do is slice the first array by the second. That is, you want p_z[tofpid]
:
>>> p_z[tofpid]
<Array [[0.339, -0.307, ... -0.47, 0.156]] type='2 * var * float64'>
>>> p_z[tofpid].tolist()
[[0.338738, -0.307365, 0.243284, 0.338738], [0.055993, -0.469857, 0.155738]]
Using Pandas, I managed to do it with this:
>>> df_p_z.loc[df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()]
values
entry subentry
0 0 0.338738
2 -0.307365
4 0.243284
5 0.338738
1 1 0.055993
2 -0.469857
4 0.155738
What's happening here is that df_tofpid.reset_index(level=0)
turns the "entry"
part of the MultiIndex into a column, then apply
executes a Python function on each row if axis=1
, each row is x.values
, and tolist()
turns the result into a list of tuples like
>>> df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()
[(0, 0), (0, 2), (0, 4), (0, 5), (1, 1), (1, 2), (1, 4)]
This is what loc
needs to select entry/subentry pairs from its MultiIndex.
My Pandas solution has two disadvantages: it's complicated, and it goes through Python iteration and objects, which doesn't scale as well as arrays. There is a good chance that a Pandas expert would find a better solution than mine. There's a lot I don't know about Pandas.
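One possible simplification (an untested sketch, using the df_tofpid and df_p_z from above): build the MultiIndex of labels directly from the "entry" level and the selected subentry values, instead of going through apply:
>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays(
...     [df_tofpid.index.get_level_values("entry"), df_tofpid["values"]],
...     names=["entry", "subentry"],
... )
>>> df_p_z.loc[idx]
This still materializes the label pairs, but it avoids the per-row Python function call in apply.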
Question:
I wish to sum all the 4-momenta of the constituents in a jet. In uproot3 (+ uproot3-methods) there was the functionality of creating a TLorentzVectorArray and just doing .sum()
So this worked fine:
import uproot3
import uproot3_methods
import awkward0 as ak

input_file = uproot3.open(input_path)
tree = input_file['Jets']
pt = tree.array('Constituent_pt')
phi = tree.array('Constituent_phi')
eta = tree.array('Constituent_eta')
energy = tree.array('Constituent_energy')
mass = tree.array('Constituent_mass')
p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphie(pt, eta, phi, energy)
jet_p4_u3 = p4.sum()
jet_pt_u3 = jet_p4_u3.pt
jet_eta_u3 = jet_p4_u3.eta
jet_phi_u3 = jet_p4_u3.phi
jet_energy_u3 = jet_p4_u3.energy
However, since uproot3 is deprecated, the way to go according to https://stackoverflow.com/questions/66364294/tlorentz-vector-in-uproot-4 seems to be the vector package. What I tried was the following.
import uproot
import awkward as ak
import vector
input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = vector.awk({"pt": pt, "phi": phi, "eta": eta, "energy": energy})
The problem now is that this functionality p4.sum()
seems not to exist there. The other possibility that I found was shown in the vector discussion #117. So now, after the imports, I add vector.register_awkward()
and to the end jet_p4_u4 = ak.Array(p4, with_name="Momentum4D")
,
import uproot
import awkward as ak
import vector
vector.register_awkward()
input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = ak.Array({"pt": pt, "phi": phi, "eta": eta, "energy": energy})
jet_p4_u4 = ak.Array(p4, with_name="Momentum4D")
The question remains, how do I sum the 4-momenta? When doing ak.sum(jet_p4_u4, axis=-1), only pt and energy seem to have the correct values, eta and phi however are completely different from the result from uproot3.
Update: It seems that since the ak.sum function is not able to add together the angles in the wanted way, replacing the summing part with summing x, y, z, and energy and constructing the vector as follows solves the problem. However, I believe there must be a better way than this. So the current working version is:
import uproot
import awkward as ak
import vector
input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = vector.awk({"pt": pt, "phi": phi, "eta": eta, "energy": energy})
p4_lz = vector.awk({"x": p4.x, "y": p4.y, "z": p4.z, "t": energy})
lz_sum = ak.sum(p4_lz, axis=-1)
jet_p4 = vector.awk({
"x": lz_sum.x,
"y": lz_sum.y,
"z": lz_sum.z,
"t": lz_sum.t
})
jet_energy = jet_p4.t
jet_mass = jet_p4.tau
jet_phi = jet_p4.phi
jet_pt = jet_p4.rho
Answer:
For a solution that works equally well for flat arrays of Lorentz vectors as for jagged arrays of Lorentz vectors, try this:
import uproot
import awkward as ak
import vector
vector.register_awkward() # any record named "Momentum4D" will be Lorentz
with uproot.open(input_path) as input_file:
    tree = input_file["Jets"]
    arrays = tree.arrays(filter_name="Constituent_*")

p4 = ak.zip({
    "pt": arrays.Constituent_pt,
    "phi": arrays.Constituent_phi,
    "eta": arrays.Constituent_eta,
    "energy": arrays.Constituent_energy,
}, with_name="Momentum4D")

jet_p4 = ak.zip({
    "px": ak.sum(p4.px, axis=-1),
    "py": ak.sum(p4.py, axis=-1),
    "pz": ak.sum(p4.pz, axis=-1),
    "energy": ak.sum(p4.energy, axis=-1),
}, with_name="Momentum4D")
Note that the uproot.TTree.arrays function, if given no arguments, will read all TBranches in the TTree. In your code, you read all the data five times, each time selecting a different column from the data that had been read and throwing the rest out.
Also, I don't like the vector.awk
function because it can construct arrays of type:
N * Momentum4D[px: var * float64, py: var * float64, pz: var * float64, E: var * float64]
(in other words, each "px" value is a list of floats), rather than what you want:
N * var * Momentum4D[px: float64, py: float64, pz: float64, E: float64]
ak.zip combines the lists so that the "px" of each Lorentz vector is just a number, but you can have nested lists of Lorentz vectors. This only makes a difference if you have jagged arrays, but I'm pointing it out so that no one falls into this trap.
The with_name="Momentum4D"
argument labels the records with that name, and having Lorentz-vector behaviors registered with vector.register_awkward()
gives all such records Lorentz vector methods. In this case, we're using it so that p4
, defined in terms of pt
, phi
, eta
, energy
, has properties px
, py
, pz
—in other words, doing coordinate transformations on demand.
There isn't a Lorentz vector summation method that sums over each item in a jagged array (the uproot-methods one was a hack that only works for jagged arrays of Lorentz vectors, no other structures, like jagged-jagged, etc.), so sum the components with ak.sum in Cartesian coordinates.
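For completeness (this follow-up is not part of the original answer): once the summed records carry the "Momentum4D" behavior, the usual kinematic properties are computed on demand, so the quantities from the question's working version can be read off directly.
# the summed vectors expose pt/eta/phi/mass as derived properties
jet_pt = jet_p4.pt
jet_eta = jet_p4.eta
jet_phi = jet_p4.phi
jet_mass = jet_p4.mass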
Question:
I'm trying to output a TTree with the same general format as an input TTree which has the structure:
ttree.show(datatypes.keys())
name | typename | interpretation
---------------------+--------------------------+-------------------------------
Weight | float | AsDtype('>f4')
E_Beam | float | AsDtype('>f4')
Px_Beam | float | AsDtype('>f4')
Py_Beam | float | AsDtype('>f4')
Pz_Beam | float | AsDtype('>f4')
NumFinalState | int32_t | AsDtype('>i4')
E_FinalState | float[] | AsJagged(AsDtype('>f4'))
Px_FinalState | float[] | AsJagged(AsDtype('>f4'))
Py_FinalState | float[] | AsJagged(AsDtype('>f4'))
Pz_FinalState | float[] | AsJagged(AsDtype('>f4'))
The NumFinalState
branch contains the number of elements in all of the *_FinalState
array branches. This will always be the case in my work, so it seems wasteful to do the following:
outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
outfile.mktree("kin", datatypes)
outfile["kin"].show()
name | typename | interpretation
---------------------+--------------------------+-------------------------------
Weight | float | AsDtype('>f4')
E_Beam | float | AsDtype('>f4')
Px_Beam | float | AsDtype('>f4')
Py_Beam | float | AsDtype('>f4')
Pz_Beam | float | AsDtype('>f4')
NumFinalState | int32_t | AsDtype('>i4')
nE_FinalState | int32_t | AsDtype('>i4')
E_FinalState | float[] | AsJagged(AsDtype('>f4'))
nPx_FinalState | int32_t | AsDtype('>i4')
Px_FinalState | float[] | AsJagged(AsDtype('>f4'))
nPy_FinalState | int32_t | AsDtype('>i4')
Py_FinalState | float[] | AsJagged(AsDtype('>f4'))
nPz_FinalState | int32_t | AsDtype('>i4')
Pz_FinalState | float[] | AsJagged(AsDtype('>f4'))
In the documentation, it appears I can use the counter_name
argument in mktree
to give the counter branches custom names, but it seems to run into trouble if I try to give them the same name:
outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
def counter_name(in_str: str) -> str:
    if "FinalState" in in_str:
        return "NumFinalState"
    return f"n{in_str}"
outfile.mktree("kin", datatypes, counter_name=counter_name)
This code throws an error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
/var/folders/td/j379rd296477649qvl1k8n180000gn/T/ipykernel_24226/1447106219.py in <module>
5 return "NumFinalState"
6 return f"n{in_str}"
----> 7 outfile.mktree("kin", datatypes, counter_name=counter_name)
/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/writable.py in mktree(self, name, branch_types, title, counter_name, field_name, initial_basket_capacity, resize_factor)
1268 path,
1269 directory._file,
-> 1270 directory._cascading.add_tree(
1271 directory._file.sink,
1272 treename,
/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascade.py in add_tree(self, sink, name, title, branch_types, counter_name, field_name, initial_basket_capacity, resize_factor)
1796 resize_factor,
1797 )
-> 1798 tree.write_anew(sink)
1799 return tree
1800
/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascadetree.py in write_anew(self, sink)
1114 # reference to fLeafCount
1115 out.append(
-> 1116 uproot.deserialization._read_object_any_format1.pack(
1117 datum["counter"]["tleaf_reference_number"]
1118 )
error: required argument is not an integer
and I figure this error is related to uproot trying to make two branches with the same name. Is there any way to get around this? It'll probably be okay to just create a NumFinalState
branch manually, since it gets read in by a subsequent program, but just in terms of compactness, it would be nice to not create a bunch of unnecessary branches.
Answer:
Uproot makes one counter branch for each Awkward Array in the dict it's given. Since your Awkward Arrays are arrays of lists of numbers, they're all presumed to have different counters. There isn't a way to manually force them to share a counter; the way it's supposed to work is to join them all into one Awkward Array, which Uproot will recognize as something that should have one counter.
So suppose you have
>>> import awkward as ak
>>> import uproot
>>> E_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Px_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Py_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Pz_FinalState = ak.Array([[11, 22, 33], [], [44, 55]])
Passing each of these individually into the output TTree will make an nE_FinalState
, nPx_FinalState
, etc., as you've seen. So make them one array with ak.zip:
>>> finalstate = ak.zip({"E": E_FinalState, "px": Px_FinalState, "py": Py_FinalState, "pz": Pz_FinalState})
>>> finalstate
<Array [[{E: 1.1, px: 1.1, ... pz: 55}]] type='3 * var * {"E": float64, "px": fl...'>
>>> print(finalstate.type)
3 * var * {"E": float64, "px": float64, "py": float64, "pz": int64}
The key thing is that the type is now number of entries * var * {record}
, rather than number of entries * var * float64
, individually for each array. It's in the ak.zip
function that you find out whether they really do have the same number of entries. (It's a one-time cross-check; the creation of the finalstate
array is itself zero-copy.)
Now you can use this when writing a TTree:
>>> outfile = uproot.recreate("myData_OUT.root")
>>> outfile["kin"] = {"finalstate": finalstate}
>>> outfile["kin"].show()
name | typename | interpretation
---------------------+--------------------------+-------------------------------
nfinalstate | int32_t | AsDtype('>i4')
finalstate_E | double[] | AsJagged(AsDtype('>f8'))
finalstate_px | double[] | AsJagged(AsDtype('>f8'))
finalstate_py | double[] | AsJagged(AsDtype('>f8'))
finalstate_pz | int64_t[] | AsJagged(AsDtype('>i8'))
The counter_name
and field_name
arguments only control the generation of names. By default, they follow a convention in which the "finalstate"
name in the dict gets prepended by "n"
for the counter and appended by "_"
and the name of each field for the branches (CMS NanoAOD conventions). Those arguments exist so that you can apply a different naming convention, but they don't actually change which counter branches get created. In fact, defining these functions so that they produce the same name might trigger a confusing error message like this—I don't think it's an explicitly checked case.
Oh, and you should also be able to use the uproot.WritableDirectory.mktree constructor (which makes a TTree without data, so that every uproot.WritableTree.extend call can be treated alike). The dict syntax would be
>>> outfile.mktree("kin", {"finalstate": finalstate.type})
<WritableTree '/kin' at 0x7f48ff139550>
>>> outfile["kin"].extend({"finalstate": finalstate})
i.e. use the finalstate.type
, rather than the finalstate
array itself.
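As an aside on the naming functions discussed above: if all you want is a different naming convention (not a different set of counters), something like the following should work, reusing the outfile and finalstate from above. The lambda is a guess that follows the counter_name signature used in the question (a function from branch name to counter name):
>>> outfile = uproot.recreate("myData_OUT.root")
>>> outfile.mktree(
...     "kin",
...     {"finalstate": finalstate.type},
...     counter_name=lambda counted: "Num" + counted,   # counter branch becomes "Numfinalstate"
... )
>>> outfile["kin"].extend({"finalstate": finalstate})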
Question:
I have a dask
-boost_histogram
question. I have a code structure as follows:
I have a class defined in some script:
class MyHist:
    def __init__(....):
        self.bh = None

    def make_hist(...):
        axis = bh.axis.Regular(....)

    @dask.delayed
    def fill_hist(self, data):
        self.bh.fill(data)
and in another script I want to fill multiple histograms in parallel with dask. The data are awkward
arrays that I read from input, and for that I do something like:
from dask.distributed import Client
cl = Client()

histos = [MyHist(..), MyHist(another...)]
fill_results = []
for i, file in enumerate(files):
    data = dask.delayed(open_file(file))
    for myhist in histos:
        if i == 0:
            myhist.make_hist()
        fill_results.append(myhist.fill_hist(data))
dask.compute(*fill_results)
If I then try to call
for j, h in enumerate(histos):
    print(h.bh)
I get empty histograms. However, if I print the boost histogram inside the fill_hist
function, the histograms seem to be filled.
Does the dask
computation create a deep copy or something of the MyHist
object to perform the computation, and hence fill the bh
associated with that copy? Or am I doing something wrong here?
Update:
I see a similar or worse wall-time when using dask to read and fill than when using sequential code. This is the case whether I use my code or the suggested answer. For an example that doesn't use an intermediate class, I've written the following code:
import dask
import awkward as ak
import boost_histogram as bh
from copy import copy

files = get_input_file_paths('myprocess')

@dask.delayed
def make_a_var(jet_pt):
    jets_pt = copy(jet_pt)
    jets_pt = ak.mask(jets_pt, ak.count(jets_pt, axis=1) >= 1)
    return jets_pt[:, 0]*1e-3

@dask.delayed
def make_and_fill(data, axes):
    h = bh.Histogram(*axes, storage=bh.storage.Weight())
    h.fill(data)
    return h

batch_size = 4
results = []
for i in range(0, len(files), batch_size):
    batch = []
    for j, file in enumerate(files[i:i+batch_size]):
        data = dask.delayed(read_file(file))
        var = data['jet_pt']
        new_var = make_a_var(var)
        new_var = new_var.to_numpy()  # needed because bh breaks for masked ak arrays
        new_var = new_var.compressed()
        for k in range(10):
            axes = (bh.axis.Regular(25, 0, 250), )
            h = make_and_fill(new_var, axes)
            batch.append(h)
    results.append(batch)
dask.compute(*results)
It takes a similar amount of wall-time ~7s to run this code sequentially as well as with dask, for k in range(10)
. For k in range(100)
the parallel code takes 15s and sequential takes 21s, which is not as big of an improvement as I would have thought.
Answer:
I believe Jim's comment is correct w.r.t. the source of the problem; I'll also offer a solution I think may be helpful in solving the problem:
I think the definition of your class makes it difficult to work correctly with dask
; that is, you probably will have an easier time if your fill_hist
method was actually a free function. And in your loop you are actually calling dask.delayed
on an already delayed
method (this is likely not what you want to do):
fill_results.append(dask.delayed(myhist.fill_hist(data)))
# ^^^^^^^^^
# already delayed method
My suggestion would be to go with a free function:
@dask.delayed
def fill_hist(data, axes, storage=None):
    storage = storage or bh.storage.Double()
    h = bh.Histogram(*axes, storage=storage)
    h.fill(data)
    return h

@dask.delayed
def open_file(fname):
    data = some_function_to_get_data(fname)
    return data

axes = (bh.axis.Regular(100, -10, 10),)  # tuple with a single axis

tasks = []
for f in files:
    data = open_file(f)
    hist = fill_hist(data=data, axes=axes)
    tasks.append(hist)

results = dask.compute(tasks)
This pattern is very similar to how dask-histogram
works on its backend (and
has support for dask-awkward
!)
Question:
I'm trying to reproduce parts of the Higgs discovery in the Higgs --> 4 leptons channel with open data and making use of awkward
. I can do it when the leptons are the same (e.g. 4 muons) with zip/unzip, but is there a way to do it in the 2 muon/2 electron channel? I started with the example in the HSF tutorial
https://hsf-training.github.io/hsf-training-scikit-hep-webpage/04-awkward/index.html
So I now have the following. First get the input file
curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root
Then I do the following
import numpy as np
import matplotlib.pylab as plt
import uproot
import awkward as ak
# Helper functions
def energy(m, px, py, pz):
    E = np.sqrt((m**2) + (px**2 + py**2 + pz**2))
    return E

def invmass(E, px, py, pz):
    m2 = (E**2) - (px**2 + py**2 + pz**2)
    if m2 < 0:
        m = -np.sqrt(-m2)
    else:
        m = np.sqrt(m2)
    return m

def convert(pt, eta, phi):
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz
####
# Read in the file
infile_name = 'SMHiggsToZZTo4L.root'
infile = uproot.open(infile_name)
# Convert to Cartesian
muon_pt = infile['Events/Muon_pt'].array()
muon_eta = infile['Events/Muon_eta'].array()
muon_phi = infile['Events/Muon_phi'].array()
muon_q = infile['Events/Muon_charge'].array()
muon_mass = infile['Events/Muon_mass'].array()
muon_px,muon_py,muon_pz = convert(muon_pt, muon_eta, muon_phi)
muon_energy = energy(muon_mass, muon_px, muon_py, muon_pz)
# Do the magic
nevents = len(infile['Events/Muon_pt'].array())
# nevents = 1000 # For testing
max_entries = nevents
muons = ak.zip({
"px": muon_px[0:max_entries],
"py": muon_py[0:max_entries],
"pz": muon_pz[0:max_entries],
"e": muon_energy[0:max_entries],
"q": muon_q[0:max_entries],
})
quads = ak.combinations(muons, 4)
mu1, mu2, mu3, mu4 = ak.unzip(quads)
mass_try = (mu1.e + mu2.e + mu3.e + mu4.e)**2 - ((mu1.px + mu2.px + mu3.px + mu4.px)**2 + (mu1.py + mu2.py + mu3.py + mu4.py)**2 + (mu1.pz + mu2.pz + mu3.pz + mu4.pz)**2)
mass_try = np.sqrt(mass_try)
qtot = mu1.q + mu2.q + mu3.q + mu4.q
plt.hist(ak.flatten(mass_try[qtot==0]), bins=100,range=(0,300));
And the histogram looks good!
So how would I do this for 2-electron + 2-muon combinations? I would guess there's a way to make lepton_xxx
arrays? But I'm not sure how to do this elegantly (quickly) such that I could also create a flag to keep track of what the lepton combinations are?
Thanks!
Matt
Answer:
This could be answered in a variety of ways:
- make a union array (mixed data types) of electrons and muons
- make an array of electrons and muons that are the same type, but have a flag to indicate flavor (electron vs muon)
- use ak.combinations with n=2 for the muons, ak.combinations with n=2 again for the electrons, and then combine them with ak.cartesian (and deal with tuples of tuples, rather than one level of tuples, which would mean two calls to ak.unzip)
- break the electron and muon collections down into single-charge collections. You'll want exactly 1 positive muon, 1 negative muon, 1 positive electron, and 1 negative electron, so that would be an ak.cartesian of the four collections.
I'll go with the last method because I've decided that it's easiest.
Another thing you probably want to know about is the Vector library. After
import vector
vector.register_awkward()
you don't have to do explicit coordinate transformations or mass calculations. I'll be using that. Here's how I read in the data:
infile = uproot.open("/tmp/SMHiggsToZZTo4L.root")
muon_branch_arrays = infile["Events"].arrays(filter_name="Muon_*")
electron_branch_arrays = infile["Events"].arrays(filter_name="Electron_*")
muons = ak.zip({
"pt": muon_branch_arrays["Muon_pt"],
"phi": muon_branch_arrays["Muon_phi"],
"eta": muon_branch_arrays["Muon_eta"],
"mass": muon_branch_arrays["Muon_mass"],
"charge": muon_branch_arrays["Muon_charge"],
}, with_name="Momentum4D")
electrons = ak.zip({
"pt": electron_branch_arrays["Electron_pt"],
"phi": electron_branch_arrays["Electron_phi"],
"eta": electron_branch_arrays["Electron_eta"],
"mass": electron_branch_arrays["Electron_mass"],
"charge": electron_branch_arrays["Electron_charge"],
}, with_name="Momentum4D")
And this reproduces your plot:
quads = ak.combinations(muons, 4)
quad_charge = quads["0"].charge + quads["1"].charge + quads["2"].charge + quads["3"].charge
mu1, mu2, mu3, mu4 = ak.unzip(quads[quad_charge == 0])
plt.hist(ak.flatten((mu1 + mu2 + mu3 + mu4).mass), bins=100, range=(0, 200));
The quoted number slices (e.g. "0"
and "1"
) are picking tuple fields, rather than array entries; it's a manual ak.unzip. (The fields could have had real names if I had given a fields
argument to ak.combinations.)
For the two muons, two electrons case, let's make four distinct collections.
muons_plus = muons[muons.charge > 0]
muons_minus = muons[muons.charge < 0]
electrons_plus = electrons[electrons.charge > 0]
electrons_minus = electrons[electrons.charge < 0]
The ak.combinations function (with default axis=1
) returns the Cartesian product of each list in the array of lists with itself, excluding an item with itself, (x, x)
(unless you want that, specify replacement=True
), and excluding one of each symmetric pair (x, y)
/(y, x)
.
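As a toy illustration of those defaults (the values here are made up):
>>> import awkward as ak
>>> a = ak.Array([[1, 2, 3], [], [4, 5]])
>>> ak.combinations(a, 2).tolist()                    # no (x, x), and no (y, x) duplicates
[[(1, 2), (1, 3), (2, 3)], [], [(4, 5)]]
>>> ak.combinations(a, 2, replacement=True).tolist()  # now the (x, x) pairs are included
[[(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)], [], [(4, 4), (4, 5), (5, 5)]]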
If you want just a plain Cartesian product of lists from one array with lists from another array, that's ak.cartesian. The four collections muons_plus
, muons_minus
, electrons_plus
, electrons_minus
are non-overlapping and we want each four-lepton group to have exactly one from each, so that's a plain Cartesian product.
mu1, mu2, e1, e2 = ak.unzip(ak.cartesian([muons_plus, muons_minus, electrons_plus, electrons_minus]))
plt.hist(ak.flatten((mu1 + mu2 + e1 + e2).mass), bins=100, range=(0, 200));
Separating particles by flavor (electron vs muon) but not by charge is an artifact of the typing constraints imposed by C++. Electrons are measured in different ways from muons, so they have different attributes in the dataset. In a statically typed language like C++, we couldn't put them in the same collection because they differ by type (have different attributes). But charges only differ by value (an integer), so there was no reason they had to be in different collections.
But now, the only thing distinguishing a type is (a) what attributes the objects have and (b) what objects with that set of attributes are named. Here, I named them "Momentum4D"
because that allowed Vector to recognize them as Lorentz vectors and give them Lorentz vector methods. But they could have had other attributes (e.g. e_over_p
for electrons, but not for muons). charge
is a non-Momentum4D attribute.
So if we're going to the trouble to put different-flavor leptons in different arrays, why not put different-charge leptons in different arrays, too? Physically, flavor and charge are both particle properties at the same level of distinction, so we probably want to do the analysis that way. No reason to let a C++ type distinction get in the way of the analysis logic!
Also, you probably want
nevents = infile["Events"].num_entries
to get the number of entries from a TTree without reading the array data from it.
URL: https://stackoverflow.com/questions/72187937/writing-trees-number-of-baskets-and-compression-uproot
Question:
I am trying to optimize the way trees are written in PyROOT and came across uproot. In the end my application should write events (consisting of arrays), which are continuously coming in, to a tree.
The first approach is the classic way:
import array
import ROOT

event = [1., 2., 3.]

f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
pt = array.array('f', [0.]*3)
tree.Branch("pt", pt, "pt[3]/F")

# for loop to simulate incoming events
for _ in range(10000):
    for i, element in enumerate(event):
        pt[i] = element
    tree.Fill()

tree.Print()
tree.Write("", ROOT.TObject.kOverwrite)
f.Close()
This gives the following Tree and execution time:
Trying to do it with uproot my code looks like this:
import numpy as np
import awkward as ak
import uproot

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
which gives the following tree:
So the uproot method takes much longer, the file size is much bigger, and each event gets a separate basket. I tried out different compression settings but that did not change anything. Any idea on how to optimize this? Is this even a sensible use case for uproot, and can the process of writing trees be sped up in comparison to the first way of doing it?
Answer:
The extend
method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]
) each, that's a maximally inefficient use.
Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.
For instance, starting with your example,
import time
import uproot
import numpy as np
import awkward as ak
np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)
The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 1170660 bytes File Size = 2970640 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 1170323 bytes File Size = 970000 *
*Baskets : 10000 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):
np_array = np.array([[1, 2, 3]] * 10000)
(This isn't necessarily how you would get np_array
, since * 10000
makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
Now we do the write with a single call to extend
, which makes a single TBasket:
np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 240913 bytes File Size = 3069 *
* : : Tree compression factor = 107.70 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 240576 bytes File Size = 2229 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 107.70 *
*............................................................................*
So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
By comparison, if we do entry-by-entry writing with PyROOT,
import time
import array
import ROOT

data = [1, 2, 3]
holder = array.array("q", [0]*3)

file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")

starttime = time.time()

for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x
    tree.Fill()

tree.Write("", ROOT.TObject.kOverwrite)
file.Close()

print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:
******************************************************************************
*Tree :tree : An Example Tree *
*Entries : 10000 : Total = 241446 bytes File Size = 3521 *
* : : Tree compression factor = 78.01 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 241087 bytes File Size = 3084 *
*Baskets : 8 : Basket Size= 32000 bytes Compression= 78.01 *
*............................................................................*
So, PyROOT is 30× slower here, but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush
parameters.
Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.
But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.
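If the incoming events really do exceed memory, a chunked variant of the same idea might look like the following sketch. The chunk size and the make_chunk helper are placeholders for however your events actually arrive (they are not from the question); each extend call then writes one reasonably sized TBasket.
import numpy as np
import uproot

chunk_size = 1_000_000   # tune so that each chunk is "large enough" (e.g. ~100 MB or more)

def make_chunk(size):
    # stand-in for collecting `size` incoming events into one array
    return np.tile(np.array([[1.0, 2.0, 3.0]]), (size, 1))

with uproot.recreate("chunked.root") as fout:
    fout["tree"] = {"branch": make_chunk(chunk_size)}           # the first chunk creates the TTree
    for _ in range(9):                                          # nine more chunks, one TBasket each
        fout["tree"].extend({"branch": make_chunk(chunk_size)})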
URL: https://stackoverflow.com/questions/69845177/convert-array-of-varying-sized-arrays-to-numpy-array
Question:
I am working with a root file (array of arrays). When I load the array into python, I get an awkward array since this is an array of arrays of varying sizes. I would like to learn how to convert this to a numpy array of arrays of the same size, by populating empty elements with NaNs. How can I convert an awkward array of varying size to a numpy array?
Answer:
Suppose that you have an array of variable-length lists a
:
>>> import numpy as np
>>> import awkward as ak
>>> a = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> a
<Array [[0, 1, 2], [], ... [5], [6, 7, 8, 9]] type='5 * var * int64'>
The function that makes all lists have the same size is ak.pad_none. But first, we need a size to pad it to. We can get the length of each list with ak.num and then take the np.max of that.
>>> ak.num(a)
<Array [3, 0, 2, 1, 4] type='5 * int64'>
>>> desired_length = np.max(ak.num(a))
>>> desired_length
4
Now we can pad it and convert that into a NumPy array (because it now has rectangular shape).
>>> ak.pad_none(a, desired_length)
<Array [[0, 1, 2, None], ... [6, 7, 8, 9]] type='5 * var * ?int64'>
>>> ak.to_numpy(ak.pad_none(a, desired_length))
masked_array(
data=[[0, 1, 2, --],
[--, --, --, --],
[3, 4, --, --],
[5, --, --, --],
[6, 7, 8, 9]],
mask=[[False, False, False, True],
[ True, True, True, True],
[False, False, True, True],
[False, True, True, True],
[False, False, False, False]],
fill_value=999999)
The missing values (None
) are converted into a NumPy masked array. If you want a plain NumPy array, you can ak.fill_none to give them a replacement value.
>>> ak.to_numpy(ak.fill_none(ak.pad_none(a, desired_length), 999))
array([[ 0, 1, 2, 999],
[999, 999, 999, 999],
[ 3, 4, 999, 999],
[ 5, 999, 999, 999],
[ 6, 7, 8, 9]])