Skip to content

Timeline: Performance analysis for everybody #916

@lukego

Description

@lukego

This issue is to reintroduce Timeline (#849) and keep track of some related work.

What is the timeline?

The timeline is a log file that makes system performance more transparent.

Each log entry has a timestamp with cycle (~nanosecond) precision, a short message, a long explanation, some optional counter values, and other relevant details.

The log level is dynamically adjusted for each iteration of the event loop (breath): most breaths are not logged, many are logged at a high level, and some are logged with fine detail. The log therefore provides samples of operation in units of whole breaths.

The timeline will typically contain one million events recorded over the past few minutes.

Who is it for?

The timeline is intended as a common reference for many different people:

  • Software developers optimizing their applications.
  • Network engineers troubleshooting production deployments.
  • Test engineers performing system stress tests.
  • Infrastructure engineers providing logs from automatic tests.

Having a common tool will hopefully make it easier for these people to communicate with each other.

I think of the timeline like the flight recorder ("black box") in an airplane. It is a buffer containing data about recent operation and it is robust enough to survive a crash. If a system crashes or behaves strangely during a test run then the first step for every interested party would be to review the timeline and compare their observations.

Example

Here is an excerpt from the timeline file showing a breath of the snabbmark basic1 synthetic benchmark:

numa core       cycles category         message
0    2            2540 engine           start breath(3129): total packets(1595790) bytes(95747400) etherbits(880876080)
0    2             252 engine           initialized breath
0    2           11608 engine           called app Source (type: apps.basic.basic_apps)
0    2             540 engine           pulled input packets
0    2             680 engine           called app Sink (type: apps.basic.basic_apps)
0    2             588 engine           called app Sink (type: apps.basic.basic_apps)
0    2           16888 engine           called app Tee (type: apps.basic.basic_apps)
0    2             184 engine           pushed output packets
0    2            9084 engine           called app Sink (type: apps.basic.basic_apps)
0    2             512 engine           called app Sink (type: apps.basic.basic_apps)
0    2             280 engine           pushed output packets
0    2             176 engine           pushed output packets
0    2             208 engine           end breath(3129): processed packets(510) bytesPerPacket(60)

The first message shows the breath starting along with the total number of packets/bytes/bits that have been processed. The last message shows the breath ending along with the number of packets processed and their average size. The messages in between show more detail: in this each callback made to an app in the engine configuration. The cycle count is shown as a delta compared with the previous log entry, which roughly estimates how long each event took to complete. The currently scheduled CPU core and NUMA node are also included as additional details.

See also longer example in which events are shown from newest to oldest.

Status

The timeline is currently living on a couple of branches:

next action: @lukego needs to submit a version for upstreaming that is unlikely to cause problems for people. This requires ensuring the overhead is negligible and perhaps changing default settings e.g. smaller log file or even logging disabled by default.

Grand vision and parallel work

The timeline is one part of a grand vision for making it easy and productive to optimize and troubleshoot Snabb applications.

More related work:

More indirectly, I hope that timeline will provide the framework for writing fast applications. How do you optimize your application? Break it into pieces that each run for around a thousand CPU cycles and can easily be measured with timeline. This would have the double benefit of making system performance transparent and also dividing the work into the small loops that CPUs and tracing JITs love. We shall see!

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions