-
Notifications
You must be signed in to change notification settings - Fork 296
Description
This issue is to reintroduce Timeline (#849) and keep track of some related work.
What is the timeline?
The timeline is a log file that makes system performance more transparent.
Each log entry has a timestamp with cycle (~nanosecond) precision, a short message, a long explanation, some optional counter values, and other relevant details.
The log level is dynamically adjusted for each iteration of the event loop (breath): most breaths are not logged, many are logged at a high level, and some are logged with fine detail. The log therefore provides samples of operation in units of whole breaths.
The timeline will typically contain one million events recorded over the past few minutes.
Who is it for?
The timeline is intended as a common reference for many different people:
- Software developers optimizing their applications.
- Network engineers troubleshooting production deployments.
- Test engineers performing system stress tests.
- Infrastructure engineers providing logs from automatic tests.
Having a common tool will hopefully make it easier for these people to communicate with each other.
I think of the timeline like the flight recorder ("black box") in an airplane. It is a buffer containing data about recent operation and it is robust enough to survive a crash. If a system crashes or behaves strangely during a test run then the first step for every interested party would be to review the timeline and compare their observations.
Example
Here is an excerpt from the timeline file showing a breath of the snabbmark basic1
synthetic benchmark:
numa core cycles category message
0 2 2540 engine start breath(3129): total packets(1595790) bytes(95747400) etherbits(880876080)
0 2 252 engine initialized breath
0 2 11608 engine called app Source (type: apps.basic.basic_apps)
0 2 540 engine pulled input packets
0 2 680 engine called app Sink (type: apps.basic.basic_apps)
0 2 588 engine called app Sink (type: apps.basic.basic_apps)
0 2 16888 engine called app Tee (type: apps.basic.basic_apps)
0 2 184 engine pushed output packets
0 2 9084 engine called app Sink (type: apps.basic.basic_apps)
0 2 512 engine called app Sink (type: apps.basic.basic_apps)
0 2 280 engine pushed output packets
0 2 176 engine pushed output packets
0 2 208 engine end breath(3129): processed packets(510) bytesPerPacket(60)
The first message shows the breath starting along with the total number of packets/bytes/bits that have been processed. The last message shows the breath ending along with the number of packets processed and their average size. The messages in between show more detail: in this each callback made to an app in the engine configuration. The cycle count is shown as a delta compared with the previous log entry, which roughly estimates how long each event took to complete. The currently scheduled CPU core and NUMA node are also included as additional details.
See also longer example in which events are shown from newest to oldest.
Status
The timeline is currently living on a couple of branches:
- snabbco/timeline contains the core logging mechanism (doc, source).
- lukego/timeline-engine contains event definitions and instrumentation for the engine.
next action: @lukego needs to submit a version for upstreaming that is unlikely to cause problems for people. This requires ensuring the overhead is negligible and perhaps changing default settings e.g. smaller log file or even logging disabled by default.
Grand vision and parallel work
The timeline is one part of a grand vision for making it easy and productive to optimize and troubleshoot Snabb applications.
More related work:
- Getting our Hydra CI system (hydra.snabb.co) to automatically execute large benchmarking campaigns and publish both metrics and timeline logs. @domenkozar is leading this effort.
- Getting Hydra to also generate Rmarkdown reports (early draft example) to summarize and compare benchmark campaigns and highlight bad results that may indicate bugs. @lukego and @domenkozar are working on this.
- Creating an interactive GUI (early example screenshots) for exploring timeline files and "finding the needles in the million-event haystack." Alexandre Bergel's team at ObjectProfile will pick up @lukego's Snabb Data Studio prototype and take the lead on this.
- <insert your own timeline-related initiative here :-)>
More indirectly, I hope that timeline will provide the framework for writing fast applications. How do you optimize your application? Break it into pieces that each run for around a thousand CPU cycles and can easily be measured with timeline. This would have the double benefit of making system performance transparent and also dividing the work into the small loops that CPUs and tracing JITs love. We shall see!