====== Cyclictest - SMI detection with function tracing ======
One way to identify where [[realtime:documentation:howto:debugging:smi-latency:start|SMI]]s happen is to look for inexplicably long gaps in a kernel function trace, as these usually indicate that an abnormally long amount of time was spent handling an SMI. [[realtime:documentation:howto:tools:cyclictest:start|Cyclictest]] can be used to produce a trace which can then be analyzed to find latencies that could be caused by SMIs. This trace analyzing technique is interesting because it can be used to identify firmware or hardware related latencies on any architecture that supports tracing, not just SMIs on x86 processors.
This page explains how to correctly produce a function trace using Cyclictest and how to identify SMI related latencies in the trace. Additional details that clarify some of the topics mentioned more briefly on this page can be found in the general documentation about debugging using Cyclictest [[realtime:documentation:howto:tools:cyclictest:tracing|here]].
===== Importance =====
Analyzing a function trace that contains the problematic latency, as well as the kernel code that is executing when the latency happens, is an important step in debugging it. Latencies caused by the kernel are much more common than latencies caused by firmware or hardware, so the possibility of a kernel related latency should be eliminated before any further investigation. On the other hand, if a latency is indeed SMI related then knowing the function calls that are made around the SMI can help determine why it is triggered.
===== Setup =====
Before tracing the kernel to identify latencies, it is important to make sure that the CPUs will never go idle. When a CPU enters an idle state there can be long periods of time where the processor effectively does nothing. These idle periods show up in function traces as additional latencies. Since the source of these latencies is already known, having them in the trace does not add information that would help debug other latencies. Instead of having to filter these irrelevant latencies during the trace analysis, it is easier to just avoid having them in the trace in the first place by preventing the CPUs from going idle. Some simple ways of keeping the CPUs busy are described [[realtime:documentation:howto:debugging:no-cpu-idle|here]].
===== Tracing =====
Once the CPUs are sure to never idle, a trace can be obtained using Cyclictest with the break trace and the function tracing options. The break trace option (--breaktrace) will stop tracing and end Cyclictest if it detects a latency that is longer than a specified limit. This option is intended to be used with one of the Cyclictest options that enables a specific Ftrace tracer. For this SMI detection technique, the function tracing option (--ftrace) is used so that the trace includes all the kernel function calls.
The steps for obtaining a relevant trace using Cyclictest are explained in detail [[realtime:documentation:howto:tools:cyclictest:tracing|here]]. Since the desired tracer (function tracer) has already been selected, the section on the linked page about selecting instrumentation can be skipped.
When using the breaktrace option along with the [[realtime:documentation:howto:tools:cyclictest:options:smi-count|SMI count]] option, the SMI count shown in the Cyclictest output may not increment even if the tracing was stopped after a latency caused by an SMI. This is probably because Cyclictest is stopped before the visual output of the SMI count is updated.
To make sure that the tracing stopped at the correct time, look at the break value specified below the Cyclictest output after the test stops. This value should correspond to the expected SMI related latency value even if the SMI count on the Cyclictest output is not updated. The latency should also have occurred in the same conditions as the expected SMI latency.
===== Analysis =====
Once a trace that includes the correct latency is obtained, it needs to be analyzed critically in order to determine the latencies that could be caused by SMIs. First, the latencies in the trace must be identified. Because the trace contains every function call made by the kernel, the timestamps for consecutive tracepoints should only be a few microseconds apart. In theory, any gap between function calls that is longer than a few microseconds could potentially indicate a problem, but it is generally best to start by finding and analyzing the longest latencies in the trace.
==== Identifying all long latencies ====
The longer latencies in the trace should have certain special characters next to their timestamps. For example, a latency of more than 50 us is indicated with a '+' and a latency of more than 100 us is indicated with a '!'. In the following trace excerpt there is an exclamation mark next to the timestamp of the second function call because the second and the third native_io_delay calls are more than 100 us apart.
bash-960 0d...111 38067191us : native_io_delay <-pcspkr_event
bash-960 0d...111 38067200us!: native_io_delay <-pcspkr_event
bash-960 0d...111 38067346us : native_io_delay <-pcspkr_event
The second call was made at 38067200 us and the third call was at 38067346 us, so there was an unusually long 146 us delay between the two calls. The Ftrace documentation can be consulted to see the characters that indicate even longer latencies.
==== Isolating SMI related latencies ====
The next step in the trace analysis is to distinguish the unsuspicious long latencies from the long latencies that could be caused by SMIs. In other words, eliminate all the long latencies that can be explained by something. To do this, analyze the situation in which the latency occurs and have a look at the function code that was executing when the latency happened. Can anything explain the latency? Some possible latency causes could be that the CPU is idle, that uninstrumented userspace code is executing, that the executing kernel function disables interrupts and then does something that takes a long time, etc. If the latency cannot be explained by the situation or by the code and the OS appears to lose the CPU for no reason, then the latency is probably caused by an SMI.