Wiki: [[https://perf.wiki.kernel.org/index.php/Main_Page|https://perf.wiki.kernel.org/]]
Mentor contacts: [[https://sites.google.com/site/rogersemail/home|Ian Rogers]] <irogers+gsoc24 at google dot com>, Namhyung Kim <namhyung at kernel.org>, Arnaldo Carvalho de Melo <acme at kernel.org>

===== Qualities of a good proposal =====
==== Bring your own proposal ====

  * complexity: intermediate or hard
  * duration: small, medium or large
  * requirements: a machine to work and test on, typically a bare metal (i.e. not cloud) Linux machine. C programming, possibly other languages if interested in things like Rust integration.
If you have your own ideas for how tracing and profiling can be improved in the Linux kernel and the perf tool, they are welcome. Some areas that have been brought up in the last year are [[https://lore.kernel.org/linux-perf-users/87o85ftc3p.fsf@smart-cactus.org/|better support for more programming languages]] and [[https://lore.kernel.org/all/20211129231830.1117781-1-namhyung@kernel.org/|new profiling commands like function latency measuring]]. The perf tool is full of metrics like flops and memory bandwidth, so adding a [[https://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf|roofline model]] to determine the bottlenecks of an application would be a possibility.
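For the roofline idea, the model itself is just a bound: attainable GFLOP/s is the minimum of the peak compute rate and memory bandwidth times arithmetic intensity. The sketch below only illustrates that arithmetic; the function name and the machine numbers are made up, and a perf implementation would derive them from events and metrics instead.

<code c>
/* Illustrative only: the roofline bound is
 * min(peak GFLOP/s, memory bandwidth * arithmetic intensity). */
#include <stdio.h>

static double roofline_gflops(double peak_gflops, double bw_gbps,
			      double arith_intensity /* flops per byte */)
{
	double mem_bound = bw_gbps * arith_intensity;

	return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
	/* A kernel doing 0.25 flops/byte on a 100 GB/s, 500 GFLOP/s
	 * machine is memory bound at 25 GFLOP/s. */
	printf("attainable: %.1f GFLOP/s\n",
	       roofline_gflops(500.0, 100.0, 0.25));
	return 0;
}
</code>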

==== New Performance Monitoring Unit (PMU) kernel drivers ====
+ | |||
+ | * complexity: intermediate or hard | ||
+ | * duration: small, medium or large | ||
+ | * requirements: machine to work and test on, typically a bare metal (ie not cloud) Linux machine, test hardware for the PMU you are working on. C programming. | ||
Have a computer you really love but can only query the core CPU's PMU? Are there data sheets describing performance monitoring counters that could be exposed through the perf event API? Why not work on adding a PMU driver to the Linux kernel to expose those performance counters, or even more advanced features like sampling? Drivers can be added for accelerators, GPUs, data buses, caches, etc. For example, the Raspberry Pi 5 has performance counters only for its core CPUs and not for things like its memory bus.
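To make the scope concrete, here is a minimal sketch of what registering a new PMU looks like on the kernel side. The ''mybus'' device and its counter are hypothetical; ''struct pmu'' and ''perf_pmu_register()'' are the real interfaces, but an actual driver also has to handle counter allocation, overflow interrupts, and sysfs event/format attributes.

<code c>
/* Sketch of an uncore-style counting PMU; the "mybus" hardware is made up. */
#include <linux/module.h>
#include <linux/perf_event.h>

static void mybus_pmu_read(struct perf_event *event)
{
	/* Read the hardware counter (e.g. an MMIO register) and add the delta. */
	u64 now = 0 /* e.g. readq() of the device's counter register */;

	local64_add(now - local64_xchg(&event->hw.prev_count, now),
		    &event->count);
}

static void mybus_pmu_start(struct perf_event *event, int flags)
{
	/* Program and enable the hardware counter here. */
	event->hw.state = 0;
}

static void mybus_pmu_stop(struct perf_event *event, int flags)
{
	mybus_pmu_read(event);
	event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
}

static int mybus_pmu_add(struct perf_event *event, int flags)
{
	if (flags & PERF_EF_START)
		mybus_pmu_start(event, flags);
	return 0;
}

static void mybus_pmu_del(struct perf_event *event, int flags)
{
	mybus_pmu_stop(event, PERF_EF_UPDATE);
}

static int mybus_pmu_event_init(struct perf_event *event)
{
	/* Only claim events aimed at this PMU; no sampling support yet. */
	if (event->attr.type != event->pmu->type)
		return -ENOENT;
	if (is_sampling_event(event))
		return -EINVAL;
	return 0;
}

static struct pmu mybus_pmu = {
	.task_ctx_nr	= perf_invalid_context,	/* system-wide counting only */
	.event_init	= mybus_pmu_event_init,
	.add		= mybus_pmu_add,
	.del		= mybus_pmu_del,
	.start		= mybus_pmu_start,
	.stop		= mybus_pmu_stop,
	.read		= mybus_pmu_read,
};

static int __init mybus_pmu_init(void)
{
	/* Shows up as /sys/bus/event_source/devices/mybus */
	return perf_pmu_register(&mybus_pmu, "mybus", -1);
}

static void __exit mybus_pmu_exit(void)
{
	perf_pmu_unregister(&mybus_pmu);
}

module_init(mybus_pmu_init);
module_exit(mybus_pmu_exit);
MODULE_LICENSE("GPL");
</code>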

==== Improved Python integration ====
+ | |||
+ | * complexity: intermediate or hard | ||
+ | * duration: small, medium or large | ||
+ | * requirements: machine to work and test on, typically a bare metal (ie not cloud) Linux machine. C and/or Python programming. | ||
A lot of what makes the perf tool useful is its user interface; however, writing user interfaces in C is tedious and error prone. Python support is long established within the perf tool but it could use some TLC, and there are several examples of work that needs doing.

==== Scalability and speed ====
+ | |||
+ | * complexity: intermediate or hard | ||
+ | * duration: small, medium or large | ||
+ | * requirements: machine to work and test on, typically a bare metal (ie not cloud) Linux machine. C programming, multi-threading/pthread library. | ||
The perf tool is largely single threaded even though it sometimes needs to do something on every CPU in the system. This is embarrassingly parallel but the tool isn't exploiting it. Work was done to create a [[https://lore.kernel.org/lkml/3c4f8dd64d07373d876990ceb16e469b4029363f.camel@gmail.com/|work pool mechanism]] but it was not merged due to latent bugs in memory management. [[https://perf.wiki.kernel.org/index.php/Reference_Count_Checking|Address sanitizer and reference count checking]] have since solved those problems, but the work pool code still needs to be integrated.
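As a rough illustration of the pattern (not the proposed work pool API itself), the sketch below fans per-CPU work out to one pthread per online CPU; a real work pool would bound the number of threads and queue work items instead of spawning one thread per CPU.

<code c>
/* Per-CPU work farmed out to pthreads; build with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct cpu_work {
	int cpu;	/* CPU this worker is responsible for */
	int result;	/* whatever the per-CPU operation produces */
};

static void *do_cpu_work(void *arg)
{
	struct cpu_work *w = arg;

	/* Placeholder for the real per-CPU operation, e.g. opening and
	 * configuring events on w->cpu. */
	w->result = w->cpu;
	return NULL;
}

int main(void)
{
	int nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (nr_cpus < 1)
		nr_cpus = 1;

	pthread_t threads[nr_cpus];
	struct cpu_work work[nr_cpus];

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		work[cpu].cpu = cpu;
		pthread_create(&threads[cpu], NULL, do_cpu_work, &work[cpu]);
	}
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		pthread_join(threads[cpu], NULL);
		printf("cpu %d -> %d\n", work[cpu].cpu, work[cpu].result);
	}
	return 0;
}
</code>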
Another improvement is that currently the ''perf report'' command processes an entire perf.data file before providing a visualization. This can be slow for large perf.data files. In contrast, the ''perf top'' command gathers data in the background while providing a visualization. Breaking apart the ''perf report'' command so that processing is performed on a background thread, with the visualization periodically refreshing in the foreground, would mean that the user can at least do something during a slow load.
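The sketch below shows the general shape of that change; it is not perf's UI code, just a background thread that keeps processing while the foreground periodically refreshes with whatever has been loaded so far.

<code c>
/* Background processing with periodic foreground refresh; build with -lpthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_long samples_processed = 0;
static atomic_bool done = false;

static void *process_samples(void *arg)
{
	/* Stand-in for reading and processing a large perf.data file. */
	for (int i = 0; i < 1000000; i++)
		atomic_fetch_add(&samples_processed, 1);
	atomic_store(&done, true);
	return NULL;
}

int main(void)
{
	pthread_t loader;

	pthread_create(&loader, NULL, process_samples, NULL);

	/* The "UI": show whatever is available so far instead of blocking
	 * until the whole file has been processed. */
	while (!atomic_load(&done)) {
		printf("\rloaded %ld samples", atomic_load(&samples_processed));
		fflush(stdout);
		usleep(100000);
	}
	printf("\rloaded %ld samples (done)\n", atomic_load(&samples_processed));

	pthread_join(loader, NULL);
	return 0;
}
</code>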
One more improvement would be to reduce the number of file descriptors used by ''perf record'' with the ''--threads'' option. Currently it needs a couple of pipes per worker thread for communication; that could be greatly reduced by using eventfd(2) instead of a pipe pair for each thread.
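A minimal sketch of that idea, independent of the perf code base: the workers below signal the parent through a single eventfd rather than one pipe pair each, so the parent holds one file descriptor instead of two per thread.

<code c>
/* N workers signalling completion over one eventfd; build with -lpthread. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NR_WORKERS 4

static int efd;

static void *worker(void *arg)
{
	uint64_t one = 1;

	/* Do the per-thread work here, then bump the eventfd counter. */
	write(efd, &one, sizeof(one));
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_WORKERS];
	uint64_t done = 0, val;

	efd = eventfd(0, 0);

	for (int i = 0; i < NR_WORKERS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);

	/* Each read drains the counter: the sum of all writes so far. */
	while (done < NR_WORKERS) {
		read(efd, &val, sizeof(val));
		done += val;
	}

	for (int i = 0; i < NR_WORKERS; i++)
		pthread_join(threads[i], NULL);

	printf("all %llu workers signalled via one fd\n",
	       (unsigned long long)done);
	close(efd);
	return 0;
}
</code>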
+ | |||
+ | |||
+ | ==== Data type profiling ==== | ||
+ | |||
+ | * complexity: intermediate or hard | ||
+ | * duration: medium or large | ||
+ | * requirements: physical machine to work and test on (Intel recommended). C programming, understanding DWARF format is a plus. | ||
+ | |||
+ | Data type profiling is a new technique to show memory access profiles with type information. See [[https://lwn.net/Articles/955709/ | LWN article]] for more detail. It's still in the early stage and has a lot of room for improvement. For example, it needs to support C++ and other languages, better integration to other perf commands like ''annotate'' and ''c2c'', performance optimization, other architecture support and so on. It'd be ok if you're not familiar with ELF or DWARF format. | ||
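As a purely illustrative (made up) example of the kind of workload this targets: an instruction-level profile only shows a hot load, while a data type profile can attribute the memory-access samples to the type and member being touched, pointing at the layout problem.

<code c>
/* Illustrative workload: most cache misses would be attributed to
 * "struct record" and its "key" member, not just an instruction address. */
#include <stdlib.h>

struct record {
	long key;		/* hot: read on every iteration */
	char payload[120];	/* cold: rarely touched, but inflates the struct */
};

long sum_keys(const struct record *recs, size_t n)
{
	long sum = 0;

	/* Each iteration pulls in a whole 128-byte record to read 8 bytes,
	 * the kind of layout issue a type-level profile makes visible. */
	for (size_t i = 0; i < n; i++)
		sum += recs[i].key;
	return sum;
}

int main(void)
{
	struct record *recs = calloc(1 << 20, sizeof(*recs));
	long sum = sum_keys(recs, 1 << 20);

	free(recs);
	return sum != 0;	/* keep the loop from being optimized away */
}
</code>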
+ | |||
+ | ==== perf trace and BTF ==== | ||
+ | |||
+ | * complexity: intermediate or hard | ||
+ | * duration: medium or large | ||
+ | * requirements: machine to work and test on. C programming, BPF | ||
+ | |||
+ | ''perf trace'' is similar to ''strace'' but much performant since it doesn't use ptrace. So it needs to capture and understand the format of syscall arguments as ''strace'' does. Right now, it has to build a list of format to pretty-print the syscall args. But we find it limited and manual work. Instead, it could use BTF (BPF type format) which has all the type information and is available in the (most) kernel. | ||
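As a sketch of what a BTF-based approach could build on, the program below walks a kernel struct's members straight out of vmlinux BTF using libbpf. It assumes libbpf is installed and the kernel was built with BTF; ''struct open_how'' (the openat2(2) argument) is just an example type. Build with ''-lbpf''.

<code c>
/* Dump the members of a kernel struct from vmlinux BTF. */
#include <stdio.h>
#include <bpf/btf.h>

int main(void)
{
	struct btf *btf = btf__load_vmlinux_btf();

	if (!btf) {
		fprintf(stderr, "no vmlinux BTF available\n");
		return 1;
	}

	__s32 id = btf__find_by_name_kind(btf, "open_how", BTF_KIND_STRUCT);

	if (id < 0) {
		fprintf(stderr, "struct open_how not found in BTF\n");
		btf__free(btf);
		return 1;
	}

	const struct btf_type *t = btf__type_by_id(btf, id);
	const struct btf_member *m = btf_members(t);

	/* Member names and types are what a BTF-based perf trace could use
	 * to pretty-print the syscall argument instead of a hand-written
	 * format table. */
	for (int i = 0; i < btf_vlen(t); i++)
		printf("open_how.%s\n", btf__name_by_offset(btf, m[i].name_off));

	btf__free(btf);
	return 0;
}
</code>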