NAPI (“New API”) is an extension to the device driver packet processing framework, designed to improve the performance of high-speed networking. NAPI works through:

Interrupt mitigation: high-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: when the system is overwhelmed and must drop packets, it is better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.
New drivers should use NAPI if the hardware can support it. However, NAPI additions to the kernel do not break backward compatibility and drivers may still process completions directly in interrupt context if necessary.
The following is a whirlwind tour of what must be done to create a NAPI-compliant network driver.
For each interrupt vector, the driver must allocate an instance of struct napi_struct. This does not require calling any special function; the structure is typically embedded in the driver's private structure. Each napi_struct must be initialised and registered before the net device itself, using netif_napi_add(), and unregistered after the net device, using netif_napi_del().
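A minimal skeleton of this setup might look as follows; my_priv, my_probe, my_remove and my_poll are illustrative names, not part of any real driver:

```c
#include <linux/netdevice.h>

/* Hypothetical driver-private structure with an embedded NAPI context
 * (one per interrupt vector). */
struct my_priv {
	struct napi_struct napi;
	/* ... other per-device state ... */
};

static int my_poll(struct napi_struct *napi, int budget);

static int my_probe(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* Register the NAPI instance before the net device itself. */
	netif_napi_add(dev, &priv->napi, my_poll, 16);
	return register_netdev(dev);
}

static void my_remove(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	unregister_netdev(dev);
	/* Unregister NAPI after the net device has been unregistered. */
	netif_napi_del(&priv->napi);
}
```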
The next step is to make some changes to your driver's interrupt handler. If your driver has been interrupted because a new packet is available, that packet should not be processed at that time. Instead, your driver should disable any further “packet available” interrupts and tell the networking subsystem to poll your driver shortly to pick up all available packets. Disabling interrupts, of course, is a hardware-specific matter between the driver and the adaptor. Arranging for polling is done with a call to:
void napi_schedule(struct napi_struct *napi);
An alternative form you'll see in some drivers is:
if (napi_schedule_prep(napi))
        __napi_schedule(napi);
The end result is the same either way. (If napi_schedule_prep() returns zero, it means that a poll was already scheduled, and you should not have received another interrupt.)
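An interrupt handler following this pattern might be sketched as below; rx_event_pending() and disable_rx_irq() are hypothetical stand-ins for whatever hardware-specific status check and interrupt masking your adaptor requires:

```c
static irqreturn_t my_interrupt(int irq, void *dev_id)
{
	struct my_priv *priv = dev_id;

	if (rx_event_pending(priv)) {		/* hypothetical status check */
		disable_rx_irq(priv);		/* hardware-specific masking */
		napi_schedule(&priv->napi);	/* ask for a poll() call soon */
	}
	return IRQ_HANDLED;
}
```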
The next step is to create a poll() method for your driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:
int (*poll)(struct napi_struct *napi, int budget);
The poll() function should process all available incoming packets, much as your interrupt handler might have done in the pre-NAPI days. There are some exceptions, however. Packets should not be passed to netif_rx(); instead, feed them to the kernel with:

int netif_receive_skb(struct sk_buff *skb);

In addition, poll() should process no more than budget packets in a single call. If fewer than budget packets are available, the driver should re-enable “packet available” interrupts and turn off polling with:

void napi_complete(struct napi_struct *napi);

The return value of poll() is the number of packets that were actually processed.
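A poll() method under these rules might be sketched as follows; rx_ring_next_skb() and enable_rx_irq() are hypothetical hardware-specific helpers, not kernel APIs:

```c
static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int done = 0;

	while (done < budget) {
		struct sk_buff *skb = rx_ring_next_skb(priv);

		if (!skb)
			break;
		netif_receive_skb(skb);	/* hand the packet to the stack */
		done++;
	}

	if (done < budget) {		/* ring drained: back to interrupts */
		napi_complete(napi);
		enable_rx_irq(priv);	/* hardware-specific unmasking */
	}
	return done;
}
```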
The networking subsystem promises that poll() will not be invoked simultaneously (for the same napi_struct) on multiple processors.
The final step is to tell the networking subsystem about your poll() method. This is done in your initialization code when registering the napi_struct:

netif_napi_add(dev, &napi, my_poll, 16);
The last parameter, weight, is a measure of the importance of this interface; the number stored here will turn out to be the same number your driver finds in the budget argument to poll(). Gigabit and faster adaptor drivers tend to set weight to 64; smaller values can be used for slower media.
NAPI, however, requires the following features to be available:

- A DMA ring, or enough RAM to store packets in software devices (read: low-latency re-use of input buffers).
- The ability to turn off interrupts, or at least the events that send packets up the stack.
NAPI processes packet events through what is known as the napi->poll() method. Typically, only packet receive events are processed in napi->poll(); the rest of the events may be processed by the regular interrupt handler to reduce processing latency (justified also because there are not that many of them).

Note, however, that NAPI does not enforce that napi->poll() only processes receive events. Tests with the tulip driver indicated slightly increased latency if all of the interrupt handling is moved to napi->poll(). MII/PHY handling also gets a little trickier.

The example used in this document moves only the receive processing to napi->poll(); this is shown with the patch for the tulip driver. For an example of code that moves all of the interrupt handling to napi->poll(), look at other drivers (tg3, e1000, sky2). There are caveats that might force you to move everything to napi->poll(); different NICs work differently depending on their status/event acknowledgement setup.
There are two types of event register ACK mechanisms:

a) Clear-on-read (COR): reading the status/event register clears everything that is pending.

b) Clear-on-write (COW): either you clear a status bit by writing a 1 in the bit location you want cleared (the most popular mechanism), or you clear it by writing a 0 there; no NIC supported by Linux seems to use the write-0 variant.

This is a very important topic, and appendix 2 is dedicated to more discussion.
Only one CPU can pick up the initial interrupt, and hence perform the initial napi_schedule(napi), for a given napi_struct; this is why poll() never runs concurrently with itself. For the rest of this text, we'll assume that napi->poll() only processes receive events.

The core of the NAPI API:

netif_napi_add(dev, napi, poll, weight)
	initialise the napi structure for polling dev

netif_napi_del(napi)
	remove the napi structure; must be called after the associated device is unregistered. free_netdev(dev) will call netif_napi_del() for all napi_structs still associated with the net device, so it may not be necessary for the driver to call this directly.

napi_schedule(napi)
	schedule a poll of napi

napi_schedule_prep(napi)
	put napi in a state ready to be added to the CPU polling list if it is up and running. You can look at this as the first half of napi_schedule(napi).

__napi_schedule(napi)
	add napi to the poll list for this CPU, assuming that napi_schedule_prep(napi) has already been called and returned 1.

napi_reschedule(napi)
	called to reschedule polling for napi, specifically for some deficient hardware.

napi_complete(napi)
	remove napi from the CPU poll list; it must be on the poll list of the current CPU. This primitive is called by napi->poll() when it completes its work. If the structure is not on the poll list at this call, it is a BUG().

napi_disable(napi)
	prevent the napi structure from being polled; may sleep if it is currently being polled.

napi_enable(napi)
	allow the napi structure to be polled again, after it was disabled using napi_disable().
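In practice, napi_enable() and napi_disable() typically bracket the device's up/down path. A sketch under the same assumptions as before (my_priv, my_open and my_stop are illustrative names):

```c
static int my_open(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	napi_enable(&priv->napi);	/* allow polling before rx starts */
	/* ... start hardware, enable interrupts ... */
	return 0;
}

static int my_stop(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* ... stop hardware, disable interrupts ... */
	napi_disable(&priv->napi);	/* waits for any poll in progress */
	return 0;
}
```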
NAPI provides an “inherent mitigation” which is bound by system capacity, as can be seen from the following data collected by Robert Olsson's tests on Gigabit Ethernet (e1000):
Psize | Ipps   | Tput    | Rxint  | Txint | Done   | Ndone
------|--------|---------|--------|-------|--------|------
60    | 890000 | 409362  | 17     | 27622 | 7      | 6823
128   | 758150 | 464364  | 21     | 9301  | 10     | 7738
256   | 445632 | 774646  | 42     | 15507 | 21     | 12906
512   | 232666 | 994445  | 241292 | 19147 | 241192 | 1062
1024  | 119061 | 1000003 | 872519 | 19258 | 872511 | 0
1440  | 85193  | 1000003 | 946576 | 19505 | 946569 | 0
Legend:

- Psize: packet size in bytes
- Ipps: input packets per second
- Tput: packets, out of the 1M sent, that made it out
- Rxint: receive interrupts seen
- Txint: transmit completion interrupts seen
- Done: the number of times poll() managed to pull all packets out of the rx ring (the lower the load, the more often the ring could be fully cleaned)
- Ndone: the converse of Done
Observe that when the NIC receives 890K packets/sec, only 17 rx interrupts are generated: the system cannot handle one interrupt per packet at that load level, so polling takes over. At lower rates, on the other hand, rx interrupts go up and the interrupt/packet ratio rises (as the table shows). So under low enough input, you may get one poll call, caused by a single interrupt, for each input packet. And if the system cannot handle an interrupt-per-packet ratio of 1, it will just have to chug along.
NAPI usage does not have to be limited only to receiving packets. With many devices the poll() routine can also be used to manage transmit completion or PHY interface state changes. By moving this processing out of the hardware interrupt service routine, there may be less latency and better performance.
Most chips with flow control only send a pause packet when they run out of Rx buffers. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packet storm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply when the system cannot pull out packets fast enough, i.e. send a pause only when you run out of rx buffers.
There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.
In some cases, NAPI may introduce additional software IRQ latency.
On some devices, changing the IRQ mask may be a slow operation, or require additional locking. This overhead may negate any performance benefits observed with NAPI.
There are two common race issues that a driver may have to deal with. These are cases where it is possible to cause the receiver to stop because of hardware and logic interaction.
If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the “interrupt-enable” bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.
If we take the example of the tulip: “pending work” is indicated by the status bit (CSR5 in tulip). The corresponding interrupt bit (CSR7 in tulip) might be turned off (but the CSR5 will continue to be turned on with new packet arrivals even if we clear it the first time). Very important is the fact that if we turn on the interrupt bit when status is set, then an immediate irq is triggered.
If we cleared the rx ring and proclaimed there was “no more work to be done” and then went on to do a few other things; then when we enable interrupts, there is a possibility that a new packet might sneak in during this phase. It helps to look at the pseudo code for the tulip poll routine:
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded:
			exit, no touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
CSR5 bit of interest is only the rx status.
Look at the last if statement: you have just finished grabbing all the packets from the rx ring, and you check whether the status bit says more packets arrived in the meantime; it says none, so you enable rx interrupts again. If a new packet came in during that check, we are counting on CSR5 being set in that small window of opportunity, so that re-enabling interrupts will actually trigger an interrupt to register the new packet for processing.
Some systems have hardware that does not do level-triggered IRQs properly. Normally, IRQs may be lost while masked, and the only safe way to leave poll is to double-check for new input after napi_complete() is invoked, re-enabling polling if new input is seen.
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if budget is exceeded:
			exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts()
	napi_complete(napi);
	if (ring_has_new_packet() && napi_reschedule(napi)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}
Basically, napi_complete() removes us from the poll list; but because a new packet might sneak in during the race window (and would then never be noticed), we check the ring again and, if needed, re-add ourselves to the poll list.
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, putting them under scheduler control; this also prevents consecutive softirqs from monopolizing the CPU. One consequence is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications together with networking, to get the proper softirq/user balance. Increasing ksoftirqd priority to 0 (and eventually higher) is reported to cure problems with low network performance at high CPU load.
Most-used processes in a GigE router:

USER  PID %CPU %MEM  SIZE   RSS TTY STAT START    TIME COMMAND
root    3  0.2  0.0     0     0 ?   RWN  Aug 15 602:00 (ksoftirqd_CPU0)
root  232  0.0  7.9 41400 40884 ?   S    Aug 15  74:12 gated