NAPI (“New API”) is an extension to the device driver packet processing framework, designed to improve the performance of high-speed networking. NAPI works through:

Interrupt mitigation: high-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: when the system is overwhelmed and must drop packets, it is better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.
New drivers should use NAPI if the hardware can support it. However, NAPI additions to the kernel do not break backward compatibility and drivers may still process completions directly in interrupt context if necessary.
The following is a whirlwind tour of what must be done to create a NAPI-compliant network driver.
For each interrupt vector, the driver must allocate an instance of struct napi_struct. This does not require calling any special function; the structure is typically embedded in the driver's private structure. Each napi_struct must be initialised and registered before the net device itself, using netif_napi_add(), and unregistered after the net device, using netif_napi_del().
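A minimal skeleton of this setup might look as follows; my_priv, my_probe, my_remove and my_poll are illustrative names, not part of any real driver:

```c
#include <linux/netdevice.h>

/* Hypothetical driver-private structure with an embedded NAPI context
 * (one per interrupt vector). */
struct my_priv {
	struct napi_struct napi;
	/* ... other per-device state ... */
};

static int my_poll(struct napi_struct *napi, int budget);

static int my_probe(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* Register the NAPI instance before the net device itself. */
	netif_napi_add(dev, &priv->napi, my_poll, 16);
	return register_netdev(dev);
}

static void my_remove(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	unregister_netdev(dev);
	/* Unregister NAPI after the net device has been unregistered. */
	netif_napi_del(&priv->napi);
}
```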
The next step is to make some changes to your driver's interrupt handler. If your driver has been interrupted because a new packet is available, that packet should not be processed at that time. Instead, your driver should disable any further “packet available” interrupts and tell the networking subsystem to poll your driver shortly to pick up all available packets. Disabling interrupts, of course, is a hardware-specific matter between the driver and the adaptor. Arranging for polling is done with a call to:
void napi_schedule(struct napi_struct *napi);
An alternative form you'll see in some drivers is:
if (napi_schedule_prep(napi))
        __napi_schedule(napi);
The end result is the same either way. (If napi_schedule_prep() returns zero, it means that a poll was already scheduled, and you should not have received another interrupt.)
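An interrupt handler following this pattern might be sketched as below; rx_event_pending() and disable_rx_irq() are hypothetical stand-ins for whatever hardware-specific status check and interrupt masking your adaptor requires:

```c
static irqreturn_t my_interrupt(int irq, void *dev_id)
{
	struct my_priv *priv = dev_id;

	if (rx_event_pending(priv)) {		/* hypothetical status check */
		disable_rx_irq(priv);		/* hardware-specific masking */
		napi_schedule(&priv->napi);	/* ask for a poll() call soon */
	}
	return IRQ_HANDLED;
}
```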
The next step is to create a poll() method for your driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:
int (*poll)(struct napi_struct *napi, int budget);
The poll() function should process all available incoming packets, much as your interrupt handler might have done in the pre-NAPI days. There are some exceptions, however. Packets should not be passed to netif_rx(); instead, feed them to the kernel with:

int netif_receive_skb(struct sk_buff *skb);

In addition, poll() should process no more than budget packets in a single call. If fewer than budget packets are available, the driver should re-enable “packet available” interrupts and turn off polling with:

void napi_complete(struct napi_struct *napi);

The return value of poll() is the number of packets that were actually processed.
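A poll() method under these rules might be sketched as follows; rx_ring_next_skb() and enable_rx_irq() are hypothetical hardware-specific helpers, not kernel APIs:

```c
static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int done = 0;

	while (done < budget) {
		struct sk_buff *skb = rx_ring_next_skb(priv);

		if (!skb)
			break;
		netif_receive_skb(skb);	/* hand the packet to the stack */
		done++;
	}

	if (done < budget) {		/* ring drained: back to interrupts */
		napi_complete(napi);
		enable_rx_irq(priv);	/* hardware-specific unmasking */
	}
	return done;
}
```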
The networking subsystem promises that poll() will not be invoked simultaneously (for the same napi_struct) on multiple processors.
The final step is to tell the networking subsystem about your poll() method. This is done in your initialization code when registering the napi_struct:

netif_napi_add(dev, &napi, my_poll, 16);
The last parameter, weight, is a measure of the importance of this interface; the number stored here will turn out to be the same number your driver finds in the budget argument to poll(). Gigabit and faster adaptor drivers tend to set weight to 64; smaller values can be used for slower media.
NAPI, however, requires the following features to be available:

- A DMA ring, or enough RAM to store packets in software devices (read: low-latency re-use of input buffers).
- The ability to turn off interrupts, or at least the events that send packets up the stack.
NAPI processes packet events through what is known as the napi->poll() method. Typically, only packet receive events are processed in napi->poll(); the rest of the events may be processed by the regular interrupt handler to reduce processing latency (justified also because there are not that many of them).

Note, however, that NAPI does not enforce that napi->poll() only processes receive events. Tests with the tulip driver indicated slightly increased latency if all of the interrupt handling is moved to napi->poll(). MII/PHY handling also gets a little trickier.

The example used in this document moves only the receive processing to napi->poll(); this is shown with the patch for the tulip driver. For an example of code that moves all of the interrupt handling to napi->poll(), look at other drivers (tg3, e1000, sky2). There are caveats that might force you to move everything to napi->poll(); different NICs work differently depending on their status/event acknowledgement setup.
There are two types of event register ACK mechanisms:

a) Clear-on-read (COR): reading the status/event register clears everything that is pending.

b) Clear-on-write (COW): either you clear a status bit by writing a 1 in the bit location you want cleared (the most popular mechanism), or you clear it by writing a 0 there; no NIC supported by Linux seems to use the write-0 variant.

This is a very important topic, and appendix 2 is dedicated to more discussion.
Only one CPU can pick up the initial interrupt, and hence perform the initial napi_schedule(napi), for a given napi_struct; this is why poll() never runs concurrently with itself. For the rest of this text, we'll assume that napi->poll() only processes receive events.

The core of the NAPI API:

netif_napi_add(dev, napi, poll, weight)
	initialise the napi structure for polling dev

netif_napi_del(napi)
	remove the napi structure; must be called after the associated device is unregistered. free_netdev(dev) will call netif_napi_del() for all napi_structs still associated with the net device, so it may not be necessary for the driver to call this directly.

napi_schedule(napi)
	schedule a poll of napi

napi_schedule_prep(napi)
	put napi in a state ready to be added to the CPU polling list if it is up and running. You can look at this as the first half of napi_schedule(napi).

__napi_schedule(napi)
	add napi to the poll list for this CPU, assuming that napi_schedule_prep(napi) has already been called and returned 1.

napi_reschedule(napi)
	called to reschedule polling for napi, specifically for some deficient hardware.

napi_complete(napi)
	remove napi from the CPU poll list; it must be on the poll list of the current CPU. This primitive is called by napi->poll() when it completes its work. If the structure is not on the poll list at this call, it is a BUG().

napi_disable(napi)
	prevent the napi structure from being polled; may sleep if it is currently being polled.

napi_enable(napi)
	allow the napi structure to be polled again, after it was disabled using napi_disable().
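In practice, napi_enable() and napi_disable() typically bracket the device's up/down path. A sketch under the same assumptions as before (my_priv, my_open and my_stop are illustrative names):

```c
static int my_open(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	napi_enable(&priv->napi);	/* allow polling before rx starts */
	/* ... start hardware, enable interrupts ... */
	return 0;
}

static int my_stop(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	/* ... stop hardware, disable interrupts ... */
	napi_disable(&priv->napi);	/* waits for any poll in progress */
	return 0;
}
```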
NAPI provides an “inherent mitigation” which is bound by system capacity, as can be seen from the following data collected by Robert Olsson's tests on Gigabit Ethernet (e1000):
Psize | Ipps   | Tput    | Rxint  | Txint | Done   | Ndone
------|--------|---------|--------|-------|--------|------
60    | 890000 | 409362  | 17     | 27622 | 7      | 6823
128   | 758150 | 464364  | 21     | 9301  | 10     | 7738
256   | 445632 | 774646  | 42     | 15507 | 21     | 12906
512   | 232666 | 994445  | 241292 | 19147 | 241192 | 1062
1024  | 119061 | 1000003 | 872519 | 19258 | 872511 | 0
1440  | 85193  | 1000003 | 946576 | 19505 | 946569 | 0
Legend:

- Psize: packet size in bytes
- Ipps: input packets per second
- Tput: packets, out of the 1M sent, that made it out
- Rxint: receive interrupts seen
- Txint: transmit completion interrupts seen
- Done: the number of times poll() managed to pull all packets out of the rx ring (the lower the load, the more often the ring could be fully cleaned)
- Ndone: the converse of Done
Observe that when the NIC receives 890K packets/sec, only 17 rx interrupts are generated: the system cannot handle one interrupt per packet at that load level, so polling takes over. At lower rates, on the other hand, rx interrupts go up and the interrupt/packet ratio rises (as the table shows). So under low enough input, you may get one poll call, caused by a single interrupt, for each input packet. And if the system cannot handle an interrupt-per-packet ratio of 1, it will just have to chug along.
NAPI usage does not have to be limited only to receiving packets. With many devices the poll() routine can also be used to manage transmit completion or PHY interface state changes. By moving this processing out of the hardware interrupt service routine, there may be less latency and better performance.
Most chips with flow control only send a pause packet when they run out of Rx buffers. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packet storm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply when the system cannot pull out packets fast enough, i.e. send a pause only when you run out of rx buffers.
There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.
In some cases, NAPI may introduce additional software IRQ latency.
On some devices, changing the IRQ mask may be a slow operation, or require additional locking. This overhead may negate any performance benefits observed with NAPI.
There are two common race issues that a driver may have to deal with. These are cases where it is possible to cause the receiver to stop because of hardware and logic interaction.
If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the “interrupt-enable” bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.
If we take the example of the tulip: “pending work” is indicated by the status bit (CSR5 in tulip). The corresponding interrupt bit (CSR7 in tulip) might be turned off (but the CSR5 will continue to be turned on with new packet arrivals even if we clear it the first time). Very important is the fact that if we turn on the interrupt bit when status is set, then an immediate irq is triggered.
If we cleared the rx ring and proclaimed there was “no more work to be done” and then went on to do a few other things; then when we enable interrupts, there is a possibility that a new packet might sneak in during this phase. It helps to look at the pseudo code for the tulip poll routine:
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded:
			exit, no touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
CSR5 bit of interest is only the rx status.
Look at the last if statement: you have just finished grabbing all the packets from the rx ring, and you check whether the status bit says more packets arrived in the meantime; it says none, so you enable rx interrupts again. If a new packet came in during that check, we are counting on CSR5 being set in that small window of opportunity, so that re-enabling interrupts will actually trigger an interrupt to register the new packet for processing.
Some systems have hardware that does not do level-triggered IRQs properly. Normally, IRQs may be lost while masked, and the only safe way to leave poll is to double-check for new input after napi_complete() is invoked, re-enabling polling if new input is seen.
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if budget is exceeded:
			exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts()
	napi_complete(napi);
	if (ring_has_new_packet() && napi_reschedule(napi)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}
Basically, napi_complete() removes us from the poll list; but because a new packet might sneak in during the race window (and would then never be noticed), we check the ring again and, if needed, re-add ourselves to the poll list.
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, putting them under scheduler control; this also prevents consecutive softirqs from monopolizing the CPU. One consequence is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications together with networking, to get the proper softirq/user balance. Increasing ksoftirqd priority to 0 (and eventually higher) is reported to cure problems with low network performance at high CPU load.
Most-used processes in a GigE router:

USER  PID %CPU %MEM  SIZE   RSS TTY STAT START    TIME COMMAND
root    3  0.2  0.0     0     0 ?   RWN  Aug 15 602:00 (ksoftirqd_CPU0)
root  232  0.0  7.9 41400 40884 ?   S    Aug 15  74:12 gated