Boost socket performance on Linux

Four ways to speed up your network applications

Level: Intermediate

M. Tim Jones (mtj@mtjones.com), Consultant Engineer, Emulex

17 Jan 2006
Updated 03 Feb 2006

The Sockets API lets you develop client and server applications that can communicate across a local network or across the world via the Internet. Like any API, you can use the Sockets API in ways that promote high performance -- or inhibit it. This article explores four ways to use the Sockets API to squeeze the greatest performance out of your application and to tune the GNU/Linux® environment to achieve the best results. Editor's note: we updated Tip 3 to correct an error in the calculation for Bandwidth Delay Product (BDP), spotted by an alert reader.

When developing a sockets application, job number one is usually establishing reliability and meeting the necessary requirements. With the four tips in this article, you can design and develop your sockets application for best performance, right from the beginning. This article covers use of the Sockets API, a couple of socket options that provide enhanced performance, and GNU/Linux tuning.

To develop applications with lively performance capabilities, follow these tips:

  • Minimize packet transmit latency.
  • Minimize system call overhead.
  • Adjust TCP windows for the Bandwidth Delay Product.
  • Dynamically tune the GNU/Linux TCP/IP stack.

Tip 1. Minimize packet transmit latency

When you communicate through a TCP socket, the data are chopped into blocks so that they fit within the TCP payload for the given connection. The size of the TCP payload depends on several factors (such as the maximum packet size along the path), but these factors are known at connection initiation time. To achieve the best performance, the goal is to fill each packet as much as possible with the available data. When insufficient data exist to fill a payload (otherwise known as the maximum segment size, or MSS), TCP employs the Nagle algorithm to automatically concatenate small buffers into a single segment. Doing so increases the efficiency of the application and reduces overall network congestion by minimizing the number of small packets that are sent.

John Nagle's algorithm works well to minimize small packets by concatenating them into larger ones, but sometimes you simply want the ability to send small packets. A simple example is the telnet application, which allows a user to interact with a remote system, typically through a shell. If the user were required to fill a segment with typed characters before the packet was sent, the experience would be less than desirable.

Another example is the HTTP protocol. Commonly, a client browser makes a small request (an HTTP request message), resulting in a much larger response by the Web server (the Web page).

The solution

The first thing you should consider is that the Nagle algorithm fulfills a need. Because the algorithm coalesces data to try to fill a complete TCP packet segment, it does introduce some latency. But it does this with the benefit of minimizing the number of packets sent on the wire, and so it minimizes congestion on the network.

But in cases where you need to minimize that transmit latency, the Sockets API provides a solution. To disable the Nagle algorithm, you can set the TCP_NODELAY socket option, as shown in Listing 1.


Listing 1. Disabling the Nagle algorithm for a TCP socket
/* Headers needed: <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h>,
   <stdio.h>, and <stdlib.h> */
int sock, flag, ret;

/* Create new stream socket */
sock = socket( AF_INET, SOCK_STREAM, 0 );

/* Disable the Nagle (TCP No Delay) algorithm */
flag = 1;
ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag) );

if (ret == -1) {
  printf("Couldn't setsockopt(TCP_NODELAY)\n");
  exit( EXIT_FAILURE );
}

Bonus tip: Experimentation with Samba demonstrates that disabling the Nagle algorithm results in almost doubling the read performance when reading from a Samba drive on a Microsoft® Windows® server.




Tip 2. Minimize system call overhead

Whenever you read or write data to a socket, you're using a system call. This call (such as read or write) crosses the boundary between the user-space application and the kernel. Additionally, prior to getting to the kernel, your call goes through the C library to a common function in the kernel (system_call()). From system_call(), your call gets to the filesystem layer, where the kernel determines what type of device you're dealing with. Eventually, your call gets to the sockets layer, where data are read or queued for transmission on the socket (involving a data copy).

This process illustrates that the system call operates not just in the application and kernel domains but through many levels within each domain. The process is expensive, so the more calls you make, the more time you spend working through this call chain, and the less performance you get from your application.

Because you can't avoid making these system calls, your only option is to minimize the number of times you do it. Fortunately, you have control over this process.

The solution

When writing data to a socket, write all the data you have available instead of performing multiple writes. For reads, pass in the largest buffer you can support, since the kernel will try to fill it if enough data exist (in addition to keeping TCP's advertised window open). In this way, you minimize the number of calls you make and achieve better overall performance. The sendfile system call is also useful for large data transfers, though the TCP_CORK socket option should be set in that case. The writev system call and the asynchronous I/O API (aio_read, aio_write, and so on) are further options for bulk transfers; the sketch below illustrates two of these approaches.
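Here is a minimal sketch of two of these techniques; the helper functions and their names are hypothetical, and error handling (including short writes) is omitted for brevity.

/* Two ways to reduce per-call overhead on a connected TCP socket "sock":
   writev gathers several buffers into a single system call, and TCP_CORK
   plus sendfile transmits a header and a file without emitting small,
   partially filled packets. Hypothetical helpers; error handling omitted. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a protocol header and a payload with one system call instead of two. */
ssize_t send_two_buffers( int sock, void *hdr, size_t hdr_len,
                          void *body, size_t body_len )
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;   iov[0].iov_len = hdr_len;
    iov[1].iov_base = body;  iov[1].iov_len = body_len;

    return writev( sock, iov, 2 );   /* caller must handle short writes */
}

/* Send a small header followed by an open file, corking the socket so the
   header is not pushed onto the wire in its own small packet. */
void send_file_with_header( int sock, int fd, off_t file_len,
                            void *hdr, size_t hdr_len )
{
    int on = 1, off = 0;

    setsockopt( sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on) );

    write( sock, hdr, hdr_len );            /* queued, not yet transmitted */
    sendfile( sock, fd, NULL, file_len );   /* file data copied in-kernel  */

    /* Uncorking flushes any remaining partial segment. */
    setsockopt( sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off) );
}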




Tip 3. Adjust TCP windows for the Bandwidth Delay Product

TCP depends on several factors for performance. Two of the most important are the link bandwidth (the rate at which packets can be transmitted on the network) and the round-trip time, or RTT (the delay between a segment being sent and its acknowledgment from the peer). These two values determine what is called the Bandwidth Delay Product (BDP).

Given the link bandwidth rate and the RTT, you can calculate the BDP, but what does this do for you? It turns out that the BDP gives you an easy way to calculate the theoretical optimal TCP socket buffer sizes (which hold both the queued data awaiting transmission and queued data awaiting receipt by the application). If the buffer is too small, the TCP window cannot fully open, and this limits performance. If it's too large, precious memory resources can be wasted. If you set the buffer just right, you can fully utilize the available bandwidth. Let's look at an example:

BDP = link_bandwidth * RTT

If your application communicates over a 100Mbps local area network with a 50 ms RTT, the BDP is:

100Mbps * 0.050 sec / 8 = 0.625MB = 625KB

Note: I divide by 8 to convert from bits to bytes.

So, set your TCP window to the BDP, or 625KB. But the default window for TCP on Linux 2.6 is 110KB, which limits your bandwidth for the connection to 2.2MBps, as I've calculated here:

throughput = window_size / RTT

110KB / 0.050 = 2.2MBps

If instead you use the window size calculated above, you get a whopping 12.5MBps, as shown here:

625KB / 0.050 = 12.5MBps
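If you want to reproduce this arithmetic, the short sketch below prints the same three figures; the constants are simply the example values used above (100Mbps link, 50 ms RTT, and the 110592-byte Linux 2.6 default window), using the same decimal KB and MB units as the text.

/* Reproduces the example calculations: the BDP-sized buffer and the
   throughput ceiling imposed by a given window size. */
#include <stdio.h>

int main( void )
{
    double link_bandwidth = 100e6;     /* link rate in bits per second */
    double rtt            = 0.050;     /* round-trip time in seconds   */
    double default_window = 110592.0;  /* Linux 2.6 default, in bytes  */

    /* BDP in bytes: divide by 8 to convert bits to bytes */
    double bdp = ( link_bandwidth * rtt ) / 8.0;

    printf( "Optimal buffer (BDP): %.0f bytes (%.0fKB)\n", bdp, bdp / 1000.0 );
    printf( "Throughput, default window: %.1f MBps\n",
            ( default_window / rtt ) / 1e6 );
    printf( "Throughput, BDP-sized window: %.1f MBps\n",
            ( bdp / rtt ) / 1e6 );

    return 0;
}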

That's quite a difference and will provide greater throughput for your socket. So you now know how to calculate the optimal socket buffer size for your socket. But how do you make this change?

The solution

The Sockets API provides several socket options, two of which exist to change the socket send and receive buffer sizes. Listing 2 shows how to adjust the size of the socket send and receive buffers with the SO_SNDBUF and SO_RCVBUF options.

Note: Although the socket buffer size determines the size of the advertised TCP window, TCP also maintains a congestion window within the advertised window. Therefore, because of congestion, a given socket may never utilize the maximum advertised window.


Listing 2. Manually setting the send and receive socket buffer sizes
/* Headers needed: <sys/socket.h> and <netinet/in.h>.
   BDP is the buffer size in bytes, calculated as described above. */
int ret, sock, sock_buf_size;

sock = socket( AF_INET, SOCK_STREAM, 0 );

sock_buf_size = BDP;

ret = setsockopt( sock, SOL_SOCKET, SO_SNDBUF,
                  (char *)&sock_buf_size, sizeof(sock_buf_size) );

ret = setsockopt( sock, SOL_SOCKET, SO_RCVBUF,
                  (char *)&sock_buf_size, sizeof(sock_buf_size) );

Within the Linux 2.6 kernel, the value you request is doubled by the kernel to leave room for bookkeeping overhead (the socket(7) man page documents this for both SO_SNDBUF and SO_RCVBUF), so the buffers you get are larger than what you asked for. You can verify the effective size of each buffer with the getsockopt call, as sketched below.
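For example, a short helper like the following (a hypothetical function, not part of the listings above) reads back the effective sizes:

/* Read back the send and receive buffer sizes the kernel actually installed
   for an existing socket; the values may be larger than what was requested. */
#include <stdio.h>
#include <sys/socket.h>

void print_socket_buffers( int sock )
{
    int sndbuf = 0, rcvbuf = 0;
    socklen_t optlen;

    optlen = sizeof(sndbuf);
    getsockopt( sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen );

    optlen = sizeof(rcvbuf);
    getsockopt( sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen );

    printf( "SO_SNDBUF = %d bytes, SO_RCVBUF = %d bytes\n", sndbuf, rcvbuf );
}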

Jumbo frames

Also consider increasing the packet size from 1,500 to 9,000 bytes (known as a jumbo frame). This can be done in local network situations by setting the Maximum Transmission Unit (or MTU) and can really boost performance. While great for LANs, it can sometimes be problematic in WANs because intermediary equipment such as switches may not support it. The MTU can be modified using the ifconfig utility.
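If you would rather change the MTU from within a program than through ifconfig, the SIOCSIFMTU ioctl offers an equivalent path. The sketch below is only an illustration: the interface name and MTU value are examples, and the operation requires root privileges.

/* Set an interface MTU programmatically (equivalent to "ifconfig eth0 mtu 9000").
   Requires root privileges; the interface name and MTU are examples only. */
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int set_mtu( const char *ifname, int mtu )
{
    struct ifreq ifr;
    int sock = socket( AF_INET, SOCK_DGRAM, 0 );

    memset( &ifr, 0, sizeof(ifr) );
    strncpy( ifr.ifr_name, ifname, IFNAMSIZ - 1 );
    ifr.ifr_mtu = mtu;

    if (ioctl( sock, SIOCSIFMTU, &ifr ) == -1) {
        perror( "SIOCSIFMTU" );
        close( sock );
        return -1;
    }

    close( sock );
    return 0;
}

/* Usage: set_mtu( "eth0", 9000 ); */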

As for window scaling, TCP originally supported a maximum 64KB window (16 bits were used to define the window size). With the inclusion of window scaling (per RFC 1323), you can use a 32-bit value to represent the size of the window. The TCP/IP stack provided in GNU/Linux supports this option (and many others).

Bonus tip: The Linux kernel can also auto-tune these socket buffers (see tcp_rmem and tcp_wmem in Table 1 below), but those settings affect the entire stack. If you need to adjust the window for only one connection or type of connection, setting the buffer sizes per socket, as in Listing 2, is the way to go.

Tip 4. Dynamically tune the GNU/Linux TCP/IP stack

A standard GNU/Linux distribution tries to optimize for a wide range of deployments. This means that the standard distribution might not be optimal for your environment.

The solution

GNU/Linux provides a wide range of tunable kernel parameters that you can use to dynamically tailor the operating system for your specific use. Let's look at some of the more important options that affect sockets performance.

The tunable kernel parameters exist within the /proc virtual filesystem. Each file in this filesystem represents one or more parameters that can be read through the cat utility or modified with the echo command. Listing 3 shows how to query and enable a tunable parameter (in this case, enabling IP forwarding within the TCP/IP stack).


Listing 3. Tuning: Enable IP forwarding within the TCP/IP stack
[root@camus]# cat /proc/sys/net/ipv4/ip_forward
0
[root@camus]# echo "1" > /proc/sys/net/ipv4/ip_forward
[root@camus]# cat /proc/sys/net/ipv4/ip_forward
1
[root@camus]#

Table 1 is a list of several tunable parameters that can help you increase the performance of the Linux TCP/IP stack.

Table 1. Kernel tunable parameters for TCP/IP stack performance
Tunable parameter Default value Option description
/proc/sys/net/core/rmem_default "110592" Defines the default receive window size; for a large BDP, the size should be larger.
/proc/sys/net/core/rmem_max "110592" Defines the maximum receive window size; for a large BDP, the size should be larger.
/proc/sys/net/core/wmem_default "110592" Defines the default send window size; for a large BDP, the size should be larger.
/proc/sys/net/core/wmem_max "110592" Defines the maximum send window size; for a large BDP, the size should be larger.
/proc/sys/net/ipv4/tcp_window_scaling "1" Enables window scaling as defined by RFC 1323; must be enabled to support windows larger than 64KB.
/proc/sys/net/ipv4/tcp_sack "1" Enables selective acknowledgment, which improves performance by selectively acknowledging packets received out of order (causing the sender to retransmit only the missing segments); should be enabled (for wide area network communication), but it can increase CPU utilization.
/proc/sys/net/ipv4/tcp_fack "1" Enables Forward Acknowledgment, which operates with Selective Acknowledgment (SACK) to reduce congestion; should be enabled.
/proc/sys/net/ipv4/tcp_timestamps "1" Enables calculation of RTT in a more accurate way (see RFC 1323) than the retransmission timeout; should be enabled for performance.
/proc/sys/net/ipv4/tcp_mem "24576 32768 49152" Determines how the TCP stack should behave for memory usage; each count is in memory pages (typically 4KB). The first value is the low threshold for memory usage. The second value is the threshold for a memory pressure mode to begin to apply pressure to buffer usage. The third value is the maximum threshold. At this level, packets can be dropped to reduce memory usage. Increase the count for large BDP (but remember, it's memory pages, not bytes).
/proc/sys/net/ipv4/tcp_wmem "4096 16384 131072" Defines per-socket memory usage for auto-tuning. The first value is the minimum number of bytes allocated for the socket's send buffer. The second value is the default (overridden by wmem_default) to which the buffer can grow under non-heavy system loads. The third value is the maximum send buffer space (overridden by wmem_max).
/proc/sys/net/ipv4/tcp_rmem "4096 87380 174760" Same as tcp_wmem except that it refers to receive buffers for auto-tuning.
/proc/sys/net/ipv4/tcp_low_latency "0" Tells the TCP/IP stack to prefer low latency over higher throughput; should be disabled when throughput is the priority.
/proc/sys/net/ipv4/tcp_westwood "0" Enables a sender-side congestion control algorithm that maintains estimates of throughput and tries to optimize the overall utilization of bandwidth; should be enabled for WAN communication. This option is also useful for wireless interfaces, as packet loss may not be caused by congestion.
/proc/sys/net/ipv4/tcp_bic "1" Enables Binary Increase Congestion control (BIC) for fast long-distance networks; permits better utilization of links operating at gigabit speeds; should be enabled for WAN communication.

As with any tuning effort, the best approach is experimental in nature. Your application behavior, processor speed, and availability of memory all affect how these parameters will alter performance. In some cases, what you think should be beneficial can be detrimental (and vice versa). So, try an option and then check the result. In other words, trust but verify.

Bonus tip: A word about persistent configuration. If you reboot a GNU/Linux system, any tunable kernel parameters that you changed revert to their defaults. To make your settings the defaults, place them in /etc/sysctl.conf so they are applied at boot time, as in the example below.
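For instance, an /etc/sysctl.conf fragment along these lines makes larger buffer limits persistent. The keys mirror the /proc/sys paths in Table 1 (with slashes replaced by dots); the values shown are illustrative only, so substitute the sizes you calculated from your own BDP.

# Illustrative /etc/sysctl.conf entries; values are examples only
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.tcp_window_scaling = 1

Most distributions apply this file at boot; you can also load it immediately with sysctl -p.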




GNU/Linux tools

GNU/Linux is attractive to me because of the number of tools that are available. The vast majority are command-line tools, but they are amazingly useful and intuitive. GNU/Linux provides several tools -- either natively or available as open source -- to debug networking applications, measure bandwidth/throughput, and check link utilization.

Table 2 lists some of the most useful GNU/Linux tools along with their intended use. Table 3 lists useful tools that are not typically part of GNU/Linux distributions.

Table 2. Native tools commonly found in any GNU/Linux distribution
GNU/Linux utility Purpose
ping Most commonly used to check accessibility to a host but can also be used to identify the RTT for the bandwidth-delay-product calculation.
traceroute Prints the path (route) for a connection to a network host through a series of routers and gateways, identifying the latency between each hop.
netstat Identifies various statistics about the networking subsystem, protocols, and connections.
tcpdump Shows the protocol-level packet trace for one or more connections; also includes timing information, which you can use to explore the packet timing of the various protocol services.

Table 3. Useful performance tools not typically available in a GNU/Linux distribution
GNU/Linux utility Purpose
netlog Provides application instrumentation for network performance.
nettimer Generates a metric for bottleneck link bandwidth; can be used for protocol auto-tuning.
Ethereal Provides the features of tcpdump (packet trace) in an easy-to-use graphical interface.
iperf Measures network performance for both TCP and UDP; measures maximum bandwidth, and also reports delay jitter and datagram loss.
trafshow Provides full-screen visualization of network traffic.




Conclusion

Experiment with these tips and techniques to increase the performance of your sockets applications, including reducing transmit latency by disabling the Nagle algorithm, increasing bandwidth utilization of a socket through buffer sizing, reducing system call overhead by minimizing the number of system calls, and tuning the Linux TCP/IP stack with tunable kernel parameters.

Always consider the nature of your application when tuning. For example, is your application LAN-based or will it communicate over the Internet? If your application operates only within a LAN, increasing socket-buffer sizes may not yield much benefit, but enabling jumbo frames certainly will!

Finally, always check the results of your tuning with a tool like tcpdump or Ethereal. The changes you see at the packet level will help indicate the success of your tuning with these techniques.