Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers
kraft @ 2005-11-13 18:37

Original author: Fang (fangguicheng@21cn.com)

Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers, Part 1: Asynchronous I/O

Background: Polling, PIO, DMA, and Interrupts

    Early I/O devices were not dramatically slower than the CPU. The CPU would periodically poll each I/O device, handle any pending requests, and then return to its work. To this day, floppy disk drives are still serviced this way.
    As CPU performance improved rapidly, this inefficient scheme wasted a great deal of CPU time, and interrupt-driven I/O became the commonly adopted technique. With interrupts, an I/O device that needs service raises a hardware interrupt, forcing the CPU to abandon its current task and enter a dedicated interrupt service routine; when the routine finishes, the CPU resumes the interrupted work. The I/O devices and the CPU can thus operate concurrently, so the CPU no longer sits idle waiting for I/O to complete.
    Early data transfers were mostly PIO (programmed I/O): data is moved by programming the device's I/O addresses. With a serial port, for example, software writes one byte to the port; when the device finishes transmitting it, it raises an interrupt, and software repeats the process until all the data has been sent. Better hardware provides a FIFO (first-in, first-out buffer) that lets software transfer several bytes at a time.
    Clearly, PIO still consumes too much CPU time for fast I/O devices, because the CPU must drive every transfer. DMA (direct memory access) greatly reduces that cost: the CPU only tells the DMA controller the starting address and size of a data block, and the controller then moves the data between memory and the device on its own while the CPU handles other tasks. When the transfer finishes, an interrupt is raised.

Synchronous File I/O and Asynchronous File I/O

The following is excerpted from the MSDN article "Synchronous File I/O and Asynchronous File I/O".
There are two kinds of file I/O: synchronous and asynchronous. Asynchronous file I/O is also known as overlapped I/O.
With synchronous file I/O, a thread starts an I/O operation and immediately enters a wait state, waking to continue only after the operation completes. With asynchronous file I/O, a thread sends an I/O request to the kernel and then goes on to do other work; once the kernel completes the request, it notifies the thread that the I/O operation has finished.

    If an I/O request takes a long time to execute, asynchronous file I/O can raise efficiency significantly: during the time the thread would otherwise spend waiting, the CPU schedules other threads, and if no other thread needs to run, that time is simply wasted (the system may schedule its zero-page thread). If the I/O request completes quickly, however, asynchronous I/O is actually less efficient than synchronous I/O.
    Synchronous I/O allows only one operation at a time: operations on a given file handle are serialized, so even two threads cannot issue simultaneous reads and writes on the same handle. Overlapped I/O lets one or more threads issue multiple I/O requests concurrently.
    When an asynchronous request completes, the application can be notified by the file handle becoming signaled, by an event object becoming signaled, or by checking with GetOverlappedResult whether the request has finished.
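
    As an illustration, here is a minimal sketch of overlapped file I/O using an event for notification; the file name is hypothetical and error handling is abbreviated.

#include <windows.h>
#include <stdio.h>

void TestOverlappedRead(void)
{
    // FILE_FLAG_OVERLAPPED enables asynchronous I/O on this handle.
    HANDLE hFile = CreateFile("test.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if(hFile == INVALID_HANDLE_VALUE)
        return;

    static char Buffer[64 * 1024];
    OVERLAPPED Ov = {0};
    Ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // signaled on completion

    // ReadFile returns at once; FALSE with ERROR_IO_PENDING means the
    // request has been queued to the driver.
    if(!ReadFile(hFile, Buffer, sizeof(Buffer), NULL, &Ov) &&
       GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(Ov.hEvent);
        CloseHandle(hFile);
        return;
    }

    // ... the thread is free to do other work here ...

    DWORD BytesRead;
    if(GetOverlappedResult(hFile, &Ov, &BytesRead, TRUE)) // TRUE = wait
        printf("Read %lu bytes.\n", BytesRead);

    CloseHandle(Ov.hEvent);
    CloseHandle(hFile);
}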

References
1. MSDN Library
2. 《Windows高级编程指南》
3. 《Windows核心编程》
4. 《Windows 2000 设备驱动程序设计指南》

Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers, Part 2: APC

    Alertable I/O provides a more efficient form of asynchronous notification. ReadFileEx / WriteFileEx take a callback function (an APC routine) along with the I/O request; once the request completes and the thread enters an alertable state, the callback runs.
    The following five functions can put a thread into an alertable state:
    SleepEx
    WaitForSingleObjectEx
    WaitForMultipleObjectsEx
    SignalObjectAndWait
    MsgWaitForMultipleObjectsEx
    When a thread enters an alertable state, the kernel checks its APC queue; if the queue holds APCs, they are executed one by one in FIFO order. If the queue is empty, the thread blocks waiting on the object(s). Later, as soon as an APC is queued, the thread wakes to execute it, and the wait function returns WAIT_IO_COMPLETION.
    QueueUserAPC can be used to queue an APC manually; the APC executes as soon as the target thread is in an alertable state.
    The main drawback of alertable I/O is that the thread that issues an I/O request must also be the thread that handles the result, and if a thread exits with outstanding I/O requests, the application loses those completion notifications forever. As we will see later, I/O completion ports do not have this restriction.
    The following code demonstrates the use of QueueUserAPC.

/************************************************************************/
/* APC Test.                                                            */
/************************************************************************/

#include <windows.h>
#include <stdio.h>

DWORD WINAPI WorkThread(PVOID pParam)
{
    HANDLE Event = (HANDLE)pParam;

    for(;;)
    {
        DWORD dwRet = WaitForSingleObjectEx(Event, INFINITE, TRUE);
        if(dwRet == WAIT_OBJECT_0)
            break;
        else if(dwRet == WAIT_IO_COMPLETION)
            printf("WAIT_IO_COMPLETION\n");
    }

    return 0;
}

VOID CALLBACK APCProc(ULONG_PTR dwParam) // PAPCFUNC takes a ULONG_PTR
{
    printf("%s", (PCSTR)dwParam);
}

void TestAPC(BOOL bFast)
{
    HANDLE QuitEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

    HANDLE hThread = CreateThread(NULL,
        0,
        WorkThread,
        (PVOID)QuitEvent,
        0,
        NULL);

    Sleep(100); // Give WorkThread a moment to start.

    for(int i=5; i>0; i--)
    {
        QueueUserAPC(APCProc, hThread, (ULONG_PTR)"APC here\n");

        if(!bFast)
            Sleep(1000);
    }

    SetEvent(QuitEvent);
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    CloseHandle(QuitEvent);
}
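
    The QueueUserAPC demo above never issues real I/O. For completeness, here is a minimal sketch of alertable file I/O itself, assuming hFile was opened with FILE_FLAG_OVERLAPPED; the completion routine runs only while the issuing thread performs an alertable wait.

VOID CALLBACK OnReadDone(DWORD dwErrorCode, DWORD dwBytes, LPOVERLAPPED lpOverlapped)
{
    printf("Read completed: error=%lu, bytes=%lu\n", dwErrorCode, dwBytes);
}

void TestAlertableRead(HANDLE hFile) // opened with FILE_FLAG_OVERLAPPED
{
    static char Buffer[4096];
    OVERLAPPED Ov = {0}; // ReadFileEx ignores hEvent; read at offset 0

    if(!ReadFileEx(hFile, Buffer, sizeof(Buffer), &Ov, OnReadDone))
        return;

    // The completion routine is queued as an APC and runs inside this
    // alertable wait; SleepEx then returns WAIT_IO_COMPLETION.
    SleepEx(INFINITE, TRUE);
}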


References

1. MSDN Library
2. 《Windows高级编程指南》
3. 《Windows核心编程》
4. 《Windows 2000 设备驱动程序设计指南》

Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers, Part 3: I/O Completion Ports

I/O Completion Ports

The following is excerpted from the MSDN article "I/O Completion Ports" (Chinese translation by smallfool; see the CSDN documentation center article "I/O Completion Ports", http://dev.csdn.net/Develop/article/29%5C29240.shtm).
An I/O completion port is a mechanism whereby an application creates a pool of threads at startup and then uses that pool to service asynchronous I/O requests. These threads are created for the sole purpose of handling I/O requests. For applications that handle many concurrent asynchronous I/O requests, completion ports are faster and more efficient than creating a thread at the time an I/O request arrives.
The CreateIoCompletionPort function associates an I/O completion port with one or more file handles. When an asynchronous I/O operation started on a file handle associated with a completion port completes, an I/O completion packet is queued to the port. For multiple file handles, this mechanism combines their synchronization point into a single object. (In other words, where we would otherwise need a separate object, such as an event, to synchronize with each file handle, a completion port lets us associate many files with one object: whenever an asynchronous operation on any of them completes, a completion packet is queued, and that single port serves as the synchronization point for all of the handles.)
A thread waits for a completion packet to be queued to the port by calling the GetQueuedCompletionStatus function, rather than waiting directly for an asynchronous I/O request to finish. Threads that block on a completion port are released in last-in-first-out (LIFO) order: when a packet is queued to the port, the system releases the thread that blocked on it most recently.
When a thread calls GetQueuedCompletionStatus, it becomes associated with the specified completion port and remains associated until it exits, specifies a different completion port, or frees the association. A thread can be associated with at most one completion port.
The most important property of a completion port is its concurrency value, which is specified when the port is created. The concurrency value limits the number of runnable threads associated with the port: when the total number of runnable associated threads reaches the concurrency value, the system blocks the execution of any subsequent associated threads until the number of runnable threads drops below the concurrency value again. The most efficient scenario occurs when completion packets are waiting in the queue but no waits can be satisfied because the port has reached its concurrency limit: a running thread that calls GetQueuedCompletionStatus then picks up a packet immediately, and no context switch occurs, because the running threads continuously pull packets off the queue while the other threads remain blocked.
The best value to pick for the concurrency is usually the number of CPUs in the machine. If your transactions require lengthy computation, a larger concurrency value allows more threads to run; each transaction takes longer to complete, but more transactions are processed at the same time. It is easy to tune the concurrency value with tests to obtain the best effect for an application.
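
For example, a port whose concurrency equals the CPU count could be created like this (a sketch; passing 0 as the last parameter has the same effect):

SYSTEM_INFO si;
GetSystemInfo(&si);

HANDLE hIocp = CreateIoCompletionPort(
    INVALID_HANDLE_VALUE,        // no file handle yet
    NULL,                        // create a new port
    0,                           // completion key unused here
    si.dwNumberOfProcessors);    // concurrency value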
The PostQueuedCompletionStatus function lets an application queue its own special-purpose I/O completion packets without starting an asynchronous I/O operation. This is useful, for example, for notifying worker threads of external events.
Once a completion port is no longer referenced, it must be freed: the completion port handle and every file handle associated with the port need to be closed. Call CloseHandle to release the completion port handle.

The following code uses an I/O completion port to build a simple thread pool.

/************************************************************************/
/* Test IOCompletePort.                                                 */
/************************************************************************/

// Note: CompleteEvent, BeginTime, ItemCount, WORK_ITEM_PROC and UserProc1
// are defined in the thread-pool listings later in this article.

DWORD WINAPI IOCPWorkThread(PVOID pParam)
{
    HANDLE CompletePort = (HANDLE)pParam;
    DWORD NumberOfBytes;     // carries the user parameter
    ULONG_PTR CompletionKey; // carries the user procedure
    WORK_ITEM_PROC UserProc;
    LPOVERLAPPED pOverlapped;
    
    for(;;)
    {
        BOOL bRet = GetQueuedCompletionStatus(
            CompletePort,
            &NumberOfBytes,
            &CompletionKey,
            &pOverlapped,
            INFINITE);

        _ASSERT(bRet);

        UserProc = (WORK_ITEM_PROC)CompletionKey;
        if(UserProc == NULL) // Quit signal.
            break;

        // execute user's proc.
        UserProc((PVOID)(ULONG_PTR)NumberOfBytes);
    }

    return 0;
}

void TestIOCompletePort(BOOL bWaitMode, LONG ThreadNum)
{
    HANDLE CompletePort;
    OVERLAPPED Overlapped = {0, 0, 0, 0, NULL};

    CompletePort = CreateIoCompletionPort(
        INVALID_HANDLE_VALUE,
        NULL,
        0,  // no completion key
        0); // concurrency limit: 0 = number of CPUs
    
    // Create threads.
    for(int i=0; i<ThreadNum; i++)
    {
        HANDLE hThread = CreateThread(NULL,
            0,
            IOCPWorkThread,
            CompletePort,
            0,
            NULL);

        CloseHandle(hThread);
    }


    CompleteEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    BeginTime = GetTickCount();
    ItemCount = 20;

    for(int i=0; i<20; i++)
    {
        PostQueuedCompletionStatus(
            CompletePort,
            (DWORD)bWaitMode,     // user parameter travels in the byte count
            (ULONG_PTR)UserProc1, // user procedure travels in the completion key
            &Overlapped);
    }
    
    WaitForSingleObject(CompleteEvent, INFINITE);
    CloseHandle(CompleteEvent);


    // Destroy all threads.
    for(int i=0; i<ThreadNum; i++)
    {
        PostQueuedCompletionStatus(
            CompletePort,
            0,
            0, // NULL proc = quit signal
            &Overlapped);
    }

    Sleep(1000); // crude wait for all threads to exit.

    CloseHandle(CompletePort);
}


References

1. MSDN Library
2. 《Windows高级编程指南》
3. 《Windows核心编程》
4. 《Windows 2000 设备驱动程序设计指南》


Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers, Part 4: Thread Pools

Thread Pools

The following is excerpted from the MSDN article "Thread Pooling".
Many applications create threads that spend a great deal of time asleep, waiting for an event to occur. Other threads sleep and are awakened periodically to poll for changes or to update status information. Thread pooling lets you use threads more efficiently by providing your application with a pool of worker threads managed by the system. At least one thread monitors the status of all wait operations queued to the pool; when a wait completes, a worker thread from the pool executes the corresponding callback function.
You can also queue work items that are not tied to a wait operation with the QueueUserWorkItem function, passing the work-item function to the pool as a parameter. Once a work item has been queued, it cannot be canceled.
Timer-queue timers and registered wait operations also use the thread pool; their callbacks are queued to the pool as well. You can likewise use the BindIoCompletionCallback function to post asynchronous I/O operations: when the I/O completes on the completion port, the callback is executed by a pool thread.
The thread pool is created automatically the first time QueueUserWorkItem or BindIoCompletionCallback is called, or when a timer-queue timer or registered wait operation queues a callback. The number of threads the pool can create is limited only by available memory; each thread uses the default initial stack size and runs at the default priority.
There are two types of threads in the pool: I/O threads and non-I/O threads. An I/O thread waits in an alertable state, and work items are queued to I/O threads as APCs. If your work item must run in a thread that is in an alertable state, queue it to an I/O thread.
Non-I/O worker threads wait on an I/O completion port. Non-I/O threads are more efficient than I/O threads, so use non-I/O threads whenever possible. Neither I/O nor non-I/O threads exit while asynchronous I/O requests are still pending on them; however, do not issue long-running asynchronous I/O requests from a non-I/O thread.
To use the thread pool correctly, the work-item function and all functions it calls must be pool-safe: a safe function does not assume that the thread running it is either a transient thread or a persistent one. In general, avoid thread-local storage and avoid asynchronous calls that require a persistent thread, such as the RegNotifyChangeKeyValue function. If such a function must run on a persistent thread, pass the WT_EXECUTEINPERSISTENTTHREAD option to QueueUserWorkItem.
Note that the thread pool is not compatible with the COM single-threaded apartment (STA) model.
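
As a sketch of the BindIoCompletionCallback path mentioned above (the handle is assumed to have been opened with FILE_FLAG_OVERLAPPED; names are illustrative):

VOID CALLBACK IoDone(DWORD dwErrorCode, DWORD dwBytes, LPOVERLAPPED lpOverlapped)
{
    // Runs on a non-I/O worker thread of the system thread pool.
    printf("error=%lu bytes=%lu\n", dwErrorCode, dwBytes);
}

BOOL StartPooledRead(HANDLE hFile, PVOID Buffer, DWORD Size, LPOVERLAPPED pOv)
{
    // Associate the handle with the thread pool's I/O completion port.
    if(!BindIoCompletionCallback(hFile, IoDone, 0))
        return FALSE;

    // Any overlapped I/O on hFile now completes through IoDone.
    if(!ReadFile(hFile, Buffer, Size, NULL, pOv) &&
       GetLastError() != ERROR_IO_PENDING)
        return FALSE;

    return TRUE;
}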

    To show more concretely what makes the operating system's thread pool superior, let us first try to implement a simple thread pool model of our own.

    The code follows:

/************************************************************************/
/* Test Our own thread pool.                                            */
/************************************************************************/

// Note: InitializeListHead, InsertTailList, RemoveHeadList and IsListEmpty
// are the standard LIST_ENTRY helpers from the DDK headers; a user-mode
// build must define them locally. CONTAINING_RECORD comes from winnt.h.

typedef struct _THREAD_POOL
{
    HANDLE QuitEvent;
    HANDLE WorkItemSemaphore;

    LONG WorkItemCount;
    LIST_ENTRY WorkItemHeader;
    CRITICAL_SECTION WorkItemLock;

    LONG ThreadNum;
    HANDLE *ThreadsArray;

}THREAD_POOL, *PTHREAD_POOL;

typedef VOID (*WORK_ITEM_PROC)(PVOID Param);

typedef struct _WORK_ITEM
{
    LIST_ENTRY List;

    WORK_ITEM_PROC UserProc;
    PVOID UserParam;
    
}WORK_ITEM, *PWORK_ITEM;


DWORD WINAPI WorkerThread(PVOID pParam)
{
    PTHREAD_POOL pThreadPool = (PTHREAD_POOL)pParam;
    HANDLE Events[2];
    
    Events[0] = pThreadPool->QuitEvent;
    Events[1] = pThreadPool->WorkItemSemaphore;

    for(;;)
    {
        DWORD dwRet = WaitForMultipleObjects(2, Events, FALSE, INFINITE);

        if(dwRet == WAIT_OBJECT_0)
            break;

        //
        // execute user's proc.
        //

        else if(dwRet == WAIT_OBJECT_0 +1)
        {
            PWORK_ITEM pWorkItem;
            PLIST_ENTRY pList;

            EnterCriticalSection(&pThreadPool->WorkItemLock);
            _ASSERT(!IsListEmpty(&pThreadPool->WorkItemHeader));
            pList = RemoveHeadList(&pThreadPool->WorkItemHeader);
            LeaveCriticalSection(&pThreadPool->WorkItemLock);

            pWorkItem = CONTAINING_RECORD(pList, WORK_ITEM, List);
            pWorkItem->UserProc(pWorkItem->UserParam);

            InterlockedDecrement(&pThreadPool->WorkItemCount);
            free(pWorkItem);
        }

        else
        {
            _ASSERT(0);
            break;
        }
    }

    return 0;
}

BOOL InitializeThreadPool(PTHREAD_POOL pThreadPool, LONG ThreadNum)
{
    pThreadPool->QuitEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    pThreadPool->WorkItemSemaphore = CreateSemaphore(NULL, 0, 0x7FFFFFFF, NULL);
    pThreadPool->WorkItemCount = 0;
    InitializeListHead(&pThreadPool->WorkItemHeader);
    InitializeCriticalSection(&pThreadPool->WorkItemLock);
    pThreadPool->ThreadNum = ThreadNum;
    pThreadPool->ThreadsArray = (HANDLE*)malloc(sizeof(HANDLE) * ThreadNum);

    for(int i=0; i<ThreadNum; i++)
    {
        pThreadPool->ThreadsArray[i] = CreateThread(NULL, 0, WorkerThread, pThreadPool, 0, NULL);
    }

    return TRUE;
}

VOID DestroyThreadPool(PTHREAD_POOL pThreadPool)
{
    SetEvent(pThreadPool->QuitEvent);

    for(int i=0; i<pThreadPool->ThreadNum; i++)
    {
        WaitForSingleObject(pThreadPool->ThreadsArray[i], INFINITE);
        CloseHandle(pThreadPool->ThreadsArray[i]);
    }

    free(pThreadPool->ThreadsArray);

    CloseHandle(pThreadPool->QuitEvent);
    CloseHandle(pThreadPool->WorkItemSemaphore);
    DeleteCriticalSection(&pThreadPool->WorkItemLock);

    while(!IsListEmpty(&pThreadPool->WorkItemHeader))
    {
        PWORK_ITEM pWorkItem;
        PLIST_ENTRY pList;
        
        pList = RemoveHeadList(&pThreadPool->WorkItemHeader);
        pWorkItem = CONTAINING_RECORD(pList, WORK_ITEM, List);
        
        free(pWorkItem);
    }
}

BOOL PostWorkItem(PTHREAD_POOL pThreadPool, WORK_ITEM_PROC UserProc, PVOID UserParam)
{
    PWORK_ITEM pWorkItem = (PWORK_ITEM)malloc(sizeof(WORK_ITEM));
    if(pWorkItem == NULL)
        return FALSE;

    pWorkItem->UserProc = UserProc;
    pWorkItem->UserParam = UserParam;

    EnterCriticalSection(&pThreadPool->WorkItemLock);
    InsertTailList(&pThreadPool->WorkItemHeader, &pWorkItem->List);
    LeaveCriticalSection(&pThreadPool->WorkItemLock);

    InterlockedIncrement(&pThreadPool->WorkItemCount);

    ReleaseSemaphore(pThreadPool->WorkItemSemaphore, 1, NULL);

    return TRUE;
}

DWORD WINAPI WorkItem(LPVOID lpParameter); // defined in the next listing

VOID UserProc1(PVOID dwParam)
{
    WorkItem(dwParam);
}

void TestSimpleThreadPool(BOOL bWaitMode, LONG ThreadNum)
{
    THREAD_POOL ThreadPool;    
    InitializeThreadPool(&ThreadPool, ThreadNum);
    
    CompleteEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    BeginTime = GetTickCount();
    ItemCount = 20;

    for(int i=0; i<20; i++)
    {
        PostWorkItem(&ThreadPool, UserProc1, (PVOID)bWaitMode);
    }
    
    WaitForSingleObject(CompleteEvent, INFINITE);
    CloseHandle(CompleteEvent);

    DestroyThreadPool(&ThreadPool);
}
    We put work items on a queue and signal the thread pool with a semaphore; any one of the pool's threads takes a work item and executes it, then returns to the pool to wait for new work.
    The number of threads in the pool is fixed: the threads are created up front and are persistent, destroyed only when the pool itself is destroyed.
    Every thread has an equal, random chance of picking up a work item; nothing gives any particular thread priority in acquiring work.
    Furthermore, there is no limit on how many threads may run concurrently; in fact, in our compute-bound demonstration code, all of the threads run at once.
    Now let us see how the system-provided thread pool accomplishes the same task.

/************************************************************************/
/* QueueWorkItem Test.                                                  */
/************************************************************************/

DWORD BeginTime;
LONG  ItemCount;
HANDLE CompleteEvent;

int compute()
{
    srand(BeginTime);

    for(int i=0; i<20 *1000 * 1000; i++)
        rand();

    return rand();
}

DWORD WINAPI WorkItem(LPVOID lpParameter)
{
    BOOL bWaitMode = (BOOL)lpParameter;

    if(bWaitMode)
        Sleep(1000);
    else
        compute();

    if(InterlockedDecrement(&ItemCount) == 0)
    {
        printf("Time total %d second.\n", GetTickCount() - BeginTime);
        SetEvent(CompleteEvent);
    }

    return 0;
}

void TestWorkItem(BOOL bWaitMode, DWORD Flag)
{
    CompleteEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    BeginTime = GetTickCount();
    ItemCount = 20;
    
    for(int i=0; i<20; i++)
    {
        QueueUserWorkItem(WorkItem, (PVOID)bWaitMode, Flag);
    }    

    WaitForSingleObject(CompleteEvent, INFINITE);
    CloseHandle(CompleteEvent);
}
    Simple, isn't it? We only need to write the callback function. Compared with our toy version, however, the system-provided thread pool has more advantages.
    First, the number of threads in the pool is adjusted dynamically. Second, the pool exploits I/O completion ports to limit the number of concurrently running threads, by default to the number of CPUs, which reduces thread switching; and it prefers to rerun the most recently executed thread, avoiding unnecessary switches.
    We will look at the strategy behind the system-provided thread pool in the next part.

References

1. MSDN Library
2. 《Windows高级编程指南》
3. 《Windows核心编程》
4. 《Windows 2000 设备驱动程序设计指南》

Asynchronous I/O, APC, I/O Completion Ports, Thread Pools, and High-Performance Servers, Part 5: Server Performance Metrics and How to Achieve High Performance

Server Performance Metrics

    For a network server program, performance is always the number-one metric. Performance can be defined as the amount of work processed in a given time on given hardware; a good server design is one that exploits the hardware's capability to the fullest.
    A well-designed server should also serve clients evenly: every client should get a fair share of service, and no client should be starved of service for long periods.
    Scalability matters as well: as the hardware's capability grows, the server's performance should grow linearly with it.

Paths to High Performance

    A real server's workload is complex, usually mixing I/O-bound and CPU-bound computation. I/O-bound work is dominated by I/O, as in file servers and mail servers, which mix heavy network I/O with file I/O; CPU-bound work involves little or no I/O, as in encryption/decryption, encoding/decoding, and mathematical computation.
    For CPU-bound work, single-threaded and multithreaded models perform about equally. As 《Win32多线程的性能》 puts it: "On a single-processor machine, CPU-bound tasks cannot execute faster concurrently than serially, but as we can see, the overhead of thread creation and switching under Windows NT is very small; for very short computations, concurrent execution is only about 10% slower than serial execution, and as the computation grows longer the two times converge."
    Clearly, for pure CPU computation on a single CPU, a multithreaded model is inappropriate. Consider a service performing heavy CPU-bound work: with dozens of such threads executing concurrently, overly frequent task switching causes needless performance loss.
    In programming terms, however, a single-threaded computation model is very inconvenient for server design. A thread-pool model is therefore a good fit for CPU-bound work; QueueUserWorkItem is well suited to handing a CPU-bound computation to the pool. The pool implementation works to reduce unnecessary thread switches and caps the number of concurrent threads at the number of CPUs.
    What really deserves our attention is I/O-bound computation, which ordinary network servers perform in large amounts. The path to performance is to avoid leaving the CPU idle while waiting for I/O to finish, and instead to exploit the hardware by letting one or more I/O devices run in parallel with the CPU. Asynchronous I/O, APCs, and I/O completion ports, introduced earlier, can all achieve this.
    For a network server with a modest number of concurrent client requests, a simple multithreaded model is enough: if a thread blocks waiting for an I/O operation to complete, the operating system schedules another ready thread, and execution stays concurrent. The classic server design is thread-per-client (or process-per-client): when a client connects, the server creates a thread and lets it handle that client's subsequent traffic. Representing each client with a dedicated thread or process is intuitive and easy to understand.
    For large network servers, this approach has limits. First, creating and destroying threads or processes is very expensive, especially when the server communicates over short-lived TCP connections or UDP. In HTTP, for example, a client opens a connection, sends a request, the server responds, and the connection is closed; an HTTP server built in the classic style creates and destroys threads so frequently that performance suffers badly.
    Second, even for a protocol using long-lived TCP connections, where a client connects once and stays connected, the classic design has drawbacks. Under a high concurrent request load, with many clients awaiting responses at the same moment, too many threads run concurrently, and frequent thread switching consumes part of the computing capacity. In fact, with too many concurrent threads, physical memory is often exhausted prematurely and the bulk of the time goes to thread switching, since switching also triggers memory paging; server performance then degrades sharply.
    For a server that must handle a large number of concurrent client requests at once, a thread pool is the only solution: it not only avoids frequent thread creation and destruction, but also serves a large number of concurrent client requests with only a handful of threads.
    Note that for a server under light load, we recommend none of the techniques above. When a simple design can do the job, making things complicated is unwise, even foolish.

Windows Sockets 2.0: Write Scalable Winsock Apps Using Completion Ports

Anthony Jones and Amol Deshpande
This article assumes you're familiar with Winsock API, TCP/IP, Win32 API
Download the code for this article: Jones1000.exe (33KB)
SUMMARY Writing a network-aware application isn't difficult, but writing one that is scalable can be challenging. Overlapped I/O using completion ports provides true scalability on Windows NT and Windows 2000. Completion ports and Windows Sockets 2.0 can be used to design applications that will scale to thousands of connections.
The article begins with a discussion of the implementation of a scalable server, discusses handling low-resource, high-demand conditions, and addresses the most common problems with scalability.

Learning to write network-aware applications has never been considered easy. In reality, though, there are just a few principles to master—creating and connecting a socket, accepting a connection, and sending and receiving data. The real difficulty is writing network applications that scale from a single connection to many thousands of connections. This article will discuss the development of scalable Windows NT® and Windows 2000-based applications that use Windows® Sockets 2.0 (Winsock2). The primary focus will be the server side of the client/server model; however, many of the topics discussed apply to both.
Because the notion of writing a scalable Winsock application implies a server application, the following discussion is pertinent to applications running on Windows NT 4.0 and Windows 2000. We're not including Windows NT 3.x because this solution relies on the features of Winsock2 that are available only on Windows NT 4.0 and newer.

APIs and Scalability
The overlapped I/O mechanism in Win32® allows an application to initiate an operation and receive notification of its completion later. This is especially useful for operations that take a long time to complete. The thread that initiates the overlapped operation is then free to do other things while the overlapped request completes behind the scenes. The only I/O model that provides true scalability on Windows NT and Windows 2000 is overlapped I/O using completion ports for notification. Mechanisms like the WSAAsyncSelect and select functions are provided for easier porting from Windows 3.1 and Unix respectively, but are not designed to scale. The completion port mechanism is optimized for the operating system's internal workings.

Completion Ports
A completion port is a queue into which the operating system puts notifications of completed overlapped I/O requests. Once the operation completes, a notification is sent to a worker thread that can process the result. A socket may be associated with a completion port at any point after creation.
Typically an application will also create a number of worker threads to process these notifications. The number of worker threads depends on the specific needs of the application. The ideal number is one per processor, but that implies that none of these threads should execute a blocking operation such as a synchronous read/write or a wait on an event. Each thread is given a certain amount of CPU time, known as the quantum, for which it can execute before another thread is allowed to grab a time slice. If a thread performs a blocking operation, the operating system will throw away its unused time slice and let other threads execute instead. Thus, the first thread has not fully utilized its quantum, and the application should therefore have other threads ready to run and utilize that time slot.
Using a completion port is a two-step process. First, the completion port is created, as shown in the following code:
HANDLE    hIocp;

hIocp = CreateIoCompletionPort(
    INVALID_HANDLE_VALUE,
    NULL,
    (ULONG_PTR)0,
    0);
if (hIocp == NULL) {
    // Error
}
Once the completion port is created, each socket that wants to use the completion port must be associated with it. This is done by calling CreateIoCompletionPort again, this time setting the first parameter, FileHandle, to the socket handle to be associated, and setting ExistingCompletionPort to the handle of the completion port you just created.
The following code creates a socket and associates it with the completion port created earlier:
SOCKET    s;

s = socket(AF_INET, SOCK_STREAM, 0);
if (s == INVALID_SOCKET) {
    // Error
}

if (CreateIoCompletionPort((HANDLE)s,
                           hIocp,
                           (ULONG_PTR)0,
                           0) == NULL)
{
    // Error
}
At this point, the socket s is associated with the completion port. Any overlapped operations performed on the socket will use the completion port for notification. Note that the third parameter of CreateIoCompletionPort allows a completion key to be specified along with the socket handle to be associated. This can be used to pass context information that is associated with the socket. Each time a completion notification arrives, this context information can be retrieved.
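For instance, a pointer to a per-connection structure (names hypothetical) can serve as the completion key:

typedef struct _PER_SOCKET_CONTEXT {
    SOCKET      Socket;
    SOCKADDR_IN RemoteAddr;
    // ... other per-connection state ...
} PER_SOCKET_CONTEXT;

PER_SOCKET_CONTEXT *pCtx =
    (PER_SOCKET_CONTEXT *)malloc(sizeof(PER_SOCKET_CONTEXT));
pCtx->Socket = s;

// The pointer is handed back as the completion key with every
// completion notification for this socket.
CreateIoCompletionPort((HANDLE)s, hIocp, (ULONG_PTR)pCtx, 0);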
Once the completion port has been created and sockets have been associated with it, one or more threads are needed to process the completion notifications. Each thread will sit in a loop that calls GetQueuedCompletionStatus each time through and returns completion notifications.
Before illustrating what a typical worker thread looks like, we need to address the ways in which an application keeps track of its overlapped operations. When an overlapped call is made, a pointer to an overlapped structure is passed as a parameter. GetQueuedCompletionStatus will return the same pointer when the operation completes. With this structure alone, however, an application can't tell which operation just completed. In order to keep track of the operations that have completed, it's useful to define your own OVERLAPPED structure that contains any extra information about each operation queued to the completion port (see Figure 1).
Whenever an overlapped operation is performed, an OVERLAPPEDPLUS structure is passed as the lpOverlapped parameter (as in WSASend, WSARecv, and so on). This allows you to set operation state information for each overlapped call. When the operation completes, the OVERLAPPED pointer returned from GetQueuedCompletionStatus will now point to your extended structure. Note that the OVERLAPPED field within the extended structure does not necessarily have to be the first field. After the pointer to the OVERLAPPED structure is returned, the CONTAINING_RECORD macro can be used to obtain a pointer to the extended structure.
Take a look at the example worker thread in Figure 2. The PerHandleKey variable will return anything that was passed as the CompletionKey parameter to CreateIoCompletionPort when associating a given socket handle. The Overlap parameter returns a pointer to the OVERLAPPEDPLUS structure that is used to initiate the overlapped operation. Keep in mind that if an overlapped operation fails immediately (that is, returns SOCKET_ERROR and the error is not WSA_IO_PENDING), then no completion notification will be posted to the queue. Alternately, if the overlapped call succeeds or fails with WSA_IO_PENDING, a completion event will always be posted to the completion port.
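Figures 1 and 2 are not reproduced here; the following sketch captures their substance (the operation codes and structure layout are illustrative, not the article's exact listing):

typedef enum _IO_OP { OP_READ, OP_WRITE, OP_ACCEPT } IO_OP;

typedef struct _OVERLAPPEDPLUS {
    OVERLAPPED ol;     // must remain valid until the operation completes
    IO_OP      OpCode; // which operation this request represents
    // ... buffers and other per-operation state ...
} OVERLAPPEDPLUS;

DWORD WINAPI WorkerThread(LPVOID lpParam)
{
    HANDLE hIocp = (HANDLE)lpParam;
    DWORD BytesTransferred;
    ULONG_PTR PerHandleKey;
    LPOVERLAPPED Overlap;

    for (;;)
    {
        BOOL bOk = GetQueuedCompletionStatus(hIocp, &BytesTransferred,
                                             &PerHandleKey, &Overlap, INFINITE);
        if (!bOk && Overlap == NULL)
            break; // the call itself failed (e.g., the port was closed)

        // Recover the extended structure from the OVERLAPPED pointer;
        // bOk == FALSE with Overlap != NULL means the I/O itself failed.
        OVERLAPPEDPLUS *pOp = CONTAINING_RECORD(Overlap, OVERLAPPEDPLUS, ol);

        switch (pOp->OpCode)
        {
        case OP_READ:   /* handle a completed read   */ break;
        case OP_WRITE:  /* handle a completed write  */ break;
        case OP_ACCEPT: /* handle an accepted socket */ break;
        }
    }
    return 0;
}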
For more information on using completion ports with Winsock, take a look at the Microsoft® Platform SDK, which includes a Winsock completion port sample (under the Winsock section in the iocp directory). Additionally, consult Network Programming for Microsoft Windows by Anthony Jones and Jim Ohlund (Microsoft Press, 1999), which includes samples for completion ports as well as the other I/O models.

The Windows NT and Windows 2000 Sockets Architecture
A basic understanding of the sockets architecture of Windows NT and Windows 2000 is helpful in fully comprehending the principles of scalability. Figure 3 illustrates the current implementation of Winsock in Windows 2000. An application should not depend on the specific details mentioned here (names of drivers, DLLs, and so on), as these may change in a future release of the operating system.

Figure 3 Socket Architecture

The Windows Sockets 2.0 specification allows for a variety of protocols and their related providers. These user-mode service providers can be layered on top of existing providers in order to extend their functionality. For example, a proxy layered service provider (LSP) may install itself on top of the existing TCP/IP provider. This allows the proxy LSP to intercept and redirect or log calls to the base provider.
Unlike some other operating systems, the Windows NT and Windows 2000 transport protocols do not have a sockets-style interface which applications can use to talk to them directly. Instead, they implement a much more general API called the Transport Driver Interface (TDI). The generality of this API keeps the subsystems of Windows NT from being tied to a particular flavor-of-the-decade network programming interface. The Winsock kernel mode driver provides the sockets emulation (currently implemented in AFD.SYS). This driver is responsible for the connection and buffer management needed to provide a sockets-style interface to an application. AFD.SYS, in turn, uses TDI to talk to the transport protocol driver.

Who Manages the Buffers?
As just mentioned, AFD.SYS handles buffer management for applications that use Winsock to talk to the transport protocol drivers. This means that when an application calls the send or WSASend function to send data, the data gets copied by AFD.SYS to its internal buffers (up to the SO_SNDBUF setting) and the send or WSASend function returns immediately. The data is then sent by AFD.SYS behind the application's back, so to speak. Of course, if the application wants to issue a send for a buffer larger than the SO_SNDBUF setting, the WSASend call blocks until all the data is sent.
Similarly, on receiving data from the remote client, AFD.SYS will copy the data to its own buffers as long as there is no outstanding data to receive from the application, and as long as the SO_RCVBUF setting is not exceeded. When the application calls recv or WSARecv, the data is copied from AFD.SYS's buffers to the application-provided buffer.
In most cases, this architecture works very well. This is especially true for applications that use traditional socket paradigms with nonoverlapped sends and receives. Before going apoplectic over the buffer copying that's involved in sending and receiving data, a programmer should take great care to understand the consequences of turning off the buffering in AFD.SYS, which can be done by setting the SO_SNDBUF and SO_RCVBUF values to 0 using the setsockopt API.
Consider, for example, an application that turns off buffering by setting SO_SNDBUF to 0 and issues a blocking send. In this case, the application's buffer is locked into memory by the kernel and the send API does not return until the other end of the connection acknowledges the entire buffer. That may seem like a neat way to determine whether all your data has actually been received by the other side, but in fact it is a bad thing to do. For one thing, even acknowledgment by the remote TCP is no guarantee that the data will be delivered to the client application, as there may be out-of-resource conditions that prevent it from copying the data from AFD.SYS. An even more significant problem with this approach is that your application can only do one send at a time in each thread. This is extremely inefficient, to say the least.
Turning off receive buffering in AFD.SYS by setting SO_RCVBUF to 0 offers no real performance gains. Setting the receive buffer to 0 forces received data to be buffered at a lower layer than Winsock. Again, this leads to buffer copying when you actually post a receive, which defeats your purpose in turning off AFD's buffering.
It should be clear by now that turning off buffering is a really bad idea for most applications. Turning off receive buffering is not usually necessary, as long as the application takes care to always have a few overlapped WSARecvs outstanding on a connection. The availability of posted application buffers removes the need for AFD to buffer incoming data.
However, a high-performance server application can turn off the send buffering, yet not lose performance. Such an application must, however, take great care to ensure that it posts multiple overlapped sends, instead of waiting for one overlapped send to complete before posting another. If the application posts overlapped sends in a sequential manner, it wastes the time window between one send completion and the posting of the next send. If it had another buffer already posted, the transport would be able to use that buffer immediately and not wait for the application's next send operation.
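A sketch of keeping several overlapped sends in flight at once, reusing the OVERLAPPEDPLUS sketch above (buffer setup elided; each operation needs its own OVERLAPPED and buffer for its whole lifetime):

#define PENDING_SENDS 4

OVERLAPPEDPLUS SendOps[PENDING_SENDS];
WSABUF         Bufs[PENDING_SENDS]; // .buf/.len filled in by the caller

for (int i = 0; i < PENDING_SENDS; i++)
{
    ZeroMemory(&SendOps[i].ol, sizeof(OVERLAPPED));
    SendOps[i].OpCode = OP_WRITE;

    DWORD Sent;
    if (WSASend(s, &Bufs[i], 1, &Sent, 0, &SendOps[i].ol, NULL) == SOCKET_ERROR &&
        WSAGetLastError() != WSA_IO_PENDING)
    {
        // real failure: abort the connection
    }
    // Otherwise a completion packet will arrive on the port later.
}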

Resource Constraints
A major design goal of any server application is robustness. That is, you want your server application to ride out any transient problems that might occur, such as a spike in the number of client requests, temporary lack of available memory, or other relatively short-lived phenomena. To handle these incidents gracefully, the application developer should be aware of the resource constraints on typical Windows NT and Windows 2000-based systems.
The most basic resource that you have direct control over is the bandwidth of the network on which the application is sending data. It's a fair assumption that an application that uses the User Datagram Protocol (UDP) is probably already aware of this limitation, since such a server would want to minimize packet loss. However, even with TCP connections, a server should take great care to never overrun the network for extended periods of time. Otherwise, there will be a lot of retransmissions and aborted connections. The specifics of the bandwidth management method are application-dependent and are beyond the scope of this article.
Virtual memory used by the application also needs careful management. Conservative memory allocations and frees, perhaps using lookaside lists (a cache) to reuse previous allocations, will keep the server application's footprint smaller and allow the system to keep more of the application address space in memory all the time. An application can also use the SetProcessWorkingSetSize Win32 API to increase the amount of physical memory the operating system will let it use.
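For example (sizes hypothetical; the system treats them as a request, not a guarantee):

// Ask for a 16-64MB working set so that more of the address space
// stays resident under load.
SetProcessWorkingSetSize(GetCurrentProcess(),
                         16 * 1024 * 1024,   // minimum
                         64 * 1024 * 1024);  // maximum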
There are two other resource constraints that an application indirectly encounters when using Winsock. The first one is the locked page limit. Whenever an application posts a send or receive, and AFD.SYS's buffering is disabled, all pages in the buffer are locked into physical memory. They need to be locked because the memory will be accessed by kernel-mode drivers and cannot be paged out for the duration of the access. This would not be a problem in most circumstances, except that the operating system must make sure that there is always some pageable memory available to other applications. The goal is to prevent an ill-behaved application from locking up all of the physical RAM and bringing down the system. This means that your application must be conscious of hitting a system-defined limit on the number of pages locked in memory.
The limit on locked memory in Windows NT and Windows 2000 is about one-eighth the physical RAM for all applications combined. This is a rough estimate and should not be used as an exact figure on which to base calculations. Just be aware that an overlapped operation may occasionally fail with ERROR_INSUFFICIENT_RESOURCES, and this limitation is a likely cause if there are too many send/receives pending. The application should take care not to have an excessive amount of memory locked in this fashion. Also note that all pages containing your buffer(s) will be locked, so it pays to have buffers that are aligned on page boundaries.
The other resource limitation that an application will run into somewhere in its lifetime is the system non-paged pool limit. The Windows NT and Windows 2000 drivers have the ability to allocate memory from a special non-paged pool. The memory allocated from this region is never paged out. It is intended to store information that can be accessed by various kernel-mode components, some of which may not be able to access a location in memory that is paged out. Whenever an application creates a socket (or opens a file, for that matter), some amount of non-paged pool is allocated.
In addition, the act of binding and/or connecting a socket also results in additional non-paged pool allocations. Add to this the fact that an outstanding I/O request, such as a send or a receive, allocates a little more non-paged pool (a small structure is required to keep track of pending I/O operations), and you can see that eventually there will be a problem. The operating system therefore limits the amount of non-pageable memory. The exact amount of non-paged pool allocated per connection is different for Windows NT 4.0 and Windows 2000 and will likely be different again for future versions of Windows. In the interests of your application's longevity, you should not calculate the exact amount of non-paged pool you need.
However, the application must take care to avoid hitting the non-paged limit. When the system runs low on non-paged pool memory, you expose yourself to the risk that some driver that's completely unrelated to your application will throw a fit because it cannot allocate a non-paged pool at that particular time. In the worst case, this can lead to a system crash. This is especially likely (but impossible to predict in advance) in the presence of third-party devices and drivers on a system. You must also remember that there might be other server applications running on the same machine that consume non-paged pool memory. It is best to be very conservative in your resource estimation, and design the application accordingly.
Handling the resource constraints is complicated by the fact that there is no special error code returned when either of the conditions is encountered. The application will get generic WSAENOBUFS or ERROR_INSUFFICIENT_RESOURCES errors from various calls. To handle these errors, first increase the working set of the application to some reasonable maximum. (For more information on adjusting your working set, see the Bugslayer column by John Robbins in this issue of MSDN Magazine.) Then, if you still continue to get these errors, check the possibility that you may be exceeding the bandwidth of the medium. Once you have done that, make sure you don't have too many send or receives outstanding. Finally, if you still receive out-of-resource errors, you're most probably running into non-paged pool limits. To free up a non-paged pool, the application must close a good portion of its outstanding connections and wait for the transient situation to correct itself.

Accepting Connections
One of the most common things a server does is accept connections from clients. The AcceptEx function is the only Winsock API capable of using overlapped I/O to accept connections on a socket. The interesting thing about AcceptEx is that it requires an additional socket as one of the parameters to the API. In a normal, synchronous accept function call, the new socket is the return value from the API. However, since AcceptEx is an overlapped operation, the accepted socket must be created (but not bound or connected) in advance, and passed to the API. A typical pseudocode snippet that uses AcceptEx might look like the following:
do {
    -Wait for a previous AcceptEx to complete
    -Create a new socket and associate it with the completion port
    -Allocate context structure etc.
    -Post an AcceptEx request.
}while(TRUE); 
A responsive server must always have enough AcceptEx calls outstanding so that any client connection can be immediately handled. The number of posted AcceptEx operations will depend on the type of traffic your server expects. A high incoming connection rate (because of short-lived connections or spurts in traffic) requires more outstanding AcceptEx calls than an application where the clients connect infrequently. It may be wise to let the number of posted AcceptEx operations vary between application-specific low and high watermarks, and avoid deciding on one fixed number as the magic figure.
On Windows 2000, Winsock provides some help in determining if the application is falling behind on posting AcceptEx requests. When creating the listening socket, associate it with an event by using the WSAEventSelect API and registering for an FD_ACCEPT notification. If there are no accept operations pending, the event will be signaled by an incoming connection. This event can thus be used as an indication that you need to post more AcceptEx requests or detect a possible misbehaving remote entity, as we'll describe shortly. This mechanism is not available on Windows NT 4.0.
A significant benefit to using the AcceptEx call is the ability to receive data and accept a client connection in one call via the lpOutputBuffer parameter. This means that if a client connects and immediately sends data, AcceptEx will complete only after the connection is established and the client sends data. This can be very useful, but it can also lead to problems since the AcceptEx call will not return until data is received, even if a connection has been established. This is because an AcceptEx call with an output buffer is not one atomic operation, but a two-step process consisting of accepting a connection and waiting for incoming data. However, the application is not notified that a connection has been accepted before data is received. That means a client could connect to your server and not send any data. With enough of these connections, your server will start to refuse connections to legitimate clients because it has no more accepts pending. This is a common method of waging a denial of service attack.
To prevent malicious attacks or stale connections, the accepting thread should occasionally check the sockets outstanding in AcceptEx by calling getsockopt and SO_CONNECT_TIME. The option value is set to the length of time the socket has been connected for, or -1 if it is still unconnected. The WSAEventSelect feature serves as an excellent indicator that the sockets that are outstanding in AcceptEx need their connection times checked. Any connections that have existed for a while without receiving data from the client should be terminated by closing the socket supplied to AcceptEx. An application should not, under most noncritical circumstances, close a socket that is outstanding in AcceptEx but not yet connected. For performance reasons, the kernel-mode data structures created for and associated with such an AcceptEx request will not be cleaned up until a new connection comes in or the listening socket itself is closed.
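A sketch of the SO_CONNECT_TIME check (the threshold is hypothetical):

// Returns TRUE if the socket handed to AcceptEx has been connected
// longer than dwThresholdSecs without the AcceptEx completing.
BOOL IsStaleAcceptSocket(SOCKET sAccept, DWORD dwThresholdSecs)
{
    int Seconds = 0;
    int Len = sizeof(Seconds);

    if (getsockopt(sAccept, SOL_SOCKET, SO_CONNECT_TIME,
                   (char *)&Seconds, &Len) == SOCKET_ERROR)
        return FALSE;

    // -1 means the socket is not yet connected; that is fine.
    return (Seconds != -1 && (DWORD)Seconds > dwThresholdSecs);
}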
It may seem that the logical thread to post AcceptEx requests is one of the worker threads that is associated with the completion port and involved in processing other I/O completion notifications. However, recall that a worker thread should not execute a blocking or high-latency system call if such an action can be avoided. One of the side effects of the layered architecture of Winsock2 is that the overhead to a socket or WSASocket API call may be significant. Every AcceptEx call requires the creation of a new socket, so it is best to have a separate thread that posts AcceptEx and is not involved in other I/O processing. You may also choose to use this thread for performing other tasks such as event logging.
One last thing to note about AcceptEx is that a Winsock2 implementation from another vendor is not required to implement these APIs. This also applies to the other APIs that are specific to Microsoft, such as TransmitFile, GetAcceptExSockAddrs, and any others that Microsoft may add in a later version of Windows. On systems running Windows NT and Windows 2000, these APIs are implemented in the Microsoft provider DLL (mswsock.dll), and can be invoked by linking with mswsock.lib, or dynamically loading the function pointers via WSAIoctl SIO_GET_EXTENSION_FUNCTION_POINTER.
Calling the function without previously obtaining a function pointer (that is, by linking with mswsock.lib and calling AcceptEx directly) is costly because AcceptEx sits outside the layered architecture of Winsock2. AcceptEx must request a function pointer using WSAIoctl for every call on the off chance that the application is actually trying to invoke AcceptEx from a provider layered on top of mswsock (see Figure 3). To avoid this significant performance penalty on each call, an application that intends to use these APIs should obtain the pointers to these functions directly from the layered provider by calling WSAIoctl.
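A sketch of obtaining the AcceptEx pointer once, up front:

#include <winsock2.h>
#include <mswsock.h> // WSAID_ACCEPTEX, LPFN_ACCEPTEX

LPFN_ACCEPTEX GetAcceptExPtr(SOCKET s)
{
    GUID          guidAcceptEx = WSAID_ACCEPTEX;
    LPFN_ACCEPTEX lpfnAcceptEx = NULL;
    DWORD         dwBytes = 0;

    // Ask the provider that actually services this socket for its
    // AcceptEx entry point, bypassing the per-call lookup in mswsock.
    if (WSAIoctl(s, SIO_GET_EXTENSION_FUNCTION_POINTER,
                 &guidAcceptEx, sizeof(guidAcceptEx),
                 &lpfnAcceptEx, sizeof(lpfnAcceptEx),
                 &dwBytes, NULL, NULL) == SOCKET_ERROR)
        return NULL;

    return lpfnAcceptEx;
}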

TransmitFile and TransmitPackets
Winsock offers two functions for transmitting data that are optimized for file and memory transfers. The TransmitFile API is present on both Windows NT 4.0 and Windows 2000, while TransmitPackets is a new Microsoft extension function that is expected to be available in a future release of Windows. TransmitFile allows the contents of a file to be transferred on a socket. Normally, if an application were to send the contents of a file over a socket, it would have to call CreateFile to open the file and then loop on ReadFile and WSASend until the entire file was read. This is very inefficient because each ReadFile and WSASend call requires a transition from user mode to kernel-mode. TransmitFile simply requires an open handle to the file to transmit and the number of bytes to transfer. The overhead is incurred when opening the file via CreateFile, followed by a single kernel-mode transition. If your app sends the contents of files over sockets, this is the API to use.
The TransmitPackets API takes the TransmitFile API a step further by allowing the caller to specify multiple file handles and memory buffers to be transmitted in a single call. The function prototype looks like this:
BOOL TransmitPackets(
  SOCKET hSocket,                             
  LPTRANSMIT_PACKET_ELEMENT lpPacketArray,
  DWORD nElementCount,                
  DWORD nSendSize,                
  LPOVERLAPPED lpOverlapped,                  
  DWORD dwFlags                               
); 
The lpPacketArray is an array of structures. Each entry can specify either a file handle or a memory buffer to be transmitted. The structure is defined as:
typedef struct _TRANSMIT_PACKETS_ELEMENT {
    DWORD dwElFlags;
    DWORD cLength;
    union {
        struct {
            LARGE_INTEGER nFileOffset;
            HANDLE        hFile;
        };
        PVOID pBuffer;
    };
} TRANSMIT_PACKETS_ELEMENT;
The fields are self-explanatory. The dwElFlags field identifies whether the current element specifies a file handle or a memory buffer via the constants TF_ELEMENT_FILE and TF_ELEMENT_MEMORY. The cLength field dictates how many bytes to send from the given data source (a zero indicates the entire file in the case of a file element). The unnamed union then contains the memory buffer or file handle (and possibly a file offset) for the data to be sent.
Another benefit of using these two APIs is that you can reuse the socket handle by specifying the TF_REUSE_SOCKET flag in addition to the TF_DISCONNECT flag. Once the API completes the data transfer, a transport-level disconnect is initiated. The socket can then be reused in an AcceptEx call. Using this optimization would lessen the overhead associated with creating sockets in the separate accept thread, as discussed earlier.
The only caveat of using either of these two extension APIs is that on Windows NT Workstation or Windows 2000 Professional only two requests will be processed at a time. You must be running on Windows NT or Windows 2000 Server, Windows 2000 Advanced Server, or Windows 2000 Data Center to get full usage of these specialized APIs.
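A sketch of the socket-reuse pattern with TransmitFile (handles hypothetical):

// Send the whole file, then disconnect at the transport level and
// leave the socket handle reusable by a subsequent AcceptEx call.
if (!TransmitFile(s, hFile,
                  0,    // 0 = transmit the entire file
                  0,    // default send size
                  NULL, // synchronous here; pass an OVERLAPPED normally
                  NULL, // no head/tail buffers
                  TF_DISCONNECT | TF_REUSE_SOCKET))
{
    // Error
}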

Putting it Together
In the preceding sections, we covered the APIs and methods necessary for high-performance, scalable applications, as well as the resource bottlenecks that may be encountered. What does this mean to you? Well, that depends on how your server and client are structured. The more control you have over the design of both the client and server, the better you can avoid bottlenecks.
Let's look at a sample scenario. In this situation we'll design a server that handles clients that connect, send a request, receive data from the server, and then disconnect. In this situation, the server will create a listening socket and associate it with a completion port, creating a worker thread for each CPU. Another thread will post the AcceptEx calls. Since you know the client will connect and immediately send data, supplying a receive buffer can make things substantially easier. Of course, you shouldn't forget to occasionally poll the client sockets used in the AcceptEx calls, using the SO_CONNECT_TIME option to make sure there are no stale connections.
An important issue in this design is to determine how many outstanding AcceptEx calls are allowed. Because a receive buffer is being posted with each accept call, a significant number of pages could be locked in memory. (Remember each overlapped operation consumes a small portion of non-paged pool and also locks any data buffers into memory.) There is no real answer or concrete formula for determining how many accept calls should be allowed. The best solution is to make this number tunable so that performance tests may be run to determine the best value for the typical environment that the server will be running in.
Now that you have determined how the server will accept connections, the next step is sending data. An important factor in deciding how to send data is the number of concurrent connections you expect the server to handle. In general, the server should limit the number of concurrent connections, as well as the number of outstanding send calls. More established connections mean more non-paged pool usage. The number of concurrent send calls should be limited to prevent reaching the locked pages limit. Again, both of these limits should be tunable.
In this situation it is not necessary to disable the per-socket receive buffers since the only receive that occurs is in AcceptEx call. Of course it wouldn't hurt for you to guarantee that each connection has a receive buffer posted. Now, if the client/server interaction changes so that the client sends additional data after the initial request, disabling the receive buffer would be a bad idea unless, in order to receive these additional requests, you guarantee that an overlapped receive is posted on each connection.

Conclusion
Developing a scalable Winsock server is not terribly difficult. It's a matter of setting up a listening socket, accepting connections, and making overlapped send and receive calls. The main challenge lies in managing resources by placing limits on the number of outstanding overlapped calls so that the non-paged pool is not exhausted. Following the guidelines we covered here will allow you to create high-performance, scalable server applications.