IT Blog - 幽灵狼 - Essay category - Development

Writing Programs with NCURSES

幽灵狼 — Sat, 25 Feb 2006 02:51:00 GMT

http://www.cs.mun.ca/~rod/ncurses/ncurses.html


Linux Netlink Sockets

幽灵狼 — Sat, 17 Dec 2005 09:46:00 GMT

Netlink sockets are the mechanism the Linux kernel uses to pass routing, interface and other miscellaneous networking information around, both within the kernel and between the kernel and userspace. Netlink replaces the old ioctl(2)-based method and is far superior. In fact, as soon as the kernel receives a networking ioctl, it converts it to a netlink message before shipping it off for further processing.

Basic Introduction

The netlink protocol uses a special type of socket(2) to communicate with the Linux kernel. This socket is called a "Netlink Socket", surprisingly enough, and is created by specifying AF_NETLINK as the first argument to socket(2). The socket type (second argument) can be either SOCK_DGRAM or SOCK_RAW; it makes absolutely no difference! The third argument (the netlink family) specifies which part of the Linux networking stack you want to talk to. For example, NETLINK_ROUTE lets you modify the routing table (and interfaces), and NETLINK_ARPD lets you manipulate the ARP table. A full list of available netlink families is in netlink(7).
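As a rough sketch of the socket(2) call described here, the helper below opens a netlink socket for a given family. The name open_netlink_socket is made up for illustration; only socket(2), AF_NETLINK and the NETLINK_* constants come from the system headers.

```c
#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical helper: open a netlink socket for the given netlink
 * family (the third argument to socket(2)). sock_type may be SOCK_RAW
 * or SOCK_DGRAM. Returns the file descriptor, or -1 with errno set. */
int open_netlink_socket(int sock_type, int netlink_family)
{
        return socket(AF_NETLINK, sock_type, netlink_family);
}
```

Calling open_netlink_socket(SOCK_RAW, NETLINK_ROUTE) and open_netlink_socket(SOCK_DGRAM, NETLINK_ROUTE) should behave the same way, which is the point made above about the socket type.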

NETLINK_ROUTE is the most commonly used netlink family. It is used to add, delete and modify routes in the kernel's routing table, and it can also add, delete and modify the interfaces on the machine.

Some of the basic Netlink principles are documented in RFC 3549.

Programming Netlink

There is somewhat of a lack of easy-to-read documentation on programming with netlink sockets, but the information is all there in the end. As a start, try the netlink(3), netlink(7), rtnetlink(3) and rtnetlink(7) manpages, which give a very technical description of the netlink protocol. All the information you need to write a program using netlink is contained in these manpages... should be easy from here, right?

The iproute2 package is the base implementation of the netlink interface. It replaces the old Linux networking utilities (ifconfig, route, etc.) with a single binary called ip, which performs all of their tasks through netlink. I highly recommend using this package as a reference when coding netlink-related applications. In particular, iproute2 contains a netlink library (libnetlink) that handles much of the low-level protocol interaction between your application and the kernel. Unfortunately the library is not separately packaged, and you'll have to spend some time extracting it from the iproute2 package before it is useful.

Coming Soon - Some basic examples of how to program using libnetlink -- Talk to MattBrown if you want them and they're not here yet!

(ha! It's been ages and you've not put up any examples! So I've written one that shows route add/del events, see LinuxNetlinkSocketExample --PerryLorier).

Applications Known to Use Netlink Sockets

Quagga
/sbin/ip (IpRoute2 package)

Random notes (things I wish were documented somewhere but aren't)

If you want to receive RTM_NEWNEIGH messages, /proc/sys/net/ipv{4,6}/neigh/*/app_solicit (the sysctl that sets the kernel's app_probes parameter) needs to be non-zero.

I don't know why. They might have been drunk at the time -- PerryLorier
The reason why is that much of the system parameters are moving this way and they were just too lazy to convert other ones too I suspect -- IanMcDonald

URL for this article: http://www.wlug.org.nz/LinuxNetlinkSockets


Linux Netlink Socket Example

幽灵狼 — Sat, 17 Dec 2005 09:37:00 GMT

This is a sample program that uses a netlink socket to listen to route change events and prints out some rudimentary information about them. It's very simple and boring, but hopefully useful.

This being a wiki, I also expect everyone to hack on this code and make it nicer. It's pretty hideous, but I want to get on with my real program now. So if you're reading this page, your mission (should you choose to accept it) is to clean up the code below a little bit (it doesn't need to be much).

See LinuxNetlinkSockets

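#include <sys/socket.h>

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#include <linux/netlink.h>
#include <linux/rtnetlink.h>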

#if 0
/* Neighbour (ARP) events. The kernel must be compiled with CONFIG_ARPD.
 * The protocol is still NETLINK_ROUTE, not NETLINK_ARPD, since the
 * kernel arp code {re,ab}uses rtnl (NETLINK_ROUTE). */
#define MYPROTO NETLINK_ROUTE
#define MYMGRP RTMGRP_NEIGH
#else
#define MYPROTO NETLINK_ROUTE
#define MYMGRP RTMGRP_IPV4_ROUTE
#endif

struct msgnames_t {
        int id;
        char *msg;
} typenames[] = {
#define MSG(x) { x, #x }
        MSG(RTM_NEWROUTE),
        MSG(RTM_DELROUTE),
        MSG(RTM_GETROUTE),
#undef MSG
        {0,0}
};

char *lookup_name(struct msgnames_t *db,int id)
{
        static char name[512];
        struct msgnames_t *msgnamesiter;
        for(msgnamesiter=db;msgnamesiter->msg;++msgnamesiter) {
                if (msgnamesiter->id == id)
                        break;
        }
        if (msgnamesiter->msg) {
                return msgnamesiter->msg;
        }
        snprintf(name,sizeof(name),"#%i",id);
        return name;
}

int open_netlink()
{
        int sock = socket(AF_NETLINK,SOCK_RAW,MYPROTO);
        struct sockaddr_nl addr;

        memset((void *)&addr, 0, sizeof(addr));

        if (sock<0)
                return sock;
        addr.nl_family = AF_NETLINK;
        addr.nl_pid = getpid();
        addr.nl_groups = MYMGRP;
        if (bind(sock,(struct sockaddr *)&addr,sizeof(addr))<0)
                return -1;
        return sock;
}

int read_event(int sock)
{
        struct sockaddr_nl nladdr;
        struct msghdr msg;
        struct iovec iov[2];
        struct nlmsghdr nlh;
        char buffer[65536];
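        int ret;

        /* Zero msg so msg_control and msg_flags do not carry stack garbage. */
        memset(&msg, 0, sizeof(msg));
        iov[0].iov_base = (void *)&nlh;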
        iov[0].iov_len = sizeof(nlh);
        iov[1].iov_base = (void *)buffer;
        iov[1].iov_len = sizeof(buffer);
        msg.msg_name = (void *)&(nladdr);
        msg.msg_namelen = sizeof(nladdr);
        msg.msg_iov = iov;
        msg.msg_iovlen = sizeof(iov)/sizeof(iov[0]);
        ret=recvmsg(sock, &msg, 0);
        if (ret<0) {
                return ret;
        }
        printf("Type: %i (%s)\n",(nlh.nlmsg_type),lookup_name(typenames,nlh.nlmsg_type));
        printf("Flag:");
#define FLAG(x) if (nlh.nlmsg_flags & x) printf(" %s",#x)
        FLAG(NLM_F_REQUEST);
        FLAG(NLM_F_MULTI);
        FLAG(NLM_F_ACK);
        FLAG(NLM_F_ECHO);
        FLAG(NLM_F_REPLACE);
        FLAG(NLM_F_EXCL);
        FLAG(NLM_F_CREATE);
        FLAG(NLM_F_APPEND);
#undef FLAG
        printf("\n");
        printf("Seq : %u\n",nlh.nlmsg_seq);
        printf("Pid : %u\n",nlh.nlmsg_pid);
        printf("\n");
        return 0;
}

int main(int argc, char *argv[])
{
        int nls = open_netlink();
        if (nls<0) {
                err(1,"netlink");
        }
        while (1)
                read_event(nls);
        return 0;
}
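The program above reads only the first header of each datagram, but one recvmsg() can return several netlink messages packed back to back. The sketch below shows one way to walk such a buffer with the NLMSG_OK and NLMSG_NEXT macros from linux/netlink.h. The function name count_netlink_messages is invented for this example, and it only counts messages rather than decoding them.

```c
#include <stddef.h>
#include <linux/netlink.h>

/* Count the netlink messages packed into buf[0..len).
 * buf must be suitably aligned for struct nlmsghdr. */
int count_netlink_messages(void *buf, size_t len)
{
        struct nlmsghdr *nlh = buf;
        int remaining = (int)len;
        int count = 0;

        /* NLMSG_OK checks that a whole message fits in what is left;
         * NLMSG_NEXT advances nlh and decreases remaining. */
        while (NLMSG_OK(nlh, remaining)) {
                ++count;
                nlh = NLMSG_NEXT(nlh, remaining);
        }
        return count;
}
```

In read_event() the same loop could replace the single printf block, handling each nlh in turn.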


Netlink Socket for Linux

幽灵狼 — Sat, 17 Dec 2005 09:31:00 GMT

Kernel Korner - Why and How to Use Netlink Socket

By Kevin He on Wed, 2005-01-05 02:00.

Use this bidirectional, versatile method to pass data between kernel and user space.

Due to the complexity of developing and maintaining the kernel, only the most essential and performance-critical code is placed in the kernel. Other things, such as GUI, management and control code, typically are programmed as user-space applications. This practice of splitting the implementation of certain features between kernel and user space is quite common in Linux. Now the question is how can kernel code and user-space code communicate with each other?

The answer is the various IPC methods that exist between kernel and user space, such as system call, ioctl, proc filesystem or netlink socket. This article discusses netlink socket and reveals its advantages as a network feature-friendly IPC.

Introduction

Netlink socket is a special IPC used for transferring information between kernel and user-space processes. It provides a full-duplex communication link between the two by way of standard socket APIs for user-space processes and a special kernel API for kernel modules. Netlink socket uses the address family AF_NETLINK, as compared to AF_INET used by TCP/IP socket. Each netlink socket feature defines its own protocol type in the kernel header file include/linux/netlink.h.

The following is a subset of features and their protocol types currently supported by the netlink socket:

NETLINK_ROUTE: communication channel between user-space routing dæmons, such as BGP, OSPF, RIP and kernel packet forwarding module. User-space routing dæmons update the kernel routing table through this netlink protocol type.
NETLINK_FIREWALL: receives packets sent by the IPv4 firewall code.
NETLINK_NFLOG: communication channel for the user-space iptable management tool and kernel-space Netfilter module.
NETLINK_ARPD: for managing the arp table from user space.

Why do the above features use netlink instead of system calls, ioctls or proc filesystems for communication between user and kernel worlds? It is a nontrivial task to add system calls, ioctls or proc files for new features; we risk polluting the kernel and damaging the stability of the system. Netlink socket is simple, though: only a constant, the protocol type, needs to be added to netlink.h. Then, the kernel module and application can talk using socket-style APIs immediately.

Netlink is asynchronous because, as with any other socket API, it provides a socket queue to smooth the burst of messages. The system call for sending a netlink message queues the message to the receiver's netlink queue and then invokes the receiver's reception handler. The receiver, within the reception handler's context, can decide whether to process the message immediately or leave the message in the queue and process it later in a different context. Unlike netlink, system calls require synchronous processing. Therefore, if we use a system call to pass a message from user space to the kernel, the kernel scheduling granularity may be affected if the time to process that message is long.

The code implementing a system call in the kernel is linked statically to the kernel in compilation time; thus, it is not appropriate to include system call code in a loadable module, which is the case for most device drivers. With netlink socket, no compilation time dependency exists between the netlink core of Linux kernel and the netlink application living in loadable kernel modules.

Netlink socket supports multicast, which is another benefit over system calls, ioctls and proc. One process can multicast a message to a netlink group address, and any number of other processes can listen to that group address. This provides a near-perfect mechanism for event distribution from kernel to user space.

System call and ioctl are simplex IPCs in the sense that a session for these IPCs can be initiated only by user-space applications. But, what if a kernel module has an urgent message for a user-space application? There is no way of doing that directly using these IPCs. Normally, applications periodically need to poll the kernel to get the state changes, although intensive polling is expensive. Netlink solves this problem gracefully by allowing the kernel to initiate sessions too. We call it the duplex characteristic of the netlink socket.

Finally, netlink socket provides a BSD socket-style API that is well understood by the software development community. Therefore, training costs are less as compared to using the rather cryptic system call APIs and ioctls.

Relating to the BSD Routing Socket

In BSD TCP/IP stack implementation, there is a special socket called the routing socket. It has an address family of AF_ROUTE, a protocol family of PF_ROUTE and a socket type of SOCK_RAW. The routing socket in BSD is used by processes to add or delete routes in the kernel routing table.

In Linux, the equivalent function of the routing socket is provided by the netlink socket protocol type NETLINK_ROUTE. Netlink socket provides a functionality superset of BSD's routing socket.

Netlink Socket APIs

The standard socket APIs (socket(), sendmsg(), recvmsg() and close()) can be used by user-space applications to access netlink socket. Consult the man pages for detailed definitions of these APIs. Here, we discuss how to choose parameters for these APIs only in the context of netlink socket. The APIs should be familiar to anyone who has written an ordinary network application using TCP/IP sockets.

To create a socket with socket(), enter:

int socket(int domain, int type, int protocol)

The socket domain (address family) is AF_NETLINK, and the type of socket is either SOCK_RAW or SOCK_DGRAM, because netlink is a datagram-oriented service.

The protocol (protocol type) selects for which netlink feature the socket is used. The following are some predefined netlink protocol types: NETLINK_ROUTE, NETLINK_FIREWALL, NETLINK_ARPD, NETLINK_ROUTE6 and NETLINK_IP6_FW. You also can add your own netlink protocol type easily.

Up to 32 multicast groups can be defined for each netlink protocol type. Each multicast group is represented by a bit mask, 1<<i, where 0<=i<=31.
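To make the group numbering concrete, here is a small sketch that turns a group index into its bit mask. The helper name nl_group_mask is an assumption for illustration and is not part of any netlink header.

```c
#include <stdint.h>

/* Bit mask for multicast group i (0 <= i <= 31).
 * Returns 0 when i is out of range. */
uint32_t nl_group_mask(unsigned int i)
{
        return i < 32 ? (uint32_t)1 << i : 0;
}
```

A process interested in groups 0 and 4 would combine the masks as nl_group_mask(0) | nl_group_mask(4), giving 0x11.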

bind()

As for a TCP/IP socket, the netlink bind() API associates a local (source) socket address with the opened socket. The netlink address structure is as follows:

struct sockaddr_nl
{
  sa_family_t    nl_family;  /* AF_NETLINK   */
  unsigned short nl_pad;     /* zero         */
  __u32          nl_pid;     /* process pid */
  __u32          nl_groups;  /* mcast groups mask */
} nladdr;

When used with bind(), the nl_pid field of the sockaddr_nl can be filled with the calling process' own pid. The nl_pid serves here as the local address of this netlink socket. The application is responsible for picking a unique 32-bit integer to fill in nl_pid:

NL_PID Formula 1:  nl_pid = getpid();

Formula 1 uses the process ID of the application as nl_pid, which is a natural choice if, for the given netlink protocol type, only one netlink socket is needed for the process.

In scenarios where different threads of the same process want to have different netlink sockets opened under the same netlink protocol, Formula 2 can be used to generate the nl_pid:


NL_PID Formula 2: pthread_self() << 16 | getpid();

In this way, different pthreads of the same process each can have their own netlink socket for the same netlink protocol type. In fact, even within a single pthread it's possible to create multiple netlink sockets for the same protocol type. Developers need to be more creative, however, in generating a unique nl_pid, and we don't consider this to be a normal-use case.
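Formula 2 can be written as a small pure function, sketched below. The name nl_pid_combine is hypothetical. In a real program the first argument would come from something like (uint32_t)(uintptr_t)pthread_self(), which is a non-portable cast, and the second from getpid(). Because a pid can be wider than 16 bits, the result is not guaranteed to be unique, so this is an illustration of the formula rather than a robust allocator.

```c
#include <stdint.h>

/* NL_PID Formula 2: (thread id << 16) | process id.
 * With thread_id == 0 this reduces to Formula 1 (the plain pid). */
uint32_t nl_pid_combine(uint32_t thread_id, uint32_t process_id)
{
        return (thread_id << 16) | process_id;
}
```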

If the application wants to receive netlink messages of the protocol type that are destined for certain multicast groups, the bitmasks of all the interested multicast groups should be ORed together to form the nl_groups field of sockaddr_nl. Otherwise, nl_groups should be zeroed out so the application receives only the unicast netlink message of the protocol type destined for the application. After filling in the nladdr, do the bind as follows:


bind(fd, (struct sockaddr*)&nladdr, sizeof(nladdr));
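Putting socket() and bind() together, a minimal sketch might look like the following. The helper name nl_open_bound is invented here; the struct fields and calls are the ones described in this section.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Open a netlink socket of the given protocol type and bind it to the
 * local address (pid, groups). Returns the descriptor, or -1 on failure. */
int nl_open_bound(int protocol, uint32_t pid, uint32_t groups)
{
        struct sockaddr_nl nladdr;
        int fd = socket(AF_NETLINK, SOCK_RAW, protocol);

        if (fd < 0)
                return -1;
        memset(&nladdr, 0, sizeof(nladdr));   /* also clears nl_pad */
        nladdr.nl_family = AF_NETLINK;
        nladdr.nl_pid = pid;
        nladdr.nl_groups = groups;
        if (bind(fd, (struct sockaddr *)&nladdr, sizeof(nladdr)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}
```

With Formula 1 and no multicast groups, the call would be nl_open_bound(NETLINK_ROUTE, getpid(), 0).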

Sending a Netlink Message

In order to send a netlink message to the kernel or other user-space processes, another struct sockaddr_nl nladdr needs to be supplied as the destination address, the same as sending a UDP packet with sendmsg(). If the message is destined for the kernel, both nl_pid and nl_groups should be supplied with 0.

If the message is a unicast message destined for another process, the nl_pid is the other process' pid and nl_groups is 0, assuming NL_PID Formula 1 is used in the system.

If the message is a multicast message destined for one or multiple multicast groups, the bitmasks of all the destination multicast groups should be ORed together to form the nl_groups field. We then can supply the netlink address to the struct msghdr msg for the sendmsg() API, as follows:


struct msghdr msg;
msg.msg_name = (void *)&(nladdr);
msg.msg_namelen = sizeof(nladdr);

The netlink socket requires its own message header as well. This is for providing a common ground for netlink messages of all protocol types.

Because the Linux kernel netlink core assumes the existence of the following header in each netlink message, an application must supply this header in each netlink message it sends:


struct nlmsghdr
{
  __u32 nlmsg_len;   /* Length of message */
  __u16 nlmsg_type;  /* Message type*/
  __u16 nlmsg_flags; /* Additional flags */
  __u32 nlmsg_seq;   /* Sequence number */
  __u32 nlmsg_pid;   /* Sending process PID */
};

nlmsg_len has to be completed with the total length of the netlink message, including the header, and is required by netlink core. nlmsg_type can be used by applications and is an opaque value to netlink core. nlmsg_flags is used to give additional control to a message; it is read and updated by netlink core. nlmsg_seq and nlmsg_pid are used by applications to track the message, and they are opaque to netlink core as well.
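The sketch below shows one way to allocate and fill such a message in user space. The function name nl_build_message is made up for this example. It sets nlmsg_len with NLMSG_LENGTH (header plus unpadded payload) and allocates NLMSG_SPACE bytes so the padding is included in the buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <linux/netlink.h>

/* Allocate a netlink message carrying payload_len bytes of payload.
 * The caller releases it with free(). Returns NULL if allocation fails. */
struct nlmsghdr *nl_build_message(uint16_t type, uint32_t seq, uint32_t pid,
                                  const void *payload, size_t payload_len)
{
        struct nlmsghdr *nlh = calloc(1, NLMSG_SPACE(payload_len));

        if (nlh == NULL)
                return NULL;
        nlh->nlmsg_len = NLMSG_LENGTH(payload_len); /* header + payload */
        nlh->nlmsg_type = type;
        nlh->nlmsg_flags = 0;
        nlh->nlmsg_seq = seq;
        nlh->nlmsg_pid = pid;
        if (payload_len > 0)
                memcpy(NLMSG_DATA(nlh), payload, payload_len);
        return nlh;
}
```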

A netlink message thus consists of nlmsghdr and the message payload. Once the message has been assembled in a buffer pointed to by the nlh pointer, we attach that buffer to the struct msghdr msg through an iovec:


struct iovec iov;
iov.iov_base = (void *)nlh;
iov.iov_len = nlh->nlmsg_len;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;

After the above steps, a call to sendmsg() kicks out the netlink message:


sendmsg(fd, &msg, 0);
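The sending steps can be collected into one function, sketched here under the assumption that the message in nlh is already complete. The name nl_send_to_kernel is hypothetical. Unlike the fragments above, it zeroes the whole msghdr first so the control and flag fields hold no stale data.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

/* Send one already-built netlink message to the kernel.
 * Returns the number of bytes sent, or -1 with errno set. */
ssize_t nl_send_to_kernel(int fd, struct nlmsghdr *nlh)
{
        struct sockaddr_nl dest;
        struct iovec iov;
        struct msghdr msg;

        memset(&dest, 0, sizeof(dest));
        dest.nl_family = AF_NETLINK;  /* nl_pid 0 and nl_groups 0: the kernel */

        iov.iov_base = nlh;
        iov.iov_len = nlh->nlmsg_len;

        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &dest;
        msg.msg_namelen = sizeof(dest);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        return sendmsg(fd, &msg, 0);
}
```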

Receiving Netlink Messages

A receiving application needs to allocate a buffer large enough to hold netlink message headers and message payloads. It then fills the struct msghdr msg as shown below and uses the standard recvmsg() to receive the netlink message, assuming the buffer is pointed to by nlh:


struct sockaddr_nl nladdr;
struct msghdr msg;
struct iovec iov;
iov.iov_base = (void *)nlh;
iov.iov_len = MAX_NL_MSG_LEN;
msg.msg_name = (void *)&(nladdr);
msg.msg_namelen = sizeof(nladdr);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
recvmsg(fd, &msg, 0);

After the message has been received correctly, the nlh should point to the header of the just-received netlink message. nladdr should hold the address information of the received message, which consists of the sender's pid (0 for the kernel) and the multicast groups to which the message was sent. And, the macro NLMSG_DATA(nlh), defined in netlink.h, returns a pointer to the payload of the netlink message. A call to close(fd) closes the netlink socket identified by file descriptor fd.
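Before trusting NLMSG_DATA(nlh), it is worth checking that the header is consistent with the number of bytes recvmsg() returned. A minimal sketch, with the invented name nl_payload, using the NLMSG_OK and NLMSG_PAYLOAD macros from linux/netlink.h:

```c
#include <stddef.h>
#include <linux/netlink.h>

/* Return a pointer to the payload of a received message and store its
 * length in *len. received is the byte count returned by recvmsg().
 * Returns NULL if the header does not fit in the received bytes. */
void *nl_payload(struct nlmsghdr *nlh, int received, size_t *len)
{
        if (!NLMSG_OK(nlh, received))
                return NULL;
        *len = NLMSG_PAYLOAD(nlh, 0);  /* nlmsg_len minus the aligned header */
        return NLMSG_DATA(nlh);
}
```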

Kernel-Space Netlink APIs

The kernel-space netlink API is supported by the netlink core in the kernel, net/netlink/af_netlink.c. From the kernel side, the API is different from the user-space API. The API can be used by kernel modules to access the netlink socket and to communicate with user-space applications. Unless you leverage the existing netlink socket protocol types, you need to add your own protocol type by adding a constant to netlink.h. For example, we can add a netlink protocol type for testing purposes by inserting this line into netlink.h:

#define NETLINK_TEST  17

Afterward, you can reference the added protocol type anywhere in the Linux kernel.

In user space, we call socket() to create a netlink socket, but in kernel space, we call the following API:


struct sock *
netlink_kernel_create(int unit, 
           void (*input)(struct sock *sk, int len));

The parameter unit is, in fact, the netlink protocol type, such as NETLINK_TEST. The function pointer, input, is a callback function invoked when a message arrives at this netlink socket.

After the kernel has created a netlink socket for protocol NETLINK_TEST, whenever user space sends a netlink message of the NETLINK_TEST protocol type to the kernel, the callback function, input(), which is registered by netlink_kernel_create(), is invoked. The following is an example implementation of the callback function input:


void input (struct sock *sk, int len)
{
  struct sk_buff *skb;
  struct nlmsghdr *nlh = NULL;
  u8 *payload = NULL;

  while ((skb = skb_dequeue(&sk->receive_queue))
         != NULL) {
    /* skb->data points at the netlink message */
    nlh = (struct nlmsghdr *)skb->data;
    payload = NLMSG_DATA(nlh);
    /* process netlink message with header pointed by
     * nlh and payload pointed by payload
     */
    kfree_skb(skb);  /* release the buffer once processed */
  }
}

This input() function is called in the context of the sendmsg() system call invoked by the sending process. It is okay to process the netlink message inside input() if it's fast. When the processing of netlink message takes a long time, however, we want to keep it out of input() to avoid blocking other system calls from entering the kernel. Instead, we can use a dedicated kernel thread to perform the following steps indefinitely. Use skb = skb_recv_datagram(nl_sk) where nl_sk is the netlink socket returned by netlink_kernel_create(). Then, process the netlink message pointed to by skb->data.

This kernel thread sleeps when there is no netlink message in nl_sk. Thus, inside the callback function input(), we need to wake up only the sleeping kernel thread, like this:


void input (struct sock *sk, int len)
{
  wake_up_interruptible(sk->sleep);
}

This is a more scalable communication model between user space and kernel. It also improves the granularity of context switches.

Sending Netlink Messages from the Kernel

Just as in user space, the source netlink address and destination netlink address need to be set when sending a netlink message. Assuming the socket buffer holding the netlink message to be sent is struct sk_buff *skb, the local address can be set with:


NETLINK_CB(skb).groups = local_groups;
NETLINK_CB(skb).pid = 0;   /* from kernel */

The destination address can be set like this:


NETLINK_CB(skb).dst_groups = dst_groups;
NETLINK_CB(skb).dst_pid = dst_pid;

Such information is not stored in skb->data. Rather, it is stored in the netlink control block of the socket buffer, skb.

To send a unicast message, use:


int 
netlink_unicast(struct sock *ssk, struct sk_buff 
                *skb, u32 pid, int nonblock);

where ssk is the netlink socket returned by netlink_kernel_create(), skb->data points to the netlink message to be sent and pid is the receiving application's pid, assuming NL_PID Formula 1 is used. nonblock indicates whether the API should block when the receiving buffer is unavailable or immediately return a failure.

You also can send a multicast message. The following API delivers a netlink message to both the process specified by pid and the multicast groups specified by group:


void 
netlink_broadcast(struct sock *ssk, struct sk_buff 
         *skb, u32 pid, u32 group, int allocation);

group is the ORed bitmasks of all the receiving multicast groups. allocation is the kernel memory allocation type. Typically, GFP_ATOMIC is used if from interrupt context; GFP_KERNEL if otherwise. This is due to the fact that the API may need to allocate one or many socket buffers to clone the multicast message.

Closing a Netlink Socket from the Kernel

Given the struct sock *nl_sk returned by netlink_kernel_create(), we can call the following kernel API to close the netlink socket in the kernel:


sock_release(nl_sk->socket);

So far, we have shown only the bare minimum code framework to illustrate the concept of netlink programming. We now will use our NETLINK_TEST netlink protocol type and assume it already has been added to the kernel header file. The kernel module code listed here contains only the netlink-relevant part, so it should be inserted into a complete kernel module skeleton, which you can find from many other reference sources.

Unicast Communication between Kernel and Application

In this example, a user-space process sends a netlink message to the kernel module, and the kernel module echoes the message back to the sending process. Here is the user-space code:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

#ifndef NETLINK_TEST
#define NETLINK_TEST 17   /* must match the constant added to netlink.h */
#endif
#define MAX_PAYLOAD 1024  /* maximum payload size*/

struct sockaddr_nl src_addr, dest_addr;
struct nlmsghdr *nlh = NULL;
struct iovec iov;
struct msghdr msg;        /* file scope, so zero-initialized */
int sock_fd;

int main(void) {
 sock_fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_TEST);
 if (sock_fd < 0) {
  perror("socket");
  return 1;
 }
 memset(&src_addr, 0, sizeof(src_addr));
 src_addr.nl_family = AF_NETLINK;
 src_addr.nl_pid = getpid();  /* self pid */
 src_addr.nl_groups = 0;  /* not in mcast groups */
 bind(sock_fd, (struct sockaddr*)&src_addr,
      sizeof(src_addr));
 memset(&dest_addr, 0, sizeof(dest_addr));
 dest_addr.nl_family = AF_NETLINK;
 dest_addr.nl_pid = 0;   /* For Linux Kernel */
 dest_addr.nl_groups = 0; /* unicast */
 nlh=(struct nlmsghdr *)malloc(
                         NLMSG_SPACE(MAX_PAYLOAD));
 memset(nlh, 0, NLMSG_SPACE(MAX_PAYLOAD));
 /* Fill the netlink message header */
 nlh->nlmsg_len = NLMSG_SPACE(MAX_PAYLOAD);
 nlh->nlmsg_pid = getpid();  /* self pid */
 nlh->nlmsg_flags = 0;
 /* Fill in the netlink message payload */
 strcpy(NLMSG_DATA(nlh), "Hello you!");
 iov.iov_base = (void *)nlh;
 iov.iov_len = nlh->nlmsg_len;
 msg.msg_name = (void *)&dest_addr;
 msg.msg_namelen = sizeof(dest_addr);
 msg.msg_iov = &iov;
 msg.msg_iovlen = 1;
 sendmsg(sock_fd, &msg, 0);
 /* Read message from kernel */
 memset(nlh, 0, NLMSG_SPACE(MAX_PAYLOAD));
 recvmsg(sock_fd, &msg, 0);
 printf(" Received message payload: %s\n",
        (char *)NLMSG_DATA(nlh));

 /* Close Netlink Socket */
 free(nlh);
 close(sock_fd);
 return 0;
}

And, here is the kernel code:


struct sock *nl_sk = NULL;
void nl_data_ready (struct sock *sk, int len)
{
  wake_up_interruptible(sk->sleep);
}
void netlink_test() {
 struct sk_buff *skb = NULL;
 struct nlmsghdr *nlh = NULL;
 int err;
 u32 pid;     
 nl_sk = netlink_kernel_create(NETLINK_TEST, 
                                   nl_data_ready);
 /* wait for message coming down from user-space */
 skb = skb_recv_datagram(nl_sk, 0, 0, &err);
 nlh = (struct nlmsghdr *)skb->data;
 printk("%s: received netlink message payload:%s\n", 
        __FUNCTION__, NLMSG_DATA(nlh));
 pid = nlh->nlmsg_pid; /*pid of sending process */
 NETLINK_CB(skb).groups = 0; /* not in mcast group */
 NETLINK_CB(skb).pid = 0;      /* from kernel */
 NETLINK_CB(skb).dst_pid = pid;
 NETLINK_CB(skb).dst_groups = 0;  /* unicast */
 netlink_unicast(nl_sk, skb, pid, MSG_DONTWAIT);
 sock_release(nl_sk->socket);
}

After loading the kernel module that executes the kernel code above, when we run the user-space executable, we should see the following dumped from the user-space program:

Received message payload: Hello you!

And, the following message should appear in the output of dmesg:

netlink_test: received netlink message payload:Hello you!

Multicast Communication between Kernel and Applications

In this example, two user-space applications are listening to the same netlink multicast group. The kernel module pops up a message through netlink socket to the multicast group, and all the applications receive it. Here is the user-space code:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

#ifndef NETLINK_TEST
#define NETLINK_TEST 17   /* must match the constant added to netlink.h */
#endif
#define MAX_PAYLOAD 1024  /* maximum payload size*/

struct sockaddr_nl src_addr, dest_addr;
struct nlmsghdr *nlh = NULL;
struct iovec iov;
struct msghdr msg;        /* file scope, so zero-initialized */
int sock_fd;

int main(void) {
 sock_fd=socket(PF_NETLINK, SOCK_RAW, NETLINK_TEST);
 if (sock_fd < 0) {
  perror("socket");
  return 1;
 }
 memset(&src_addr, 0, sizeof(src_addr));
 src_addr.nl_family = AF_NETLINK;
 src_addr.nl_pid = getpid();  /* self pid */
 /* interested in group 1<<0 */
 src_addr.nl_groups = 1;
 bind(sock_fd, (struct sockaddr*)&src_addr,
      sizeof(src_addr));
 memset(&dest_addr, 0, sizeof(dest_addr));
 nlh = (struct nlmsghdr *)malloc(
                          NLMSG_SPACE(MAX_PAYLOAD));
 memset(nlh, 0, NLMSG_SPACE(MAX_PAYLOAD));

 iov.iov_base = (void *)nlh;
 iov.iov_len = NLMSG_SPACE(MAX_PAYLOAD);
 msg.msg_name = (void *)&dest_addr;
 msg.msg_namelen = sizeof(dest_addr);
 msg.msg_iov = &iov;
 msg.msg_iovlen = 1;
 printf("Waiting for message from kernel\n");
 /* Read message from kernel */
 recvmsg(sock_fd, &msg, 0);
 printf(" Received message payload: %s\n",
        (char *)NLMSG_DATA(nlh));
 free(nlh);
 close(sock_fd);
 return 0;
}

And, here is the kernel code:


#define MAX_PAYLOAD 1024 
struct sock *nl_sk = NULL;
void netlink_test() {
 struct sk_buff *skb = NULL;
 struct nlmsghdr *nlh;
 int err;
 nl_sk = netlink_kernel_create(NETLINK_TEST, 
                               nl_data_ready);
 skb=alloc_skb(NLMSG_SPACE(MAX_PAYLOAD),GFP_KERNEL);
 /* reserve the data area so skb->len covers the message */
 skb_put(skb, NLMSG_SPACE(MAX_PAYLOAD));
 nlh = (struct nlmsghdr *)skb->data;
 nlh->nlmsg_len = NLMSG_SPACE(MAX_PAYLOAD);
 nlh->nlmsg_pid = 0;  /* from kernel */
 nlh->nlmsg_flags = 0;
 strcpy(NLMSG_DATA(nlh), "Greeting from kernel!");
 /* sender is in group 1<<0 */
 NETLINK_CB(skb).groups = 1;
 NETLINK_CB(skb).pid = 0;  /* from kernel */
 NETLINK_CB(skb).dst_pid = 0;  /* multicast */
 /* to mcast group 1<<0 */
 NETLINK_CB(skb).dst_groups = 1;
 /*multicast the message to all listening processes*/
 netlink_broadcast(nl_sk, skb, 0, 1, GFP_KERNEL);
 sock_release(nl_sk->socket);
}

Assuming the user-space code is compiled into the executable nl_recv, we can run two instances of nl_recv:


./nl_recv &
Waiting for message from kernel
./nl_recv &
Waiting for message from kernel

Then, after we load the kernel module that executes the kernel-space code, both instances of nl_recv should receive the following message:

Received message payload: Greeting from kernel!
Received message payload: Greeting from kernel!

Conclusion

Netlink socket is a flexible interface for communication between user-space applications and kernel modules. It provides an easy-to-use socket API to both applications and the kernel. It provides advanced communication features, such as full-duplex, buffered I/O, multicast and asynchronous communication, which are absent in other kernel/user-space IPCs.

Kevin Kaichuan He (hek_u5@yahoo.com) is a principal software engineer at Solustek Corp. He currently is working on embedded system, device driver and networking protocols projects. His previous work experience includes senior software engineer at Cisco Systems and research assistant at CS, Purdue University. In his spare time, he enjoys digital photography, PS2 games and literature.

The URL of this article: http://www.linuxjournal.com/article/7356


I/O in FreeBSD

幽灵狼 — Wed, 14 Dec 2005 03:19:00 GMT

buf(9) manual


Linux Internationalization, Localization, and Chinese Enablement

幽灵狼 — Tue, 15 Nov 2005 03:39:00 GMT

Author: 于明俭

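1 Internationalization, Localization, and Chinese Enablement

The concepts of internationalization, localization, and multilingualization

Generally speaking, "internationalization" is the process of rewriting a computer system or application that was originally designed for English so that it supports multiple languages and cultural conventions at once. In the early days of software, programming languages, compilers, and development tools generally supported only English. To serve a wider range of languages and cultural conventions, software has to support multilingual extension in its design structure and mechanisms; this process is called internationalization. Internationalization only makes the use of multiple languages possible at the design level.

"Localization" is the process of converting a computer system or application so that it uses, and is compatible with, one particular language. Turning software designed for English into software that supports Chinese is one kind of localization. It mainly involves translating text and interface messages, redesigning icons, and so on.

Languages and cultural conventions differ greatly from region to region. The language environment of a particular region is called a "locale". It covers not only the language and the currency unit but also number formats and date and time formats. Internationalized software carries a "locale" parameter, and setting that parameter selects the language environment used in a given region.

The part of internationalization that deals only with language is called "multilingualization". Multilingualized software can, for example, handle English, French, Chinese, Japanese, Korean, Arabic, and other languages at the same time.

In English, internationalization is abbreviated as I18N: only the first and last letters are kept, and 18 letters lie between them. Likewise, localization is abbreviated as L10N and multilingualization as M17N.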
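The abbreviation rule (first letter, count of the letters in between, last letter) is simple arithmetic and can be checked with a few lines of C. The function name numeronym is chosen here for illustration only.

```c
#include <stdio.h>
#include <string.h>

/* Write the abbreviation of word into out (out_size bytes available):
 * first letter, number of letters in between, last letter.
 * Words shorter than 3 letters are copied unchanged. Returns out. */
char *numeronym(const char *word, char *out, size_t out_size)
{
        size_t n = strlen(word);

        if (n < 3)
                snprintf(out, out_size, "%s", word);
        else
                snprintf(out, out_size, "%c%zu%c", word[0], n - 2, word[n - 1]);
        return out;
}
```

For "internationalization" (20 letters) this yields i18n, for "localization" l10n, and for "multilingualization" m17n.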

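Today the Internet connects computers all over the world, and sharing information and technology is both an inevitable trend and a necessity. It is therefore increasingly important that computer systems in different places can communicate with one another. As Linux spreads to the desktop, Linux software also needs to be internationalized and localized.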

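Chinese enablement

"Chinese enablement" (中文化) is a vague concept. On Linux it covers both internationalizing software or a system and localizing it. In other words, it is not simply a matter of localizing software; more importantly, because so little Linux software supports Chinese directly, Chinese enablement has to start with internationalization.

For historical reasons, Chinese in use today is divided into Simplified and Traditional Chinese, and the two use different encodings. Software that supports Chinese should support both, which places higher demands on internationalization.

1999 was the most important year in the development and spread of Linux in China, and many companies producing Chinese Linux distributions emerged during it. Their technology all took a shortcut to Chinese enablement: the Chinese platform (中文平台). Although all of them were Chinese platforms, the implementation techniques differed from one to the next, which showed how attractive the Chinese-platform approach was. In the short term, Chinese platforms played a huge role, speeding up Chinese enablement and promoting the spread of Linux in China.

The main technical feature of a Chinese platform is that Chinese can be displayed and entered without modifying Western-language applications (although this fails in some cases). Concretely, the platform modifies the low-level functions of the X system according to its own conventions. By the layer that is modified, there are two approaches: (1) Modifying the libX11.so library. This is a dynamic modification, also called the plug-in approach. It can be implemented either by modifying the X11 library directly or by using LD_PRELOAD to load a dynamic library that overrides it. (2) Modifying the X server, also called the embedded approach. It too has two implementations: modifying the X server directly, or setting up a virtual display (a partial proxy for the X transport protocol).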

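History and levels of X11 internationalization

The early X11R4 release contained only functions that supported single-byte and double-byte fonts, so it cannot be considered an internationalized library. Later, an X group called "mltalk" was formed and began to study the internationalization of the X Window System. Many X Window System vendors took part. Because the study of internationalization had only just begun, mltalk raised a basic question: what is internationalization of the X Window System? The answers varied. In fact, even today people disagree about the definition of internationalization, and the disagreement centers on how far software or a system must be internationalized before it counts as truly internationalized.

By level of internationalization, all of the following count as internationalization:

The language can be switched: a language can be set when the system starts
Software using different languages can run at the same time, and a language can be set when an application starts
Software using different languages can run at the same time, and an application's language can be switched dynamically
Software using different languages can run at the same time, and different languages can be used simultaneously within one application

On this basis, the goal of X11R5 was to create an application development platform on which software can adapt to the language environment without recompiling the source code. More precisely, the internationalization architecture is based on the standard C function setlocale. X11R5 established the following specifications:

A mechanism for switching languages
A language-independent output interface
A language-independent input interface
Internationalization of resource files
Internationalization of the X Toolkit (Xt)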

X11R6 解决了X11R5中存在的问题，主要的变化有，

定义了标准的输入协议
Locale数据格式定义
只采用了一种国际化工具的样本应用模块

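Internationalization standards bodies

The internationalization standards discussed here are those set by standards bodies and related organizations, and they are updated frequently over time. They cover character sets, encodings, font handling, printing, text rendering, user interfaces, input methods, data interchange, cultural conventions, and more.

Some of the organizations that define internationalization standards:

Li18nux (Linux I18n)
ANSI (American National Standards Institute)
POSIX (Portable Operating System Interface for Computer Environments)
ISO (International Organization for Standardization)
IEEE (Institute of Electrical and Electronics Engineers)
Unicode Consortium
Open Group (X Consortium and OSF)
X/Open and XPG

The significance of internationalization

Internationalization, and especially the standards defined for it, is essential for developing internationalized software today, and it is where software development is heading. Following the standards makes software more efficient to develop, debug, and port, lowers development cost, and makes software easier to use. Internationally, newly developed base libraries all support the internationalization standards, so applications built on them support the standards as a matter of course. At the same time, many Linux enthusiasts have reworked older, non-conforming software so that it conforms to some degree. Adopting software that follows the standards and retiring software that does not has become a trend.

Historically, Japanese companies took part in many of these standards, and more and more software supports Japanese. Porting Japanese software to Chinese is much easier than porting Western software directly, and sometimes needs no changes at all, which saves a lot of unnecessary work. In turn, Chinese software that conforms to the standards benefits Japanese and Korean software, and the effect snowballs. Vendors also favor the Japanese market within Asia, and Japanese software and operating systems on Unix/Linux generally conform to the internationalization standards, so compatibility with them is essential. The current standards do have shortcomings, especially in handling Chinese, which is a special case because GB and Big5 are two encodings that cannot coexist; Chinese developers should extend the existing foundation accordingly.

For Chinese Linux, following the internationalization standards is also inevitable. On Chinese Linux systems built on a Chinese platform, porting software has become a problem that must be solved, and the final solution is for everyone to follow the same standard; at present the internationalization standards are the only candidate. Given the current disorder among Chinese platforms on Linux, the standards are the necessary path from disorder to order.

Standardized internationalization also brings end users major benefits: simultaneous support for Simplified and Traditional Chinese, Chinese handled as double-byte characters, and Chinese input that makes fuller use of the standard input interface, including interactive features such as positioning of the input server's windows.

Another feature of internationalization is that it works at the application level, so it does not make the X Window System less stable.

References:

Linux I18N: http://www.li18nux.org/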

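2. Locale

The locale concept

The locale is the most basic mechanism for internationalization in ANSI C. For Chinese Linux, supporting a Chinese locale is the minimum requirement for supporting internationalization.

A locale is the language environment a program runs in. It consists of a language, a territory, and a codeset, written as language[_territory[.codeset]]. For example, the locale for Chinese with the GBK codeset is zh_CN.GBK. Chinese locales on Linux are still incomplete, and many locale-related C functions in glibc 2.1.x do not yet behave correctly. To install a Chinese GBK locale, you can use the following directly from TLC 6.0:

glibc-2.1.2 (includes the GBK module)
localedata-zh-0.07
/usr/X11R6/lib/X11/locale/zh_CN.GBK/XLC_LOCALE (the GBK locale for X)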
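
To make the language[_territory[.codeset]] format concrete, the following sketch splits a locale name into its three parts. It is an illustration only: struct locale_name and parse_locale_name() are hypothetical names invented here, and the sketch does not handle the optional @modifier suffix that some systems append.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three parts of a locale name. */
struct locale_name {
    char language[16];
    char territory[16];
    char codeset[32];
};

/* Copy n bytes from src into dst, truncating to fit and NUL-terminating. */
static void
copy_part(char *dst, size_t dstlen, const char *src, size_t n)
{
    if (n >= dstlen)
        n = dstlen - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Split "zh_CN.GBK" into "zh", "CN", "GBK"; missing parts become "". */
void
parse_locale_name(const char *name, struct locale_name *out)
{
    const char *end, *dot, *us;

    assert(name != NULL && out != NULL);
    end = name + strlen(name);
    dot = strchr(name, '.');
    if (dot == NULL)
        dot = end;
    us = memchr(name, '_', (size_t)(dot - name));
    if (us == NULL)
        us = dot;

    copy_part(out->language, sizeof out->language, name, (size_t)(us - name));
    if (us < dot)
        copy_part(out->territory, sizeof out->territory, us + 1,
            (size_t)(dot - us - 1));
    else
        out->territory[0] = '\0';
    if (dot < end)
        copy_part(out->codeset, sizeof out->codeset, dot + 1,
            (size_t)(end - dot - 1));
    else
        out->codeset[0] = '\0';
}
```

A name with no territory or codeset, such as "C", yields an empty string for the missing parts.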

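LC_COLLATE: comparison and sorting. Sorting matters for Chinese too, but the locale support in current glibc has some problems with Chinese. There are many ways to sort Chinese characters; ordering by pronunciation (Hanyu Pinyin) or by stroke count is the most widely accepted.
LC_CTYPE: character classification.
LC_MONETARY: currency units.
LC_NUMERIC: number display formats. The following shows how currency symbols and number formats differ between countries:

Mainland China: 1,234.56RMB
United States: $1,234.56
Germany: 1.234,56DM

LC_TIME: times and dates. Time can be written in 12-hour or 24-hour form, and hours and minutes can be separated by a period or a colon. Some date and time formats under different locales:

China: 14点20分 2000年三月十四号
United Kingdom: 02:20pm 14/03/2000
United States: 02:20pm 03/14/2000
Finland: 14.20 14.03.2000

LC_MESSAGES: internationalized messages, mainly prompts, error messages, status messages, titles, labels, buttons, and menus.

setlocale(LC_ALL, "");

If the call fails, the function returns NULL, and the program should fall back to setlocale(LC_ALL, "C").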
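
The fallback just described might be wrapped up as follows. This is a minimal sketch; init_locale() is a hypothetical helper name, and only the standard setlocale() call is assumed.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Select the locale named by the environment; if that locale is not
 * available, fall back to the portable "C" locale.
 * Returns the name of the locale now in effect.
 */
const char *
init_locale(void)
{
    const char *name;

    name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    assert(name != NULL);   /* the "C" locale always exists */
    return name;
}
```

Calling it once at the top of main(), before any locale-dependent function, is the usual place.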

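Using locales in X

X client programs use locales the same way standard C functions do. In addition, Xlib defines two more functions, one to check whether X supports the locale and one to set locale modifiers (XModifiers). The basic steps for using a locale with libX11 are:

setlocale(): sets the current locale.
XSupportsLocale(): checks whether X supports the locale currently set.
XSetLocaleModifiers(): specifies a list of locale modifiers. Its argument has the form @category=value. The only category currently available is "im", the name of the input server. If the argument is empty, the value is taken from the XMODIFIERS environment variable. For example, if the environment variable is set with:

% setenv XMODIFIERS @im=Chinput (csh) or
% export XMODIFIERS=@im=Chinput (bash)

then the client will look for the input server Chinput, where "Chinput" is the name the input server registered.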

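Differences in cultural conventions

The following points come up often during internationalization and localization and deserve attention. When developing internationalized software, pay close attention to the culture and habits of each region and build software that works everywhere; when localizing, choose the conventions that match the target region.

Special information such as names and addresses

Universality of icons

Use of sound

Use of color

Paper sizes

Keyboard differences

Political factors

References:

Locales on Linux

GBK Locale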

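3. Internationalization of the X Window System

Internationalization on the X Window System, and Chinese support in particular, shows up in three areas: display, input, and printing.

Internationalization of display

Character sets and encodings

The character sets most often used on Linux are the ISO 8859 series, a family of multilingual single-byte coded character sets:

Character set	Languages covered
ISO 8859-1 (Latin1)	Latin-1; covers most Western European languages, e.g. French (fr), Spanish (es), Catalan (ca), Basque (eu), Portuguese (pt), Italian (it), Albanian (sq), Rhaeto-Romanic (rm), Dutch (nl), German (de), Danish (da), Swedish (sv), Norwegian (no), Finnish (fi), Faroese (fo), Icelandic (is), Irish (ga), Scottish (gd), English (en), Afrikaans (af), and Swahili (sw). It is also used across the Americas, Australia, and Africa.
ISO 8859-2 (Latin2)	Latin-2; covers Central and Eastern European languages: Czech (cs), Hungarian (hu), Polish (pl), Romanian (ro), Croatian (hr), Slovak (sk), Slovenian (sl), Sorbian.
ISO 8859-3 (Latin3)	Latin-3; includes Esperanto (eo) and Maltese (mt)
ISO 8859-4 (Latin4)	Latin-4; includes Estonian (et), the Baltic languages Latvian (lv) and Lithuanian (lt), Greenlandic (kl), Lappish.
ISO 8859-5 (Cyrillic)	Bulgarian (bg), Byelorussian (be), Macedonian (mk), Russian (ru), Serbian (sr)
ISO 8859-6 (Arabic)	Arabic (ar)
ISO 8859-7 (Greek)	Greek (el)
ISO 8859-8 (Hebrew)	Hebrew (iw) and Yiddish (ji)
ISO 8859-9 (Latin5)	A rearrangement of Latin1 with a few letters replaced by Turkish ones
ISO 8859-10 (Latin6)	A rearrangement of Latin4 with some symbols removed and Inuit etc. added
ISO 8859-11 (Thai)	Thai (th)
ISO 8859-12	Celtic
ISO 8859-13 (Latin7)	Baltic Rim and Latvian (lv)
ISO 8859-14 (Latin8)	Gaelic and Welsh (cy)
ISO 8859-15 (Latin9)	A variant of Latin1 with some letters changed

Double-byte character sets mainly cover Chinese, Japanese, and Korean. A character consists of a lead byte and a trail byte. Because one character takes two bytes, internationalization becomes harder: on display, the cursor must never land in the middle of a character, and deletion and movement must operate on whole characters; on input, a preedit server is usually needed to enter characters. The table below lists encoding information for Chinese, Japanese, and Korean: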

Language	Character set	Code page	Lead byte range	Trail byte range
Simplified Chinese	GB2312-1980	CP936	0xA1-0xF7	0xA1-0xFE
Simplified Chinese	GBK	none	0x81-0xFE	0x40-0x7E, 0x80-0xFE
Traditional Chinese	BIG-5	CP950	0x81-0xFE	0x40-0x7E, 0xA1-0xFE
Japanese	Shift-JIS	CP932	0x81-0x9F, 0xE0-0xFC	0x40-0xFC (except 0x7F)
Korean	KSC-5601-1987	CP949	0x81-0xFE	0x41-0x5A, 0x61-0x7A, 0x81-0xFE
Korean	KSC-5601-1992	CP1361	0x84-0xD3 0xD8 0xD9-0xDE 0xE0-0xF9 0x41, 0xFE	0x41-0x7E 0x81-0xFE 0x31-0x7E
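
The "half-character" problem mentioned above comes from treating a string byte by byte. As a rough sketch, using only the GBK lead and trail byte ranges from the table, code can step through a string one whole character at a time. The function names are hypothetical, and real code would normally use the C library's multibyte functions under a GBK locale instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes occupied by the GBK character that starts at s[0],
 * where n bytes are available. A lead byte in 0x81-0xFE followed by a
 * trail byte in 0x40-0x7E or 0x80-0xFE forms one 2-byte character;
 * anything else is treated as a single byte.
 */
size_t
gbk_char_len(const unsigned char *s, size_t n)
{
    assert(s != NULL && n > 0);
    if (n >= 2 && s[0] >= 0x81 && s[0] <= 0xFE &&
        ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE)))
        return 2;
    return 1;
}

/* Count characters (not bytes) in an n-byte GBK buffer. */
size_t
gbk_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    assert(s != NULL);
    while (i < n) {
        i += gbk_char_len(s + i, n - i);   /* never stops inside a character */
        count++;
    }
    return count;
}
```

A cursor or delete operation that advances by gbk_char_len() bytes, rather than by one byte, always lands on a character boundary.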

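Recently the Ministry of Information Industry and the State Bureau of Quality and Technical Supervision jointly issued two new basic national standards for Chinese information processing, providing a way to enter rare and uncommon Chinese characters. One of them, GB18030-2000, "Information technology: Chinese ideograms coded character set for information interchange, extension for the basic set", is a mandatory national standard. It includes more than 27,000 Chinese characters in a total code space of more than 1.5 million code points. It provides a solution for the characters used in personal and place names that postal, household registration, financial, and geographic information systems urgently need, and it gives fields such as Chinese character research and the collation of ancient texts a unified information platform. The standard also includes the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur. The encoding ranges are:

Bytes	Code space	Code points
One byte	0x00-0x80	129
Two bytes	First byte: 0x81-0xFE; second byte: 0x40-0x7E, 0x80-0xFE	23940
Four bytes	The four bytes range over 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39 respectively	1587600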
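The Hong Kong Special Administrative Region has likewise proposed the "Hong Kong Supplementary Character Set" for Big5. Its purpose is to collect the characters that the Hong Kong government and public need in Chinese electronic communication but that Big5 and ISO10646 do not yet contain, so that it can serve as a common Chinese interface and let everyone communicate accurately in Chinese. The set has two encoding schemes, one for Big5 systems and one for ISO10646 platforms. The Big5 version is in effect an updated edition of the Government Common Character Set. ISO10646 does not yet contain every character in the supplementary set; those still missing have been submitted to the Ideographic Rapporteur Group under ISO for possible inclusion in a future edition of ISO10646.

Future Chinese Linux systems should follow the standards and drafts above.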

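Using multibyte characters and wide characters

The characters we normally see in text form are multibyte characters, which are used mainly for file storage and for stream transmission over networks. A Chinese character in GB encoding takes two bytes. The drawback of multibyte characters is that they are awkward for Chinese text processing: deleting a character or moving the cursor can split a character in half. To make text processing easier, programs usually convert a mixed Chinese and English string into a string of equal-width characters, that is, wide characters, for internal use.

The conversion between multibyte strings and wide strings in glibc 2.1.x sometimes has problems. Under X there is another way to convert: use XmbTextListToTextProperty() and XwcTextPropertyToTextList() together.

Unicode

The Unicode in use today is a 16-bit character encoding maintained and improved by the Unicode Consortium, a non-profit computer industry organization. It grew out of joint research between Xerox and Apple. Several companies formed an informal forum, and IBM, Microsoft, and others soon joined. The Unicode Consortium published version 1 of the Unicode Standard in 1991, while the International Organization for Standardization completed a similar encoding, ISO 10646. Because there was no need for two standards, the Unicode Consortium and ISO merged their work in 1991 and 1992. In 1994 China and Japan began work on national standards based on ISO10646. Unicode is now used in many products.

Unicode contains all the characters in wide use in computing today: most of the world's written languages, typographic characters, digits and technical symbols, geometric shapes, and punctuation. Because Unicode is uniform, it can simplify internationalization in most cases. It removes the need to handle multiple code pages, and because it is a 16-bit encoding, the extra processing that double-byte character sets require is no longer needed.

Unicode also has shortcomings as an encoding. For example, code positions are unrelated to sort order, so supporting Unicode is only the first step of internationalization; language-specific information and rules are still needed in practice. Unicode is therefore generally used as a program's internal encoding, and two-way conversion tables to other encodings must be provided.

Finally, although Unicode makes ordinary English text twice as large, a whole system that uses Unicode does not grow by much, because most files on a system are binary. In addition, compression designed for Unicode can shrink a file to the size of the corresponding 8-bit text.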

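Fonts and font sets

Every font used under the X Window System must be registered in the X server under an X Logical Font Description (XLFD) name, which carries a good deal of information about the font. Here are two examples, a Western font and a Chinese font:

-adobe-times-medium-r-normal--14-140-75-75-p-74-iso8859-1
-tlc-song-medium-r-normal--24-240-75-75-c-240-gbk-0

X fonts can also be loaded through a font server, which is especially useful for systems or X terminals that keep no fonts locally. The transport can be TCP or DECnet.

The X server keeps only one copy of each font; when no program is using a font any longer, its memory is released automatically.

A font name contains the foundry, family, weight, size, character set, and so on. It can be abbreviated by replacing omitted parts with asterisks. The Chinese font above, for example, can be shortened to:

-*-song-*-24-*-gbk-0

In real applications a string is often a mix of Chinese and English, so two fonts are needed to draw it. A description that names two or more fonts is a font set. Its usual form is a comma-separated list of fonts, for example:

"-adobe-helvetica-medium-r-normal--14-*-*-*-*-*-iso8859-*,\
-tlc-song-medium-r-normal--14-*-*-*-*-*-gbk-0"

Unfortunately the GB and Big5 encodings of Chinese overlap and cannot be told apart, so a font set cannot specify GB and Big5 fonts at the same time.

How a font set is actually loaded depends on the locale.

Much internationalized software, and many graphics libraries, let the user specify font sets through a resource file. For example, gtk's resource file for Simplified Chinese is /etc/gtk/gtkrc.zh_CN, and the resource file for qt-1.44 (internationalized) is ~/.qti18nrc.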

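Internationalization of messages

Message internationalization is an important part of software internationalization. If the software is to support several languages, message internationalization should be considered at design time. Most software today uses GNU gettext as the basic tool. The basic steps are:

Set the locale with setlocale() when the program initializes
Define gettext macros to keep the code readable
Specify where the message catalogs are located
Mark the strings to translate: _("Some Strings");
When the program is finished, extract the messages with xgettext and translate them
Convert the message file to a .mo file with msgfmt and install it under the locale directory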

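        /* file this_app.c */
        #include <stdio.h>
        #include <locale.h>
        #include <libintl.h>
        #define _(String)  gettext(String)
        #define N_(String) (String)   /* marks a string without translating it */

        int main(void) {
                /* The locale is taken from the environment variables. */
                setlocale(LC_ALL, "");

                /* Set the location and name of the message catalog. */
                bindtextdomain("this_app", "/usr/share/locale");
                textdomain("this_app");

                /* Pass the translated text as data, not as a format string. */
                printf("%s\n", _("Some String"));
                return 0;
        }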
        

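At this point the program is internationalized. Compile and link it into the executable this_app:

gcc -o this_app this_app.c

The localization steps follow.

Extract the messages to translate: xgettext -a -o this_app.po this_app.c
Translate the messages

The file this_app.po contains "Some String":

msgid "Some String"

msgstr ""

Translate it to:

msgid "Some String"

msgstr "一些字符串"

Compile the message file: msgfmt -o this_app.mo this_app.po
Copy the message file into the locale directory, e.g. for Chinese (zh_CN): cp this_app.mo /usr/share/locale/zh_CN/LC_MESSAGES
Run the program: LC_ALL=zh_CN ./this_app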

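Internationalization of input

There are three main kinds of input under the X Window System:

A single keystroke enters a single character
Two or more keys in combination enter a single character
Keystrokes plus a conversion server

The first two are used for Western characters. Special characters in European languages, for example, are usually entered by remapping the keyboard or by using dead keys, special keys that do not advance the cursor when pressed.

On Linux, the program xkeycaps can remap the keyboard and save the resulting table for the whole keyboard, and the xmodmap command can load such a table.

Chinese input mainly uses the third kind. To cover all languages, the X Window System defines the following areas for input:

Preedit area: shows input in progress. Characters the user types should appear here immediately.
Status area: shows the input state. For Chinese it shows the input method, full-width/half-width state, and Chinese/Western punctuation state.
Auxiliary area: shows the list of candidates, and is also called the selection area. It is controlled by the input server.

From the different combinations of preedit and status areas, the X Window System defines four input styles:

Root style: the preedit area and the selection area are both outside the application. The input server handles them, and its interface is a child of the root window. A standalone input bar, as in "中文之星" (Chinese Star), is an example.
OffTheSpot style: the preedit area and the selection area are inside the application, usually in a fixed region at the bottom of the window. This is the default input mode of XEmacs.
OverTheSpot style: the preedit area is at the current insertion point and the status area is in a fixed region of the application. It is often called cursor-following mode and resembles the 智能ABC input method on Windows.
OnTheSpot style: the preedit area and the selection area are both inside the application; the input server sends the content and the application displays it.

For Chinese input the best styles are (3), (4), and (1). Most Chinese input methods must pop up an auxiliary area for the user to choose from; only a few, such as Wubi, are well suited to (4). For the status area, Chinese input usually places it somewhere in a Root-style window or uses a dedicated control bar. The cursor-following mode common on MS Windows can be implemented with (3) or (4). Because some Linux users run X with a virtual desktop, none of these modes is entirely satisfactory.

For an application, the simplest input interface is the Root style, which leaves all display to the input server. It needs the least code and is the best choice for software taking its first steps toward the internationalization standards. For the user's convenience, applications, and especially high-level libraries, should support all four input styles. Unfortunately most software supports only two or three, so current input servers (IM servers) rarely support all four either; it has become a chicken-and-egg problem.

XIM support in some common software and graphics libraries:

Netscape	Root, OffTheSpot, OverTheSpot
Java	Root, OnTheSpot
Qt	Root, OverTheSpot
gtk+	Root, OverTheSpot
rxvt	Root, OffTheSpot, OverTheSpot

Chinese input needs close cooperation between client and server software, which communicate through the XIM (X Input Method) protocol. The input server starts first and registers itself, including its name, with the X server. When a client starts, it asks the X server for an input server that matches its locale (if XMODIFIERS names a server, both the locale and the name are used to select it). Once found, the client picks the most suitable of the styles the input server offers. The client then creates an identifier of its own, an IC (Input Context), for each window that needs input. The IC carries information about the client, and all later communication uses it.

The following shows how to connect to the server using Xlib directly; high-level libraries hide this process: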

             XIM im;
             XIC ic;
             ...
             if ((im = XOpenIM(display, NULL, NULL, NULL)) == NULL) {
                     printf("Error: XOpenIM!\n");
                     exit(1);
             }

             /* Specify the preedit type and so on ... */
             if ((ic = XCreateIC(im,
                     XNInputStyle,   XIMPreeditPosition | XIMStatusNothing,
                     XNClientWindow, window,
                     NULL)) == NULL) {
                     printf("Error: XCreateIC()!\n");
                     XCloseIM(im);
                     exit(1);
             }
             ...

             for (;;) {
                     XNextEvent(display, &event);

                     /* If the input server consumed the event, keep going. */
                     if (XFilterEvent(&event, None) == True)
                             continue;
                     switch (event.type) {
                             case Expose:
                                     XmbDrawString(...);
                                     break;
                             case KeyPress:
                                     count = XmbLookupString(ic,
                                             (XKeyPressedEvent *) &event,
                                             string, len, &keysym, &status);
                                     break;
                             ...
                     }
             }

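The XIM input servers in widest use today are Chinput (Simplified Chinese, with Traditional also supported), xcin (Traditional Chinese), kinput2 (Japanese), and hanIM/ami (Korean).

The Chinese input server Chinput uses the OverTheSpot style as its default input mode. It departs slightly from the standard style: the preedit area is offset from the insertion point, and the input area doubles as the status area, which largely matches users' input habits. It also uses an auxiliary toolbar to show and change the input state. Chinput additionally solves the problem of using GB and Big5 encodings at the same time, the passive input problem, and others. Besides the keyboard, ordinary users can also use handwriting recognition and speech recognition, and the current input architecture basically meets their needs. The author has experimented with handwriting input and found that passive input works with the great majority of software.

Internationalization of printing

Printing under the X Window System is a hard problem, and so far no unified printing standard has emerged. One reason is that the X Window System was designed with display and printing completely separate.

The file formats most often printed on Linux are plain text and PostScript. A Chinese plain-text file generally has to be converted to PostScript before printing. A PostScript file is fairly easy to print if the application included Chinese font information when generating it; otherwise printing is very difficult or even impossible.

The usual way to print Chinese text files today is to convert them to PS with programs such as gb2ps/bg2ps/cnprint, which use Chinese bitmap fonts in the conversion. An existing PS file that contains no Chinese fonts prints as garbage if sent to the printer directly, so such files are usually filtered to use Chinese fonts and then printed. Mr. 陈向阳's filter ps2cps, for example, can print files saved by Netscape. The drawback of this approach is that Chinese and English strings in the resulting PS sometimes do not line up. The best method is to implement Chinese printing at the PostScript level. Mr. 陈向阳 added Chinese support to ghostscript, which can then use TTF fonts directly to print PS files produced by Netscape, Qt/KDE, lyx, and other software with ease. Printing implemented at this low level is also the method used for Japanese and Korean.

Printing with CID (Adobe) fonts is also being tried.

In short, Chinese printing currently lacks a unified standard, and most applications ignore double-byte languages when they generate PS for printing, which makes printing more complicated. As a result, most current Chinese Linux distributions do not support Chinese printing.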

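Internationalization of inter-client communication

Inter-client communication (Interclient Communications Conventions, ICCC for short) is one of the ways clients share resources. The most common uses are copying and pasting text and talking to the window manager. If two applications use different character sets, pasting goes wrong, and the pasted content may even be lost. Clients therefore need an internationalized communication protocol.

Communication between an application and the window manager is also inter-client communication.

If the clients use the same character set but different encodings, no data is lost, and compound text (COMPOUND TEXT) should be used for the transfer. X defines the COMPOUND_TEXT atom for transferring mixed Chinese and English strings. For 7-bit encodings, ASCII, or the ISO8859-1 character set, clients can transfer text directly with the XA_STRING atom, without conversion.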

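4. Developing software that conforms to the internationalization standards

Software developed for the X Window System should conform to the internationalization standards as far as possible. That includes setting a suitable locale (see the earlier section on using locales in X), choosing character sets and font sets with care, handling localized text, input methods, and so on. Users are encouraged to use high-level graphics libraries that already handle internationalization well, such as Qt, gtk+, and Java, which avoids having to deal with these issues directly. With Motif, resource internationalization and FontList need attention.

Developing internationalized software

With a high-level graphics library that already supports internationalization, there is little left to think about. Input in particular just works: Chinese can be entered automatically in the standard input widgets, both single-line and multi-line. For fonts, take care to use font sets. Much software expects fonts and font sets to be given in a resource file, so the software should ship a resource file that sets up font sets by default.

What follows describes development based on libX11. Apart from calling the locale functions at startup as described earlier, pay attention to the following when programming:

Font loading: when handling strings, use a FontSet rather than a Font:

XCreateFontSet() - creates a font set
XFreeFontSet() - frees a font set's memory
XFontsOfFontSet() - returns the XFontStructs and font names
XBaseFontNameListOfFontSet() - returns the base font name list of the font set
XLocaleOfFontSet() - returns the locale name of the XFontSet
XExtentsOfFontSet() - returns the maximum extents of the font set

Measuring a string on screen and drawing it:

Xmb/XwcDrawString() - draws only the foreground of the glyphs
Xmb/XwcDrawImageString() - draws foreground and background
Xmb/XwcDrawText() - complex spacing and font sets
Xmb/XwcTextEscapement() - width in pixels in the X direction
Xmb/XwcTextExtents() - extents of the string

Inter-client communication:

Xmb/wcTextListToTextProperty() - locale-dependent text conversion
Xmb/wcTextPropertyToTextList() - locale-dependent text conversion
XFreeStringList()
Xmb/wcFreeStringList() - frees a string list
XSetWMProperties() - sets window manager properties
XSetWMName() - sets the window name
XSetWMIconName() - sets the window's icon name

Input:

XOpenIM()/XCloseIM() - opens/closes the connection to the input server
XDisplayOfIM()/XLocaleOfIM()
XSetIMValues()/XGetIMValues() - sets/gets input server attributes
XCreateIC()/XDestroyIC() - creates/destroys an IC
XIMOfIC()
XSetICValues()/XGetICValues() - sets/gets IC values
XSetICFocus()/XUnsetICFocus() - sets/removes focus
XmbResetIC()/XwcResetIC() - resets an IC
XFilterEvent() - filters events
Xmb/wcLookupString() - looks up the input string
XRegister/UnregisterIMInstantiateCallback() - registers/unregisters a callback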

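Internationalizing existing software

Existing non-internationalized software should be patched according to its particular situation. The modified software should stay compatible with the original and should not affect its existing support for Western and other languages. The locale should be the central point for switching the software's language. The following are lessons from the author's own experience modifying software, offered for reference only.

Set the locale when the program initializes.
Define the gettext macros and bind them to the message files.
Use gettext for all static messages.
Use font sets instead of fonts for drawing text.
Use X's multibyte or wide-character functions for drawing.
Initialize the connection to the XIM server.
In the event loop, use XFilterEvent() to pass events to the XIM server.
Use Xmb/wcLookupString() to look up the input string.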

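5. Current problems in Chinese support

The existing internationalization standards have many problems, most of which stem from the current internationalization architecture. For Chinese support these problems are even more pronounced.

Switching encodings dynamically

Chinese software that supports several internal encodings (GB and Big5) at once is the more complete kind, but switching the encoding dynamically, and especially switching the encoding of the interface (such as menu items), is limited by gettext in message internationalization. Generally, once the software is loaded, all text messages are initialized and are never reloaded afterwards. Even if the messages were reloaded, their lengths would change, and re-laying out the interface would be very difficult.

So existing software switches encodings dynamically only in certain areas; Netscape is an example. Unfortunately Netscape's encoding switch is incomplete: it switches only the display side, and input still has problems. For example, if Netscape is started in a zh_CN.GBK environment and the user moves to a Traditional Chinese page with an input field, input software that detects the encoding of the Input Context automatically will still treat Netscape as GB-encoded, and the input comes out wrong. To enter Big5, the output encoding must be locked to Big5. Chinput has experimented with this; the conclusion is that Big5 can be entered, but the display in the input bar is incorrect.

In general, dynamic encoding switching is easier to implement with a Chinese platform. Several Chinese Linux distributions can do it this way. The principle is very simple: the application or the X server remembers the encoding state of a given window, and all later text display and input follow that encoding. The drawback is that Chinese text in the application's initial interface turns into garbage because its encoding has been converted.

Automatic detection of Chinese encodings

Text viewing, web browsing, and web page translation usually need to detect the encoding of Chinese text automatically, but the GB and Big5 encodings overlap and are hard to tell apart. Few open source detection programs exist, and their results are unsatisfactory, far below the level of current commercial software.

The transition from Chinese platforms to internationalization on Linux

In the long run, because Chinese platforms differ greatly from the internationalization standards in how they display and enter Chinese, a transitional path from Chinese platforms to the standards is urgently needed. During the transition, Chinese platforms and the internationalization standards may coexist for some time.

Take CLE and TurboLinux as examples. Early versions of both used a Chinese platform for displaying and entering Chinese. As more software supported the internationalization standards, they moved to transitional versions in which the Chinese platform and the standards coexisted. By now they have dropped the Chinese platform by default, and it remains in the distributions only as a leftover.

Chinese translation of Linux documentation

Linux documentation here means mainly the command help (man pages), software manuals and notes, and the software's own message files (po). This work still lacks unified management and broad participation from Linux enthusiasts.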

References

Unicode: http://www.unicode.org/
Hong Kong Supplementary Character Set: http://www.digital21.gov.hk/chi/hkscs/introduction.html
CJK information: ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
Linux internationalization resources: http://i18n.linux.org.tw/
Linux internationalization standards: http://www.li18nux.org/
Microsoft internationalization: http://www.microsoft.com/globaldev/

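6. Appendix

Wide-character functions and their ordinary counterparts

Character classification:

Wide-character function	Ordinary C function	Description
iswalnum()	isalnum()	Tests whether the character is a letter or digit
iswalpha()	isalpha()	Tests whether the character is a letter
iswcntrl()	iscntrl()	Tests whether the character is a control character
iswdigit()	isdigit()	Tests whether the character is a digit
iswgraph()	isgraph()	Tests whether the character is a visible character
iswlower()	islower()	Tests whether the character is lowercase
iswprint()	isprint()	Tests whether the character is printable
iswpunct()	ispunct()	Tests whether the character is punctuation
iswspace()	isspace()	Tests whether the character is whitespace
iswupper()	isupper()	Tests whether the character is uppercase
iswxdigit()	isxdigit()	Tests whether the character is a hexadecimal digit

Case conversion:

Wide-character function	Ordinary C function	Description
towlower()	tolower()	Converts the character to lowercase
towupper()	toupper()	Converts the character to uppercase

String comparison:

Wide-character function	Ordinary C function	Description
wcscoll()	strcoll()	Compares strings

Date and time conversion:

Function	Description
strftime()	Formats a date and time according to the given format string and the locale
wcsftime()	Formats a date and time according to the given format string and the locale, producing a wide string
strptime()	Converts a string to a time value according to the given format; the inverse of strftime

Printing and scanning strings:

Function	Description
fprintf()/fwprintf()	Formatted output using a variable argument list
fscanf()/fwscanf()	Formatted input
printf()	Formatted output to standard output using a variable argument list
scanf()	Formatted input from standard input
sprintf()/swprintf()	Formats a variable argument list into a string
sscanf()	Formatted input from a string
vfprintf()/vfwprintf()	Formatted output to a file using a stdarg argument list
vprintf()	Formatted output to standard output using a stdarg argument list
vsprintf()/vswprintf()	Formats a stdarg argument list and writes it to a string

Number conversion:

Wide-character function	Ordinary C function	Description
wcstod()	strtod()	Converts the initial part of a wide string to a double
wcstol()	strtol()	Converts the initial part of a wide string to a long
wcstoul()	strtoul()	Converts the initial part of a wide string to an unsigned long

Conversion between and operations on multibyte and wide characters:

Function	Description
mblen()	Determines the number of bytes in a character according to the locale
mbstowcs()	Converts a multibyte string to a wide string
mbtowc()/btowc()	Converts a multibyte character to a wide character
wcstombs()	Converts a wide string to a multibyte string
wctomb()/wctob()	Converts a wide character to a multibyte character

Wide-character function	Ordinary C function	Description
fgetwc()	fgetc()	Reads a character from a stream and converts it to a wide character
fgetws()	fgets()	Reads a string from a stream and converts it to a wide string
fputwc()	fputc()	Converts a wide character to a multibyte character and writes it to a stream
fputws()	fputs()	Converts a wide string to multibyte characters and writes them to a stream
getwc()	getc()	Reads a character from a stream and converts it to a wide character
getwchar()	getchar()	Reads a character from standard input and converts it to a wide character
None	gets()	Use fgetws() instead
putwc()	putc()	Converts a wide character to a multibyte character and writes it to a stream
putwchar()	putchar()	Converts a wide character to a multibyte character and writes it to standard output
None	puts()	Use fputws() instead
ungetwc()	ungetc()	Pushes a wide character back onto the input stream

String operations:

Wide-character function	Ordinary C function	Description
wcscat()	strcat()	Appends one string to the end of another
wcsncat()	strncat()	Like wcscat(), with a limit on the number of characters appended
wcschr()	strchr()	Finds the first occurrence of a character
wcsrchr()	strrchr()	Finds the last occurrence of a character
wcspbrk()	strpbrk()	Finds the first position in one string of any character from another string
wcswcs()/wcsstr()	strstr()	Finds the first occurrence of one string in another
wcscspn()	strcspn()	Returns the length of the initial segment containing no characters from the second string
wcsspn()	strspn()	Returns the length of the initial segment consisting only of characters from the second string
wcscpy()	strcpy()	Copies a string
wcsncpy()	strncpy()	Like wcscpy(), with a limit on the number of characters copied
wcscmp()	strcmp()	Compares two wide strings
wcsncmp()	strncmp()	Like wcscmp(), with a limit on the number of characters compared
wcslen()	strlen()	Returns the number of characters in a wide string
wcstok()	strtok()	Splits a wide string into tokens at the given delimiters
wcswidth()	None	Returns the display width of a wide string
wcwidth()	None	Returns the display width of a wide character

There are also counterparts of the memory functions: wmemcpy(), wmemchr(), wmemcmp(), wmemmove(), wmemset().

Functions supporting Chinese under the X Window System

Function for Western text	Function supporting Chinese	Description
XLoadFont	XCreateFontSet	Loads a font set
XTextExtents(16)	Xmb/wcTextExtents Xmb/wcTextPerCharExtents	Returns the bounding box of the text
XDrawString	Xmb/wcDrawString	Draws a string in a window
XDrawImageString	Xmb/wcDrawImageString	Draws a string in a window, filling the background
XDrawText	Xmb/wcDrawText	Draws a string in a window
XLookupString	Xmb/wcLookupString	Looks up the input string

High-level libraries that support internationalization

OSF/Motif
Qt/kdelib
gtk+/gnome-lib
Perl
Java

Typical multilingual software

Browser: Netscape
Editor: XEmacs
Editor: Mule
Editor: vim
Terminal: rxvt
Typesetting: LaTeX/lyx
PostScript/PDF: gs/acroread
Image processing: gimp
Presentations: mgp
Nearly finished: StarOffice, KOffice

Software that supports Unicode

High-level graphics library: Qt 2.x
Java development kit: JDK
Editor: yudit
Dedicated Unicode-capable X terminals
GTK+-based text processor: GScript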

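Posted by 幽灵狼, 2005-11-15 11:39

Linux Unicode Programming

幽灵狼 — Tue, 15 Nov 2005 03:39:00 GMT

How to add Unicode to your programs to support foreign languages

Author: Thomas W. Burger

Unicode is a multi-byte character representation system for computers that supports the encoding and conversion of all the world's languages. This article explains why international language support matters in Linux applications and how to plan for Unicode support and build it into a Linux application.

Unicode is not just a programming tool; it is also a political and economic one. Applications without support for the world's languages can usually be used only by people who read and write a language that ASCII supports. That cuts computing built on ASCII off from most of the world's population. Unicode lets programs use any of the world's character sets, so it supports all languages.

Unicode lets programmers give ordinary people software they can use in their own language. Nobody has to learn a foreign language first, and the social and financial benefits of computing become easier to realize. It is easy to imagine how little computers would be used in the United States if users had to learn Urdu to use an Internet browser; the Web would never have appeared.

Linux is strongly committed to Unicode support. It is built into the kernel and the development libraries, and for the most part a few simple calls in a program are enough to bring it into the code automatically.

The foundation of all modern character sets is the American Standard Code for Information Interchange (ASCII), published in 1968 as ANSI X3.4. A notable exception is IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), which was defined before ASCII. ASCII is a coded character set (CCS), in other words a mapping from integers to character representations. ASCII itself is a 7-bit code with 128 characters, and even its 8-bit extensions, which fill a whole byte of binary digits (each 0 or 1), can represent only 2^8 = 256 characters. This is a very restricted coded character set: it cannot represent all the characters of many languages (such as Chinese and Japanese), scientific symbols, ancient scripts (runes and hieroglyphs), or musical notation. Changing the size of a byte so that a larger character set could be encoded would work in principle but is completely impractical, because all computers are built on 8-bit bytes. The solution is a character encoding scheme (CES), which uses fixed-length or variable-length multi-byte sequences to represent numbers larger than 255. Those values are then mapped through the coded character set to the characters they represent.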

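The definition of Unicode
Unicode is often used as a generic term for double-byte character encoding schemes. The official designation of the Unicode CCS 3.1 is ISO 10646-1 Universal Multiple-Octet Coded Character Set (UCS). Unicode 3.1 added 44,946 new encoded characters. Together with the 49,194 characters already in Unicode 3.0, that makes 94,140.

The Unicode coded character set uses a four-dimensional code space made up of 128 three-dimensional groups. Each group contains 256 two-dimensional planes, each plane consists of 256 one-dimensional rows, and each row has 256 cells. Each cell either encodes one character or is declared unused. This encoding concept is called UCS-4: four octets represent each character by giving its group, plane, row, and cell.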
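
As a small illustration of the group/plane/row/cell layout, a 32-bit UCS-4 value can be taken apart with shifts and masks. This is a sketch only; struct ucs4_pos and ucs4_split() are hypothetical names, and the sketch assumes the four octets are simply the four bytes of the code value from most to least significant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the four coordinates of a UCS-4 code value. */
struct ucs4_pos {
    unsigned group;   /* 0..127 */
    unsigned plane;   /* 0..255 */
    unsigned row;     /* 0..255 */
    unsigned cell;    /* 0..255 */
};

/* Split a UCS-4 code value into group, plane, row and cell. */
struct ucs4_pos
ucs4_split(uint32_t code)
{
    struct ucs4_pos p;

    assert(code <= 0x7FFFFFFFu);   /* only 128 groups: the top bit is unused */
    p.group = (code >> 24) & 0x7F;
    p.plane = (code >> 16) & 0xFF;
    p.row   = (code >> 8) & 0xFF;
    p.cell  = code & 0xFF;
    return p;
}
```

For 0x2260 this gives group 0, plane 0, row 0x22, cell 0x60; any value with group 0 and plane 0 lies in the first plane.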

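The first plane (plane 00 of group 00) is the Basic Multilingual Plane (BMP). The BMP defines the characters in common use, arranged as alphabets, syllabaries, ideographs, and assorted symbols and digits. Later planes are for additional characters or other encoded entities not yet invented. The full range is needed to handle all the world's languages, especially some East Asian languages with nearly 64,000 characters.

The BMP is used as a double-byte coded character set, identified as the ISO 10646 UCS-2 form. ISO 10646 UCS-2 is what is meant by Unicode (the two are identical). Like every UCS plane, the BMP contains 256 rows of 256 cells each, and a character is encoded using only the row and cell octets within the BMP. This allows 16-bit coded characters to be used for writing most commercially important languages. UCS-2 needs no code page switching, code extension, or code states. UCS-2 is a simple way to build Unicode into software, but it is limited to the Unicode BMP.

To represent a coded character set (CCS) of more than 2^8 = 256 characters with 8-bit bytes, a character encoding scheme (CES) is needed.

Unicode transformation
On UNIX, the most widely used character encoding scheme is UTF-8. It allows full support for all of Unicode's pages and planes while still handling ASCII correctly. Alternatives to UTF-8 include UCS-4, UTF-16, UTF-7.5, UTF-7, SCSU, HTML, and JAVA.

Unicode Transformation Formats (UTFs) are character encoding schemes that support Unicode by mapping its values to multi-byte codes. This article looks at the most popular format, the UTF-8 character encoding scheme.

UTF-8
The UTF-8 transformation format is becoming the dominant way to exchange international text, because it can support all the world's languages and is still compatible with ASCII. UTF-8 uses a variable-length encoding. Characters from 0 to 0x7f (127) encode as themselves in a single byte, and characters with larger values are encoded in 2 to 6 bytes.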

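Table 1. UTF-8 encoding

0x00000000 - 0x0000007F:		0xxxxxxx
0x00000080 - 0x000007FF:		110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:		1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:		11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:		111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:		1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

A byte of the form 10xxxxxx is a continuation byte; its xxxxxx bit positions are filled with bits of the character's code number in binary. The shortest multi-byte sequence that can represent the code is always the one used.

UTF-8 encoding examples
The Unicode copyright sign, character 0xA9 = 1010 1001, is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

The "not equal to" sign, character 0x2260 = 0010 0010 0110 0000, is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0

The original value can be recovered by extracting the payload bits of each byte:

[1110]0010 [10]001001 [10]100000
0010 001001 100000
0010 0010 0110 0000 = 0x2260

The first byte determines the number of octets that follow; if it is 0x7F or less, it is simply the equivalent ASCII value. Every continuation octet has the form 10xxxxxx, which ensures that it cannot be confused with an ASCII value.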
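
Table 1 and the two worked examples can be reproduced with a short encoder. The sketch below follows the original one-to-six-byte form of UTF-8 shown in the table (later specifications restrict UTF-8 to at most four bytes); utf8_encode() is a hypothetical function written for this illustration, and it does not reject surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code value according to Table 1.
 * out must have room for 6 bytes.
 * Returns the number of bytes written, or 0 if code is above 0x7FFFFFFF.
 */
size_t
utf8_encode(uint32_t code, unsigned char out[6])
{
    /* Lead-byte prefixes for sequence lengths 1..6 (index 0 unused). */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t len, i;

    assert(out != NULL);
    if (code <= 0x7F)
        len = 1;
    else if (code <= 0x7FF)
        len = 2;
    else if (code <= 0xFFFF)
        len = 3;
    else if (code <= 0x1FFFFF)
        len = 4;
    else if (code <= 0x3FFFFFF)
        len = 5;
    else if (code <= 0x7FFFFFFF)
        len = 6;
    else
        return 0;

    /* Fill continuation bytes from the end: 10xxxxxx, six bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(lead[len] | code);
    return len;
}
```

Choosing the length from the smallest range that contains the value is what guarantees the shortest-sequence rule stated under the table.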

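UTF support
Before using UTF-8 on Linux, make sure the distribution ships glibc 2.2 and XFree86 4.0 or later. Earlier versions lack UTF-8 locale support and ISO10646-1 X11 fonts.

Before UTF-8, Linux users relied on a variety of language-specific ASCII extensions: ISO 8859-1 or ISO 8859-2 in Europe, ISO 8859-7 in Greece, and KOI-8 / ISO 8859-5 / CP1251 (Cyrillic) in Russia. This caused many problems in data exchange, and applications had to be written to cope with the differences between encodings. The language support was incomplete and data exchange was untested. The major Linux distributors and application developers are working to make Unicode, mainly in its UTF-8 form, the standard on Linux.

To identify Unicode files, Microsoft recommends that every Unicode file begin with the character ZERO WIDTH NOBREAK SPACE (U+FEFF). It serves as a "signature" or "byte-order mark (BOM)" that identifies the encoding and byte order used in the file. Linux/UNIX does not use BOMs, however, because they would break the syntax conventions of existing ASCII files. On POSIX systems, the selected locale already identifies the encoding expected for all input and output files of a process.

There are two ways to add UTF-8 support to a Linux application. In the first, data is kept in UTF-8 form everywhere, so the software changes very little (the passive approach). In the second, UTF-8 data that has been read is converted into wide-character arrays with standard C library functions (the converting approach). On output, the wcsrtombs() function converts strings back to UTF-8:

Listing 1. wcsrtombs()

#include <wchar.h>
size_t wcsrtombs (char *dest, const wchar_t **src, size_t len, mbstate_t *ps);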

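Which approach to choose depends on the nature of the application. Most applications can work with the passive approach, which is why UTF-8 is so popular on UNIX. Programs like cat and echo need no changes: a byte stream is still just a byte stream, and nothing is done to it. ASCII characters and control codes are unchanged in a UTF-8 locale.

Programs that count characters by counting bytes need small changes. In UTF-8, an application must not count continuation bytes. When a UTF-8 locale is selected, the C library's strlen(s) needs to be replaced with the mbstowcs() function:

Listing 2. The mbstowcs() function

#include <stdlib.h>
size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);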
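
The rule "do not count continuation bytes" can also be applied directly, without going through the locale machinery. The following sketch counts characters in a UTF-8 string by skipping every byte of the form 10xxxxxx. utf8_strlen() is a hypothetical helper; it assumes the input is valid UTF-8, whereas calling mbstowcs(NULL, s, 0) under a UTF-8 locale also reports invalid input.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the characters in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes the string is well-formed UTF-8.
 */
size_t
utf8_strlen(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the six-byte string "a", 0xC2 0xA9, 0xE2 0x89 0xA0 (the letter a, the copyright sign, and the not-equal sign) it returns 3, where strlen() would return 6.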

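A common use of strlen is to estimate display width. Chinese and other ideographic characters occupy two column positions. The wcwidth() function tests the display width of each character:

Listing 3. The wcwidth() function

#include <wchar.h>
int wcwidth(wchar_t wc);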

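C language support for Unicode
Officially, starting with GNU glibc 2.2, the wchar_t type is used only for 32-bit ISO 10646 values, independent of the locale in use. Applications are told this by the definition of the __STDC_ISO_10646__ macro, as required by ISO C99. The definition of __STDC_ISO_10646__ indicates that wchar_t is Unicode. Its exact value is a decimal constant of the form yyyymmL. For example:

Listing 4. Indicating that wchar_t is Unicode

#define __STDC_ISO_10646__ 200104L

indicates that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, together with all amendments and technical corrigenda as of the specified year and month.

The following example uses wchar_t in this way, with the macro deciding how to write a double quotation mark in portable ISO C99 code.

Listing 5. Deciding how to write a double quotation mark

#if __STDC_ISO_10646__
   printf("%lc", (wint_t)0x201c);
#else
   putchar('"');
#endif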

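Locales
The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that holds the culture-specific conventions governing software behavior: character encoding, date and time notation, sorting rules, and measurement system. A locale name usually consists of an ISO 639-1 language code, an ISO 3166-1 country or region code, and an optional encoding name and other qualifiers. The command locale -a lists all the locales installed on the system (usually in /usr/lib/locale/).

If no UTF-8 locale is preinstalled, you can generate one with the localedef command. To generate and activate a German UTF-8 locale for one particular user, use:

Listing 6. Generating a locale for a particular user

localedef -v -c -i de_DE -f UTF-8 $HOME/local/locale/de_DE.UTF-8
export LOCPATH=$HOME/local/locale
export LANG=de_DE.UTF-8

Sometimes it is useful to add a UTF-8 locale for all users. The root user can do so with:

Listing 7. Generating a locale for every user

localedef -v -c -i de_DE -f UTF-8 /usr/share/locale/de_DE.UTF-8

To make this locale the default for every user, add the following line to /etc/profile:

Listing 8. Setting the default locale for all users

export LANG=de_DE.UTF-8

The behavior of functions that handle multibyte character sequences depends on the LC_CTYPE category of the current locale, which determines the locale-dependent multibyte encoding. The value LANG=de_DE (German) causes output to be formatted as ISO 8859-1. The value LANG=de_DE.UTF-8 formats output as UTF-8. The locale setting causes the %ls format specifier in printf to call wcsrtombs() to convert its wide-character string argument to the locale-dependent multibyte encoding. Locales that differ only in country or region, such as LC_CTYPE=en_GB (British English) and LC_CTYPE=en_AU (Australian English), differ only in the LC_MONETARY category, because the currency names and the rules for printing monetary amounts differ.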

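Set the environment variable LANG to your preferred locale. When a C program executes the setlocale() function:

Listing 9. The setlocale() function

#include <locale.h>
#include <stdio.h>
/* char *setlocale(int category, const char *locale); */
int main(void)
{
  if (!setlocale(LC_CTYPE, ""))
  {
    fprintf(stderr, "Locale not specified. Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  /* ... rest of the program ... */
  return 0;
}

the C library tests the environment variables LC_ALL, LC_CTYPE, and LANG in that order. The first of them that has a value determines which locale data is loaded for the LC_CTYPE category. Locale data is split into separate categories: LC_CTYPE defines the character encoding and LC_COLLATE the sort order. The LANG environment variable sets the default locale for all categories, while the LC_* variables can override individual categories.

The command locale charmap reports the name of the character encoding of the current locale. If you have successfully selected a UTF-8 locale for the LC_CTYPE category, it prints UTF-8. The command locale -m lists the names of all installed character encodings.

If you use only the C library's multibyte functions for all conversion between the external character encoding and the wchar_t encoding used internally, the C library takes responsibility for using the right encoding according to LC_CTYPE. The program does not even need to be written explicitly for the current multibyte encoding.

If an application needs to support UTF-8 (or another encoding) explicitly, without the libc multibyte functions, it must determine whether UTF-8 mode should be activated. On X/Open-compatible systems that provide the <langinfo.h> library header, the following code can be used: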

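Listing 10. Testing whether the current locale uses the UTF-8 encoding

#include <langinfo.h>
#include <string.h>

int utf8_mode = 0;   /* 0 = false, 1 = true */

if (!strcmp(nl_langinfo(CODESET), "UTF-8"))
   utf8_mode = 1;

This tests whether the current locale uses the UTF-8 encoding. The setlocale(LC_CTYPE, "") function must be called first, so that the locale is set from the environment variables. nl_langinfo(CODESET) is also what the locale charmap command calls to find the encoding name specified by the current locale.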
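
Put together as complete functions, the check might look like the sketch below. The function names are hypothetical. Comparing the codeset name without regard to case or hyphens is an extra assumption added here so that a spelling such as "utf8" would also be accepted; the text above compares against "UTF-8" only.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>

/*
 * Return 1 if the codeset name means UTF-8, ignoring case and hyphens
 * (so both "UTF-8" and "utf8" match), otherwise 0.
 */
int
codeset_is_utf8(const char *codeset)
{
    const char *want = "UTF8";

    assert(codeset != NULL);
    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-')
            continue;
        if (*want == '\0' || toupper((unsigned char)*codeset) != *want)
            return 0;
        want++;
    }
    return *want == '\0';
}

/*
 * Set LC_CTYPE from the environment and report whether the resulting
 * locale uses UTF-8. Returns 0 if the locale cannot be set.
 */
int
env_locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program would call env_locale_is_utf8() once at startup and keep the result in a flag such as utf8_mode.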

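Another possible method is to query the locale environment variables:

Listing 11. Querying the locale environment variables

char *s;
int utf8_mode = 0;   /* 0 = false, 1 = true */

if ((s = getenv("LC_ALL")) || (s = getenv("LC_CTYPE")) || (s = getenv("LANG")))
{
   if (strstr(s, "UTF-8"))
      utf8_mode = 1;
}

This test assumes that the name of a UTF-8 locale contains the value "UTF-8", which is not always the case, so the nl_langinfo() method should be used instead.

Summary
Supporting all the world's languages requires a character encoding system, with an encoding scheme built on 8-bit bytes, that has more characters than the 2^8 = 256 of ASCII (in its extended form, which uses an unsigned byte). Unicode is such a system. It has a four-dimensional code space made up of 128 three-dimensional groups, with 94,140 defined character values supported by a number of character encoding schemes. The most popular of those schemes on Linux is the Unicode transformation format UTF-8.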

References

The English original of this article is available on the developerWorks worldwide site.

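About the author
TW Burger has been programming, teaching secondary-school computer courses, and writing about computer technology since 1979. He runs an information technology consulting company. You can reach him at twburger@bigfoot.com.

Posted by 幽灵狼, 2005-11-15 11:39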

NetBSD Code Style

幽灵狼 — Tue, 15 Nov 2005 03:38:00 GMT

/* $NetBSD: style,v 1.36 2005/08/25 17:51:58 briggs Exp $ */

/*
* The revision control tag appears first, with a blank line after it.
* Copyright text appears after the revision control tag.
*/

/*
* The NetBSD source code style guide.
* (Previously known as KNF - Kernel Normal Form).
*
* from: @(#)style 1.12 (Berkeley) 3/18/94
*/
/*
* An indent(1) profile approximating the style outlined in
* this document lives in /usr/share/misc/indent.pro. It is a
* useful tool to assist in converting code to KNF, but indent(1)
* output generated using this profile must not be considered to
* be an authoritative reference.
*/

/*
* Source code revision control identifiers appear after any copyright
* text. Use the appropriate macros from <sys/cdefs.h>. Usually only one
* source file per program contains a __COPYRIGHT() section.
* Historic Berkeley code may also have an __SCCSID() section.
* Only one instance of each of these macros can occur in each file.
*/
#include <sys/cdefs.h>
__COPYRIGHT("@(#) Copyright (c) 2000\n\
The NetBSD Foundation, inc. All rights reserved.\n");
__RCSID("$NetBSD: style,v 1.36 2005/08/25 17:51:58 briggs Exp $");

/*
* VERY important single-line comments look like this.
*/

/* Most single-line comments look like this. */

/*
* Multi-line comments look like this. Make them real sentences. Fill
* them so they look like real paragraphs.
*/

/*
* Attempt to wrap lines longer than 80 characters appropriately.
* Refer to the examples below for more information.
*/

/*
* EXAMPLE HEADER FILE:
*
* A header file should protect itself against multiple inclusion.
* E.g, <sys/socket.h> would contain something like:
*/
#ifndef _SYS_SOCKET_H_
#define _SYS_SOCKET_H_
/*
* Contents of #include file go between the #ifndef and the #endif at the end.
*/
#endif /* !_SYS_SOCKET_H_ */
/*
* END OF EXAMPLE HEADER FILE.
*/

/*
* Kernel include files come first.
*/
#include <sys/types.h> /* Non-local includes in brackets. */

/*
* If it's a network program, put the network include files next.
* Group the includes files by subdirectory.
*/
#include <net/if.h>
#include <net/if_dl.h>
#include <net/route.h>
#include <netinet/in.h>
#include <protocols/rwhod.h>

/*
* Then there's a blank line, followed by the /usr include files.
* The /usr include files should be sorted!
*/
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/*
* Global pathnames are defined in /usr/include/paths.h. Pathnames local
* to the program go in pathnames.h in the local directory.
*/
#include <paths.h>

/* Then, there's a blank line, and the user include files. */
#include "pathnames.h" /* Local includes in double quotes. */

/*
* ANSI function declarations for private functions (i.e. functions not used
* elsewhere) and the main() function go at the top of the source module.
* Don't associate a name with the types. I.e. use:
* void function(int);
* Use your discretion on indenting between the return type and the name, and
* how to wrap a prototype too long for a single line. In the latter case,
* lining up under the initial left parenthesis may be more readable.
* In any case, consistency is important!
*/
static char *function(int, int, float, int);
static int dirinfo(const char *, struct stat *, struct dirent *,
struct statfs *, int *, char **[]);
static void usage(void);
int main(int, char *[]);

/*
* Macros are capitalized, parenthesized, and should avoid side-effects.
* Spacing before and after the macro name may be any whitespace, though
* use of TABs should be consistent through a file.
* If they are an inline expansion of a function, the function is defined
* all in lowercase, the macro has the same name all in uppercase.
* If the macro is an expression, wrap the expression in parenthesis.
* If the macro is more than a single statement, use ``do { ... } while (0)'',
* so that a trailing semicolon works. Right-justify the backslashes; it
* makes it easier to read. The CONSTCOND comment is to satisfy lint(1).
*/
#define MACRO(v, w, x, y)               \
do {                                    \
        v = (x) + (y);                  \
        w = (y) + 2;                    \
} while (/* CONSTCOND */ 0)

#define DOUBLE(x) ((x) * 2)

/* Enum types are capitalized. No comma on the last element. */
enum enumtype {
ONE,
TWO
} et;

/*
* When declaring variables in structures, declare them organized by use in
* a manner to attempt to minimize memory wastage because of compiler alignment
* issues, then by size, and then by alphabetical order. E.g, don't use
* ``int a; char *b; int c; char *d''; use ``int a; int b; char *c; char *d''.
* Each variable gets its own type and line, although an exception can be made
* when declaring bitfields (to clarify that it's part of the one bitfield).
* Note that the use of bitfields in general is discouraged.
*
* Major structures should be declared at the top of the file in which they
* are used, or in separate header files, if they are used in multiple
* source files. Use of the structures should be by separate declarations
* and should be "extern" if they are declared in a header file.
*
* It may be useful to use a meaningful prefix for each member name.
* E.g, for ``struct softc'' the prefix could be ``sc_''.
*/
struct foo {
struct foo *next; /* List of active foo */
struct mumble amumble; /* Comment for mumble */
int bar;
unsigned int baz:1, /* Bitfield; line up entries if desired */
       fuz:5,
       zap:2;
uint8_t flag;
};
struct foo *foohead;  /* Head of global foo list */

/* Make the structure name match the typedef. */
typedef struct BAR {
int level;
} BAR;

/* C99 uintN_t is preferred over u_intN_t. */
uint32_t zero;

/*
* All major routines should have a comment briefly describing what
* they do. The comment before the "main" routine should describe
* what the program does.
*/
int
main(int argc, char *argv[])
{
long num;
int ch;
char *ep;

/*
* At the start of main(), call setprogname() to set the program
* name. This does nothing on NetBSD, but increases portability
* to other systems.
*/
setprogname(argv[0]);

/*
* For consistency, getopt should be used to parse options. Options
* should be sorted in the getopt call and the switch statement, unless
* parts of the switch cascade. Elements in a switch statement that
* cascade should have a FALLTHROUGH comment. Numerical arguments
* should be checked for accuracy. Code that cannot be reached should
* have a NOTREACHED comment.
*/
while ((ch = getopt(argc, argv, "abn")) != -1) {
  switch (ch) {  /* Indent the switch. */
  case 'a':  /* Don't indent the case. */
   aflag = 1;
   /* FALLTHROUGH */
  case 'b':
   bflag = 1;
   break;
  case 'n':
   errno = 0;
   num = strtol(optarg, &ep, 10);
   if (num <= 0 || *ep != '\0' || (errno == ERANGE &&
       (num == LONG_MAX || num == LONG_MIN)) )
    errx(1, "illegal number -- %s", optarg);
   break;
  case '?':
  default:
   usage();
   /* NOTREACHED */
  }
}
argc -= optind;
argv += optind;

/*
* Space after keywords (while, for, return, switch). No braces are
* used for control statements with zero or only a single statement,
* unless it's a long statement.
*
* Forever loops are done with for's, not while's.
*/
for (p = buf; *p != '\0'; ++p)
continue; /* Explicit no-op */
for (;;)
stmt;

/*
* Parts of a for loop may be left empty. Don't put declarations
* inside blocks unless the routine is unusually complicated.
*/
for (; cnt < 15; cnt++) {
stmt1;
stmt2;
}

/* Second level indents are four spaces. */
while (cnt < 20)
  z = a + really + long + statement + that + needs + two + lines +
      gets + indented + four + spaces + on + the + second +
      and + subsequent + lines;

/*
* Closing and opening braces go on the same line as the else.
* Don't add braces that aren't necessary except in cases where
* there are ambiguity or readability issues.
*/
if (test) {
  /*
   * I have a long comment here.
   */
#ifdef zorro
  z = 1;
#else
  b = 3;
#endif
} else if (bar) {
  stmt;
  stmt;
} else
  stmt;

/* No spaces after function names. */
if ((result = function(a1, a2, a3, a4)) == NULL)
exit(1);

/*
* Unary operators don't require spaces, binary operators do.
* Don't excessively use parenthesis, but they should be used if
* statement is really confusing without them, such as:
* a = b->c[0] + ~d == (e || f) || g && h ? i : j >> 1;
*/
a = ((b->c[0] + ~d == (e || f)) || (g && h)) ? i : (j >> 1);
k = !(l & FLAGS);

/*
* Exits should be EXIT_SUCCESS on success, and EXIT_FAILURE on
* failure. Don't denote all the possible exit points, using the
* integers 1 through 127. Avoid obvious comments such as "Exit
* 0 on success.". Since main is a function that returns an int,
* prefer returning from it, than calling exit.
*/
return EXIT_SUCCESS;
}

/*
* The function type must be declared on a line by itself
* preceding the function.
*/
static char *
function(int a1, int a2, float fl, int a4)
{
/*
* When declaring variables in functions declare them sorted by size,
* then in alphabetical order; multiple ones per line are okay.
* Function prototypes should go in the include file "extern.h".
* If a line overflows reuse the type keyword.
*
* DO NOT initialize variables in the declarations.
*/
extern u_char one;
extern char two;
struct foo three, *four;
double five;
int *six, seven;
char *eight, *nine, ten, eleven, twelve, thirteen;
char fourteen, fifteen, sixteen;

/*
* Casts and sizeof's are not followed by a space. NULL is any
* pointer type, and doesn't need to be cast, so use NULL instead
* of (struct foo *)0 or (struct foo *)NULL. Also, test pointers
* against NULL. I.e. use:
*
* (p = f()) == NULL
* not:
* !(p = f())
*
* Don't use `!' for tests unless it's a boolean.
* E.g. use "if (*p == '\0')", not "if (!*p)".
*
* Routines returning ``void *'' should not have their return
* values cast to more specific pointer types.
*
* Use err/warn(3), don't roll your own!
*/
if ((four = malloc(sizeof(struct foo))) == NULL)
err(1, NULL);
if ((six = (int *)overflow()) == NULL)
errx(1, "Number overflowed.");

/* No parentheses are needed around the return value. */
return eight;
}

/*
* Use ANSI function declarations. ANSI function braces look like
* old-style (K&R) function braces.
* As per the wrapped prototypes, use your discretion on how to format
* the subsequent lines.
*/
static int
dirinfo(const char *p, struct stat *sb, struct dirent *de, struct statfs *sf,
int *rargc, char **rargv[])
{ /* Insert an empty line if the function has no local variables. */

/*
* In system libraries, catch obviously invalid function arguments
* using _DIAGASSERT(3).
*/
_DIAGASSERT(p != NULL);
_DIAGASSERT(filedesc != -1);

if (stat(p, sb) < 0)
err(1, "Unable to stat %s", p);

/*
* To printf quantities that might be larger than "long", include
* <inttypes.h>, cast quantities to intmax_t or uintmax_t and use
* the PRI?MAX constants it provides.
*/
(void)printf("The size of %s is %" PRIdMAX " (%#" PRIxMAX ")\n", p,
(intmax_t)sb->st_size, (uintmax_t)sb->st_size);

/*
* To printf quantities of known bit-width, use the corresponding
* defines (generally only done within NetBSD for quantities that
* exceed 32-bits).
*/
(void)printf("%s uses %" PRId64 " blocks and has flags %#" PRIx32 "\n",
p, sb->st_blocks, sb->st_flags);

/*
* There are similar constants that should be used with the *scanf(3)
* family of functions: SCN?MAX, SCN?64, etc.
*/
}

/*
* Functions that support variable numbers of arguments should look like this.
* (With the #include <stdarg.h> appearing at the top of the file with the
* other include files).
*/
#include <stdarg.h>

void
vaf(const char *fmt, ...)
{
va_list ap;

va_start(ap, fmt);
STUFF;
va_end(ap);
/* No return needed for void functions. */
}

static void
usage(void)
{

/*
* Use printf(3), not fputs/puts/putchar/whatever, it's faster and
* usually cleaner, not to mention avoiding stupid bugs.
* Use snprintf(3) or strlcpy(3)/strlcat(3) instead of sprintf(3);
* again to avoid stupid bugs.
*
* Usage statements should look like the manual pages. Options w/o
* operands come first, in alphabetical order inside a single set of
* braces. Followed by options with operands, in alphabetical order,
* each in braces. Followed by required arguments in the order they
* are specified, followed by optional arguments in the order they
* are specified. A bar (`|') separates either/or options/arguments,
* and multiple options/arguments which are specified together are
* placed in a single set of braces.
*
* Use getprogname() instead of hardcoding the program name.
*
* "usage: f [-ade] [-b b_arg] [-m m_arg] req1 req2 [opt1 [opt2]]\n"
* "usage: f [-a | -b] [-c [-de] [-n number]]\n"
*/
(void)fprintf(stderr, "usage: %s [-ab]\n", getprogname());
exit(EXIT_FAILURE);
}
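
The ``do { ... } while (0)'' rule for multi-statement macros is easiest to see inside an if/else. The following is only a sketch and is not part of the NetBSD style file; SWAP_INT and order_pair are hypothetical names made up for this illustration. Because the macro expands to a single statement, the trailing semicolon does not cut the if off from its else.

```c
#include <stdio.h>

/*
 * SWAP_INT is a hypothetical multi-statement macro. Wrapping the body
 * in do { ... } while (0) makes it behave like one statement.
 */
#define SWAP_INT(a, b)                  \
do {                                    \
        int swap_tmp_ = (a);            \
        (a) = (b);                      \
        (b) = swap_tmp_;                \
} while (/* CONSTCOND */ 0)

/* Put *lo and *hi in ascending order. */
static void
order_pair(int *lo, int *hi)
{

        if (*lo > *hi)
                SWAP_INT(*lo, *hi);     /* Trailing semicolon is fine. */
        else
                (void)printf("already ordered\n");
}
```

Had the macro been written with plain braces instead, the semicolon after SWAP_INT(...) would end the if statement and the else would no longer compile.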

幽灵狼 2005-11-15 11:38 Post a comment

ANSI Escape Sequence

幽灵狼 — Tue, 15 Nov 2005 03:37:00 GMT

ANSI Escape Sequence

Clear Display

Function	ESC Sequence	Description
Clear Screen	`ESC[2J`	Clear the whole screen and position the cursor to the top left corner.
Clear Line	`ESC[K`	Clear the line from the cursor position to the rightmost position of the line.

Cursor Movement

Function	ESC Sequence	Description
Move Up	`ESC[numA`	Move the cursor up `num` positions
Move Down	`ESC[numB`	Move the cursor down `num` positions
Move Right	`ESC[numC`	Move the cursor right `num` positions
Move Left	`ESC[numD`	Move the cursor left `num` positions
Move to Position	`ESC[row;colH`	Move the cursor to the (`col`, `row`) position. Note that the row comes before column; that is, y comes before x. Either `col` or `row` can be omitted. Row and column both start with "1," not zero. (1, 1) corresponds to the top-left corner of the screen.
Move to Position	`ESC[row;colf`	Same as above.
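
As a rough illustration of the cursor movement table, the sequences can be built with snprintf(3). This is only a sketch: ansi_goto and ansi_move are hypothetical helper names, and whether a given terminal honours the sequences is an assumption.

```c
#include <stdio.h>

/*
 * Format the "move to position" sequence ESC[row;colH into buf.
 * Row and column are 1-based. Returns what snprintf(3) returns.
 */
static int
ansi_goto(char *buf, size_t len, int row, int col)
{
        return snprintf(buf, len, "\033[%d;%dH", row, col);
}

/*
 * Format a relative move of num positions. dir is 'A' (up),
 * 'B' (down), 'C' (right) or 'D' (left).
 */
static int
ansi_move(char *buf, size_t len, int num, char dir)
{
        return snprintf(buf, len, "\033[%d%c", num, dir);
}
```

Writing the resulting buffer to the terminal (for example with fputs(buf, stdout)) performs the move.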

Save and Restore Cursor Position

Function	ESC Sequence	Description
Save Cursor Position	`ESC[s`	Save the cursor position for later restoration.
Restore Cursor Position	`ESC[u`	Restore the cursor position previously saved.

Character Mode

Function	ESC Sequence	Description
Change Character Mode	`ESC[attrm`	Change the character mode with attribute `attr`. The attributes are numbers listed below.
Change Character Mode	`ESC[attr;...;attrm`	Change the character mode with attributes `attr;...;attr`. The attributes are numbers listed below.
All Off	`0`	All attributes turned off. On most terminals this also resets the foreground and background colors to their defaults.
High Intensity	`1`	Bold.
Low Intensity	`2`	Faint (dim); not supported on all systems.
Italic	`3`	Works only on some systems.
Underline	`4`	Underline font.
Blink	`5`	Blinking font.
Rapid Blink	`6`	Works only on some systems.
Reverse Video	`7`	Swapping the foreground color and the background color.
Invisible	`8`	Do not display characters.
Foreground Color	`30`	Black.
Foreground Color	`31`	Red.
Foreground Color	`32`	Green.
Foreground Color	`33`	Yellow.
Foreground Color	`34`	Blue.
Foreground Color	`35`	Magenta.
Foreground Color	`36`	Cyan.
Foreground Color	`37`	White.
Background Color	`40`	Black.
Background Color	`41`	Red.
Background Color	`42`	Green.
Background Color	`43`	Yellow.
Background Color	`44`	Blue.
Background Color	`45`	Magenta.
Background Color	`46`	Cyan.
Background Color	`47`	White.
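
A small sketch of how the character mode attributes combine in practice. ansi_sgr2 and print_alert are hypothetical names used only for this example; the attribute numbers come from the table above.

```c
#include <stdio.h>

/* Attribute numbers taken from the table above. */
enum { ATTR_RESET = 0, ATTR_BOLD = 1, FG_RED = 31 };

/* Format ESC[attr;attrm for two attributes into buf. */
static int
ansi_sgr2(char *buf, size_t len, int attr1, int attr2)
{
        return snprintf(buf, len, "\033[%d;%dm", attr1, attr2);
}

/* Print text in bold red, then turn all attributes off again. */
static void
print_alert(const char *text)
{
        char on[16];

        (void)ansi_sgr2(on, sizeof(on), ATTR_BOLD, FG_RED);
        (void)printf("%s%s\033[%dm\n", on, text, ATTR_RESET);
}
```

Sending attribute 0 at the end matters: the attributes stay in effect until another character mode sequence changes them.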

thanks to Kenneth Kin Lum.

幽灵狼 2005-11-15 11:37 Post a comment

ANSI.SYS

幽灵狼 — Tue, 15 Nov 2005 03:37:00 GMT

 
                                  ANSI.SYS
 
Defines functions that change display graphics, control cursor movement, and
reassign keys. The ANSI.SYS device driver supports ANSI terminal emulation
of escape sequences to control your system's screen and keyboard. An ANSI
escape sequence is a sequence of ASCII characters, the first two of which
are the escape character (1Bh) and the left-bracket character (5Bh). The
character or characters following the escape and left-bracket characters
specify an alphanumeric code that controls a keyboard or display function.
ANSI escape sequences distinguish between uppercase and lowercase letters;
for example, "A" and "a" have completely different meanings.
 
This device driver must be loaded by a DEVICE or DEVICEHIGH command in
your CONFIG.SYS file.
 
Note:  In this topic bold letters in syntax and ANSI escape sequences
       indicate text you must type exactly as it appears.
 
Syntax
 
    DEVICE=[drive:][path]ANSI.SYS [/X] [/K] [/R]
 
Parameter
 
[drive:][path]
   Specifies the location of the ANSI.SYS file.
 
Switches
 
/X
    Remaps extended keys independently on 101-key keyboards.
 
/K
    Causes ANSI.SYS to treat a 101-key keyboard like an 84-key
    keyboard. This is equivalent to the command SWITCHES=/K.
    If you usually use the SWITCHES=/K command, you will need
    to use the /K switch with ANSI.SYS.
 
/R
     Adjusts line scrolling to improve readability when ANSI.SYS
     is used with screen-reading programs (which make computers
     more accessible to people with disabilities).
 
Parameters used in ANSI escape sequences
 
Pn
    Numeric parameter. Specifies a decimal number.
 
Ps
    Selective parameter. Specifies a decimal number that you use to select
    a function. You can specify more than one function by separating the
    parameters with semicolons.
 
PL
    Line parameter. Specifies a decimal number that represents one of the
    lines on your display or on another device.
 
Pc
    Column parameter. Specifies a decimal number that represents one of the
    columns on your screen or on another device.
 
ANSI escape sequences for cursor movement, graphics, and keyboard settings
 
In the following list of ANSI escape sequences, the abbreviation ESC
represents the ASCII escape character 27 (1Bh), which appears at the
beginning of each escape sequence.
 
ESC[PL;PcH
    Cursor Position: Moves the cursor to the specified position
    (coordinates). If you do not specify a position, the cursor moves to the
    home position, the upper-left corner of the screen (line 0, column
    0). This escape sequence works the same way as the following Cursor
    Position escape sequence.
 
ESC[PL;Pcf
    Cursor Position: Works the same way as the preceding Cursor Position
    escape sequence.
 
ESC[PnA
    Cursor Up: Moves the cursor up by the specified number of lines without
    changing columns. If the cursor is already on the top line, ANSI.SYS
    ignores this sequence.
 
ESC[PnB
    Cursor Down: Moves the cursor down by the specified number of lines
    without changing columns. If the cursor is already on the bottom line,
    ANSI.SYS ignores this sequence.
 
ESC[PnC
    Cursor Forward: Moves the cursor forward by the specified number of
    columns without changing lines. If the cursor is already in the
    rightmost column, ANSI.SYS ignores this sequence.
 
ESC[PnD
    Cursor Backward: Moves the cursor back by the specified number of
    columns without changing lines. If the cursor is already in the leftmost
    column, ANSI.SYS ignores this sequence.
 
ESC[s
    Save Cursor Position: Saves the current cursor position. You can move
    the cursor to the saved cursor position by using the Restore Cursor
    Position sequence.
 
ESC[u
    Restore Cursor Position: Returns the cursor to the position stored
    by the Save Cursor Position sequence.
 
ESC[2J
    Erase Display: Clears the screen and moves the cursor to the home
    position (line 0, column 0).
 
ESC[K
    Erase Line: Clears all characters from the cursor position to the
    end of the line (including the character at the cursor position).
 
ESC[Ps;...;Psm
    Set Graphics Mode: Calls the graphics functions specified by the
    following values. These specified functions remain active until the next
    occurrence of this escape sequence. Graphics mode changes the colors and
    attributes of text (such as bold and underline) displayed on the
    screen.
 
    Text attributes
       0    All attributes off
       1    Bold on
       4    Underscore (on monochrome display adapter only)
       5    Blink on
       7    Reverse video on
       8    Concealed on
 
    Foreground colors
       30    Black
       31    Red
       32    Green
       33    Yellow
       34    Blue
       35    Magenta
       36    Cyan
       37    White
 
    Background colors
       40    Black
       41    Red
       42    Green
       43    Yellow
       44    Blue
       45    Magenta
       46    Cyan
       47    White
 
    Parameters 30 through 47 meet the ISO 6429 standard.
 
ESC[=Psh
    Set Mode: Changes the screen width or type to the mode specified
    by one of the following values:
 
       0      40 x 25 monochrome (text)
       1      40 x 25 color (text)
       2      80 x 25 monochrome (text)
       3      80 x 25 color (text)
       4      320 x 200 4-color (graphics)
       5      320 x 200 monochrome (graphics)
       6      640 x 200 monochrome (graphics)
       7      Enables line wrapping
      13      320 x 200 color (graphics)
      14      640 x 200 color (16-color graphics)
      15      640 x 350 monochrome (2-color graphics)
      16      640 x 350 color (16-color graphics)
      17      640 x 480 monochrome (2-color graphics)
      18      640 x 480 color (16-color graphics)
      19      320 x 200 color (256-color graphics)
 
ESC[=Psl
    Reset Mode: Resets the mode by using the same values that Set Mode
    uses, except for 7, which disables line wrapping. The last character
    in this escape sequence is a lowercase L.
 
ESC[code;string;...p
    Set Keyboard Strings: Redefines a keyboard key to a specified string.
    The parameters for this escape sequence are defined as follows:
 
      Code is one or more of the values listed in the following table.
       These values represent keyboard keys and key combinations. When using
       these values in a command, you must type the semicolons shown in this
       table in addition to the semicolons required by the escape sequence.
       The codes in parentheses are not available on some keyboards.
       ANSI.SYS will not interpret the codes in parentheses for those
       keyboards unless you specify the /X switch in the DEVICE command for
       ANSI.SYS.
 
      String is either the ASCII code for a single character or a string
       contained in quotation marks. For example, both 65 and "A" can be
       used to represent an uppercase A.
 
IMPORTANT:  Some of the values in the following table are not valid for all
            computers. Check your computer's documentation for values that
            are different.
 
Key                       Code      SHIFT+code  CTRL+code  ALT+code
---------------------------------------------------------------------------
F1                        0;59      0;84        0;94       0;104
F2                        0;60      0;85        0;95       0;105
F3                        0;61      0;86        0;96       0;106
F4                        0;62      0;87        0;97       0;107
F5                        0;63      0;88        0;98       0;108
F6                        0;64      0;89        0;99       0;109
F7                        0;65      0;90        0;100      0;110
F8                        0;66      0;91        0;101      0;111
F9                        0;67      0;92        0;102      0;112
F10                       0;68      0;93        0;103      0;113
F11                       0;133     0;135       0;137      0;139
F12                       0;134     0;136       0;138      0;140
HOME (num keypad)         0;71      55          0;119      --
UP ARROW (num keypad)     0;72      56          (0;141)    --
PAGE UP (num keypad)      0;73      57          0;132      --
LEFT ARROW (num keypad)   0;75      52          0;115      --
RIGHT ARROW (num keypad)  0;77      54          0;116      --
END (num keypad)          0;79      49          0;117      --
DOWN ARROW (num keypad)   0;80      50          (0;145)    --
PAGE DOWN (num keypad)    0;81      51          0;118      --
INSERT (num keypad)       0;82      48          (0;146)    --
DELETE (num keypad)       0;83      46          (0;147)    --
HOME                      (224;71)  (224;71)    (224;119)  (224;151)
UP ARROW                  (224;72)  (224;72)    (224;141)  (224;152)
PAGE UP                   (224;73)  (224;73)    (224;132)  (224;153)
LEFT ARROW                (224;75)  (224;75)    (224;115)  (224;155)
RIGHT ARROW               (224;77)  (224;77)    (224;116)  (224;157)
END                       (224;79)  (224;79)    (224;117)  (224;159)
DOWN ARROW                (224;80)  (224;80)    (224;145)  (224;154)
PAGE DOWN                 (224;81)  (224;81)    (224;118)  (224;161)
INSERT                    (224;82)  (224;82)    (224;146)  (224;162)
DELETE                    (224;83)  (224;83)    (224;147)  (224;163)
PRINT SCREEN              --        --          0;114      --
PAUSE/BREAK               --        --          0;0        --
BACKSPACE                 8         8           127        (0)
ENTER                     13        --          10         (0;28)
TAB                       9         0;15        (0;148)    (0;165)
NULL                      0;3       --          --         --
A                         97        65          1          0;30
B                         98        66          2          0;48
C                         99        67          3          0;46
D                         100       68          4          0;32
E                         101       69          5          0;18
F                         102       70          6          0;33
G                         103       71          7          0;34
H                         104       72          8          0;35
I                         105       73          9          0;23
J                         106       74          10         0;36
K                         107       75          11         0;37
L                         108       76          12         0;38
M                         109       77          13         0;50
N                         110       78          14         0;49
O                         111       79          15         0;24
P                         112       80          16         0;25
Q                         113       81          17         0;16
R                         114       82          18         0;19
S                         115       83          19         0;31
T                         116       84          20         0;20
U                         117       85          21         0;22
V                         118       86          22         0;47
W                         119       87          23         0;17
X                         120       88          24         0;45
Y                         121       89          25         0;21
Z                         122       90          26         0;44
1                         49        33          --         0;120
2                         50        64          0          0;121
3                         51        35          --         0;122
4                         52        36          --         0;123
5                         53        37          --         0;124
6                         54        94          30         0;125
7                         55        38          --         0;126
8                         56        42          --         0;127
9                         57        40          --         0;128
0                         48        41          --         0;129
-                         45        95          31         0;130
=                         61        43          --         0;131
[                         91        123         27         0;26
]                         93        125         29         0;27
\                         92        124         28         0;43
;                         59        58          --         0;39
'                         39        34          --         0;40
,                         44        60          --         0;51
.                         46        62          --         0;52
/                         47        63          --         0;53
`                         96        126         --         (0;41)
ENTER (keypad)            13        --          10         (0;166)
/ (keypad)                47        47          (0;142)    (0;74)
* (keypad)                42        (0;144)     (0;78)     --
- (keypad)                45        45          (0;149)    (0;164)
+ (keypad)                43        43          (0;150)    (0;55)
5 (keypad)                (0;76)    53          (0;143)    --
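
To make the Set Keyboard Strings sequence concrete, here is a sketch that formats a reassignment such as ESC[0;59;"DIR";13p, which would make F1 (code 0;59 in the table) type DIR followed by ENTER (13). ansi_remap_key is a hypothetical helper name, and the sequence is only expected to have any effect where ANSI.SYS is loaded.

```c
#include <stdio.h>

/*
 * Format ESC[code;"string";13p into buf. code is a key code from the
 * table (for example "0;59" for F1); string is the replacement text.
 * The trailing 13 appends a carriage return (ENTER).
 */
static int
ansi_remap_key(char *buf, size_t len, const char *code, const char *string)
{
        return snprintf(buf, len, "\033[%s;\"%s\";13p", code, string);
}
```

The helper does not escape quotation marks inside string, so it is only suitable for simple command text.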

幽灵狼 2005-11-15 11:37 Post a comment

Problem when compiling unpv12e in FreeBSD(no problems with unpv13e)

幽灵狼 — Fri, 11 Nov 2005 07:25:00 GMT

Problem when building lib:

spiwolf@fb$ cd lib
spiwolf@fb$ gmake
gcc -g -O2 -Wall -c mcast_leave.c
mcast_leave.c: In function `mcast_leave':
mcast_leave.c:26: `IPV6_DROP_MEMBERSHIP' undeclared (first use in this function)
mcast_leave.c:26: (Each undeclared identifier is reported only once
mcast_leave.c:26: for each function it appears in.)
*** Error code 1

Stop in /home/gabriel/unpv12e/lib.

The fix is to rename the constants:

mcast_leave.c: change IPV6_DROP_MEMBERSHIP to IPV6_LEAVE_GROUP
mcast_join.c: change IPV6_ADD_MEMBERSHIP to IPV6_JOIN_GROUP

IIRC these names were changed by a later RFC (2553, which obsoletes 2133 and is itself obsoleted by 3493). As a matter of fact, I was grousing about this very issue a while ago: http://forums.devshed.com/t53905/s.html?perpage=15&pagenumber=2
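
An alternative to editing both source files is a small compatibility shim in a shared header. This is only a sketch and is not part of the unpv12e sources; it assumes the system headers provide the newer names whenever the older ones are missing.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Map the older RFC 2133 names onto the newer ones when the system
 * headers only provide the newer names.
 */
#if !defined(IPV6_ADD_MEMBERSHIP) && defined(IPV6_JOIN_GROUP)
#define IPV6_ADD_MEMBERSHIP     IPV6_JOIN_GROUP
#endif
#if !defined(IPV6_DROP_MEMBERSHIP) && defined(IPV6_LEAVE_GROUP)
#define IPV6_DROP_MEMBERSHIP    IPV6_LEAVE_GROUP
#endif
```

With the shim in place, mcast_join.c and mcast_leave.c can keep using the old names unchanged.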

幽灵狼 2005-11-11 15:25 Post a comment

Debugging

幽灵狼 — Fri, 11 Nov 2005 07:24:00 GMT

2.6 Debugging

2.6.1 The Debugger

The debugger that comes with FreeBSD is called gdb (GNU debugger). You start it up by typing

% gdb progname

although most people prefer to run it inside Emacs. You can do this by:

M-x gdb RET progname RET

Using a debugger allows you to run the program under more controlled circumstances. Typically, you can step through the program a line at a time, inspect the value of variables, change them, tell the debugger to run up to a certain point and then stop, and so on. You can even attach to a program that is already running, or load a core file to investigate why the program crashed. It is even possible to debug the kernel, though that is a little trickier than the user applications we will be discussing in this section.

gdb has quite good on-line help, as well as a set of info pages, so this section will concentrate on a few of the basic commands.

Finally, if you find its text-based command-prompt style off-putting, there is a graphical front-end for it (xxgdb) in the ports collection.

This section is intended to be an introduction to using gdb and does not cover specialized topics such as debugging the kernel.

2.6.2 Running a program in the debugger

You will need to have compiled the program with the -g option to get the most out of using gdb. It will work without, but you will only see the name of the function you are in, instead of the source code. If you see a line like:

... (no debugging symbols found) ...

when gdb starts up, you will know that the program was not compiled with the -g option.

At the gdb prompt, type break main. This will tell the debugger to skip over the preliminary set-up code in the program and start at the beginning of your code. Now type run to start the program--it will start at the beginning of the set-up code and then get stopped by the debugger when it calls main(). (If you have ever wondered where main() gets called from, now you know!).

You can now step through the program, a line at a time, by pressing n. If you get to a function call, you can step into it by pressing s. Once you are inside a function, you can run until it returns to its caller by typing finish. You can also use up and down to take a quick look at the caller.

Here is a simple example of how to spot a mistake in a program with gdb. This is our program (with a deliberate mistake):

#include <stdio.h>

int bazz(int anint);

main() {
    int i;

    printf("This is my program\n");
    bazz(i);
    return 0;
}

int bazz(int anint) {
    printf("You gave me %d\n", anint);
    return anint;
}

This program is meant to set i to 5 and pass it to a function bazz(), which prints out the number we gave it.

When we compile and run the program we get

% cc -g -o temp temp.c
% ./temp
This is my program
You gave me 4231

That was not what we expected! Time to see what is going on!

% gdb temp
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.13 (i386-unknown-freebsd), Copyright 1994 Free Software Foundation, Inc.
(gdb) break main               Skip the set-up code
Breakpoint 1 at 0x160f: file temp.c, line 9.    gdb puts breakpoint at main()
(gdb) run                   Run as far as main()
Starting program: /home/james/tmp/temp      Program starts running

Breakpoint 1, main () at temp.c:9       gdb stops at main()
(gdb) n                       Go to next line
This is my program              Program prints out
(gdb) s                       step into bazz()
bazz (anint=4231) at temp.c:17          gdb displays stack frame
(gdb)

Hang on a minute! How did anint get to be 4231? Did we not set it to be 5 in main()? Let's move up to main() and have a look.

(gdb) up                   Move up call stack
#1  0x1625 in main () at temp.c:11      gdb displays stack frame
(gdb) p i                   Show us the value of i
$1 = 4231                   gdb displays 4231

Oh dear! Looking at the code, we forgot to initialize i. We meant to put

...
main() {
    int i;

    i = 5;
    printf("This is my program\n");
...

but we left the i=5; line out. As we did not initialize i, it had whatever number happened to be in that area of memory when the program ran, which in this case happened to be 4231.

Note: gdb displays the stack frame every time we go into or out of a function, even if we are using up and down to move around the call stack. This shows the name of the function and the values of its arguments, which helps us keep track of where we are and what is going on. (The stack is a storage area where the program stores information about the arguments passed to functions and where to go when it returns from a function call).

2.6.3 Examining a core file

A core file is basically a file which contains the complete state of the process when it crashed. In “the good old days”, programmers had to print out hex listings of core files and sweat over machine code manuals, but now life is a bit easier. Incidentally, under FreeBSD and other 4.4BSD systems, a core file is called progname.core instead of just core, to make it clearer which program a core file belongs to.

To examine a core file, start up gdb in the usual way. Instead of typing break or run, type

(gdb) core progname.core

If you are not in the same directory as the core file, you will have to do dir /path/to/core/file first.

You should see something like this:

% gdb a.out
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.13 (i386-unknown-freebsd), Copyright 1994 Free Software Foundation, Inc.
(gdb) core a.out.core
Core was generated by `a.out'.
Program terminated with signal 11, Segmentation fault.
Cannot access memory at address 0x7020796d.
#0  0x164a in bazz (anint=0x5) at temp.c:17
(gdb)

In this case, the program was called a.out, so the core file is called a.out.core. We can see that the program crashed due to trying to access an area in memory that was not available to it in a function called bazz.

Sometimes it is useful to be able to see how a function was called, as the problem could have occurred a long way up the call stack in a complex program. The bt command causes gdb to print out a back-trace of the call stack:

(gdb) bt
#0  0x164a in bazz (anint=0x5) at temp.c:17
#1  0xefbfd888 in end ()
#2  0x162c in main () at temp.c:11
(gdb)

The end() function is called when a program crashes; in this case, the bazz() function was called from main().

2.6.4 Attaching to a running program

One of the neatest features about gdb is that it can attach to a program that is already running. Of course, that assumes you have sufficient permissions to do so. A common problem is when you are stepping through a program that forks, and you want to trace the child, but the debugger will only let you trace the parent.

What you do is start up another gdb, use ps to find the process ID for the child, and do

(gdb) attach pid

in gdb, and then debug as usual.

“That is all very well,” you are probably thinking, “but by the time I have done that, the child process will be over the hill and far away”. Fear not, gentle reader, here is how to do it (courtesy of the gdb info pages):

...
if ((pid = fork()) < 0)     /* _Always_ check this */
    error();
else if (pid == 0) {        /* child */
    int PauseMode = 1;

    while (PauseMode)
        sleep(10);  /* Wait until someone attaches to us */
    ...
} else {            /* parent */
    ...

Now all you have to do is attach to the child, set PauseMode to 0, and wait for the sleep() call to return!


This, and other documents, can be downloaded from ftp://ftp.FreeBSD.org/pub/FreeBSD/doc/.

For questions about FreeBSD, read the documentation before contacting <questions@FreeBSD.org>.
For questions about this documentation, e-mail <doc@FreeBSD.org>.

Posted by 幽灵狼 at 2005-11-11 15:24

Compiling

幽灵狼 — Fri, 11 Nov 2005 07:23:00 GMT

2.4 Compiling with `cc`

This section deals only with the GNU compiler for C and C++, since that comes with the base FreeBSD system. It can be invoked by either cc or gcc. The details of producing a program with an interpreter vary considerably between interpreters, and are usually well covered in the documentation and on-line help for the interpreter.

Once you have written your masterpiece, the next step is to convert it into something that will (hopefully!) run on FreeBSD. This usually involves several steps, each of which is done by a separate program.

1. Pre-process your source code to remove comments and do other tricks like expanding macros in C.
2. Check the syntax of your code to see if you have obeyed the rules of the language. If you have not, it will complain!
3. Convert the source code into assembly language--this is very close to machine code, but still understandable by humans. Allegedly. [1]
4. Convert the assembly language into machine code--yep, we are talking bits and bytes, ones and zeros here.
5. Check that you have used things like functions and global variables in a consistent way. For example, if you have called a non-existent function, it will complain.
6. If you are trying to produce an executable from several source code files, work out how to fit them all together.
7. Work out how to produce something that the system's run-time loader will be able to load into memory and run.
8. Finally, write the executable on the filesystem.

The word compiling is often used to refer to just steps 1 to 4--the others are referred to as linking. Sometimes step 1 is referred to as pre-processing and steps 3-4 as assembling.

Fortunately, almost all this detail is hidden from you, as cc is a front end that manages calling all these programs with the right arguments for you; simply typing

% cc foobar.c

will cause foobar.c to be compiled by all the steps above. If you have more than one file to compile, just do something like

% cc foo.c bar.c

Note that the syntax checking is just that--checking the syntax. It will not check for any logical mistakes you may have made, like putting the program into an infinite loop, or using a bubble sort when you meant to use a binary sort. [2]

There are lots and lots of options for cc, which are all in the manual page. Here are a few of the most important ones, with examples of how to use them.

-o filename

The output name of the file. If you do not use this option, cc will produce an executable called a.out. [3]

% cc foobar.c               executable is a.out
% cc -o foobar foobar.c     executable is foobar

-c

Just compile the file, do not link it. Useful for toy programs where you just want to check the syntax, or if you are using a Makefile.

% cc -c foobar.c

This will produce an object file (not an executable) called foobar.o. This can be linked together with other object files into an executable.

-g

Create a debug version of the executable. This makes the compiler put information into the executable about which line of which source file corresponds to which function call. A debugger can use this information to show the source code as you step through the program, which is very useful; the disadvantage is that all this extra information makes the program much bigger. Normally, you compile with -g while you are developing a program and then compile a “release version” without -g when you are satisfied it works properly.

% cc -g foobar.c

This will produce a debug version of the program. [4]

-O

Create an optimized version of the executable. The compiler performs various clever tricks to try to produce an executable that runs faster than normal. You can add a number after the -O to specify a higher level of optimization, but this often exposes bugs in the compiler's optimizer. For instance, the version of cc that comes with the 2.1.0 release of FreeBSD is known to produce bad code with the -O2 option in some circumstances.

Optimization is usually only turned on when compiling a release version.

% cc -O -o foobar foobar.c

This will produce an optimized version of foobar.

The following three flags will force cc to check that your code complies to the relevant international standard, often referred to as the ANSI standard, though strictly speaking it is an ISO standard.

-Wall: Enable all the warnings which the authors of cc believe are worthwhile. Despite the name, it will not enable all the warnings cc is capable of.
-ansi: Turn off most, but not all, of the non-ANSI C features provided by cc. Despite the name, it does not guarantee strictly that your code will comply to the standard.
-pedantic: Turn off all cc's non-ANSI C features.

Without these flags, cc will allow you to use some of its non-standard extensions to the standard. Some of these are very useful, but will not work with other compilers--in fact, one of the main aims of the standard is to allow people to write code that will work with any compiler on any system. This is known as portable code.

Generally, you should try to make your code as portable as possible, as otherwise you may have to completely rewrite the program later to get it to work somewhere else--and who knows what you may be using in a few years time?

% cc -Wall -ansi -pedantic -o foobar foobar.c

This will produce an executable foobar after checking foobar.c for standard compliance.

-llibrary

Specify a function library to be used at link time.

The most common example of this is when compiling a program that uses some of the mathematical functions in C. Unlike most other platforms, these are in a separate library from the standard C one and you have to tell the compiler to add it.

The rule is that if the library is called libsomething.a, you give cc the argument -lsomething. For example, the math library is libm.a, so you give cc the argument -lm. A common “gotcha” with the math library is that it has to be the last library on the command line.

% cc -o foobar foobar.c -lm

This will link the math library functions into foobar.

If you are compiling C++ code, you need to add -lg++, or -lstdc++ if you are using FreeBSD 2.2 or later, to the command line argument to link the C++ library functions. Alternatively, you can run c++ instead of cc, which does this for you. c++ can also be invoked as g++ on FreeBSD.

% cc -o foobar foobar.cc -lg++     For FreeBSD 2.1.6 and earlier
% cc -o foobar foobar.cc -lstdc++  For FreeBSD 2.2 and later
% c++ -o foobar foobar.cc

Each of these will produce an executable foobar from the C++ source file foobar.cc. Note that, on UNIX® systems, C++ source files traditionally end in .C, .cxx or .cc, rather than the MS-DOS® style .cpp (which was already used for something else). gcc used to rely on this to work out what kind of compiler to use on the source file; however, this restriction no longer applies, so you may now call your C++ files .cpp with impunity!

2.4.1 Common `cc` Queries and Problems

2.4.1.1. I am trying to write a program which uses the sin() function and I get an error like this. What does it mean?
2.4.1.2. All right, I wrote this simple program to practice using -lm. All it does is raise 2.1 to the power of 6.
2.4.1.3. So how do I fix this?
2.4.1.4. I compiled a file called foobar.c and I cannot find an executable called foobar. Where has it gone?
2.4.1.5. OK, I have an executable called foobar, I can see it when I run ls, but when I type in foobar at the command prompt it tells me there is no such file. Why can it not find it?
2.4.1.6. I called my executable test, but nothing happens when I run it. What is going on?
2.4.1.7. I compiled my program and it seemed to run all right at first, then there was an error and it said something about “core dumped”. What does that mean?
2.4.1.8. Fascinating stuff, but what am I supposed to do now?
2.4.1.9. When my program dumped core, it said something about a “segmentation fault”. What is that?
2.4.1.10. Sometimes when I get a core dump it says “bus error”. It says in my UNIX book that this means a hardware problem, but the computer still seems to be working. Is this true?
2.4.1.11. This dumping core business sounds as though it could be quite useful, if I can make it happen when I want to. Can I do this, or do I have to wait until there is an error?

2.4.1.1. I am trying to write a program which uses the sin() function and I get an error like this. What does it mean?

/var/tmp/cc0143941.o: Undefined symbol `_sin' referenced from text segment

When using mathematical functions like sin(), you have to tell cc to link in the math library, like so:

% cc -o foobar foobar.c -lm

2.4.1.2. All right, I wrote this simple program to practice using -lm. All it does is raise 2.1 to the power of 6.

#include <stdio.h>

int main() {
    float f;

    f = pow(2.1, 6);
    printf("2.1 ^ 6 = %f\n", f);
    return 0;
}

and I compiled it as:

% cc temp.c -lm

like you said I should, but I get this when I run it:

% ./a.out
2.1 ^ 6 = 1023.000000

This is not the right answer! What is going on?

When the compiler sees you call a function, it checks if it has already seen a prototype for it. If it has not, it assumes the function returns an int, which is definitely not what you want here.

2.4.1.3. So how do I fix this?

The prototypes for the mathematical functions are in math.h. If you include this file, the compiler will be able to find the prototype and it will stop doing strange things to your calculation!

#include <math.h>
#include <stdio.h>

int main() {
...

After recompiling it as you did before, run it:

% ./a.out
2.1 ^ 6 = 85.766121

If you are using any of the mathematical functions, always include math.h and remember to link in the math library.

2.4.1.4. I compiled a file called foobar.c and I cannot find an executable called foobar. Where has it gone?

Remember, cc will call the executable a.out unless you tell it differently. Use the -o filename option:

% cc -o foobar foobar.c

2.4.1.5. OK, I have an executable called foobar, I can see it when I run ls, but when I type in foobar at the command prompt it tells me there is no such file. Why can it not find it?

Unlike MS-DOS, UNIX does not look in the current directory when it is trying to find out which executable you want it to run, unless you tell it to. Either type ./foobar, which means “run the file called foobar in the current directory”, or change your PATH environment variable so that it looks something like

/bin:/usr/bin:/usr/local/bin:.

The dot at the end means “look in the current directory if it is not in any of the others”.

2.4.1.6. I called my executable test, but nothing happens when I run it. What is going on?

Most UNIX systems have a program called test in /usr/bin and the shell is picking that one up before it gets to checking the current directory. Either type:

% ./test

or choose a better name for your program!

2.4.1.7. I compiled my program and it seemed to run all right at first, then there was an error and it said something about “core dumped”. What does that mean?

The name core dump dates back to the very early days of UNIX, when the machines used core memory for storing data. Basically, if the program failed under certain conditions, the system would write the contents of core memory to disk in a file called core, which the programmer could then pore over to find out what went wrong.

2.4.1.8. Fascinating stuff, but what am I supposed to do now?

Use gdb to analyze the core (see Section 2.6).

2.4.1.9. When my program dumped core, it said something about a “segmentation fault”. What is that?

This basically means that your program tried to perform some sort of illegal operation on memory; UNIX is designed to protect the operating system and other programs from rogue programs.

Common causes for this are:

Trying to write to a NULL pointer, e.g.

char *foo = NULL;
strcpy(foo, "bang!");

Using a pointer that has not been initialized, e.g.
```
char *foo;
strcpy(foo, "bang!");
```
The pointer will have some random value that, with luck, will point into an area of memory that is not available to your program and the kernel will kill your program before it can do any damage. If you are unlucky, it will point somewhere inside your own program and corrupt one of your data structures, causing the program to fail mysteriously.

Trying to access past the end of an array, e.g.
```
int bar[20];
bar[27] = 6;
```

Trying to store something in read-only memory, e.g.
```
char *foo = "My string";
strcpy(foo, "bang!");
```
UNIX compilers often put string literals like "My string" into read-only areas of memory.

Doing naughty things with malloc() and free(), e.g.

char bar[80];
free(bar);

char *foo = malloc(27);
free(foo);
free(foo);

Making one of these mistakes will not always lead to an error, but they are always bad practice. Some systems and compilers are more tolerant than others, which is why programs that ran well on one system can crash when you try them on another.

2.4.1.10. Sometimes when I get a core dump it says “bus error”. It says in my UNIX book that this means a hardware problem, but the computer still seems to be working. Is this true?

No, fortunately not (unless of course you really do have a hardware problem...). This is usually another way of saying that you accessed memory in a way you should not have.

2.4.1.11. This dumping core business sounds as though it could be quite useful, if I can make it happen when I want to. Can I do this, or do I have to wait until there is an error?

Yes, just go to another console or xterm, do

% ps

to find out the process ID of your program, and do

% kill -ABRT pid

where pid is the process ID you looked up.

This is useful if your program has got stuck in an infinite loop, for instance. If your program happens to trap SIGABRT, there are several other signals which have a similar effect.

Alternatively, you can create a core dump from inside your program, by calling the abort() function. See the manual page of abort(3) to learn more.

If you want to create a core dump from outside your program, but do not want the process to terminate, you can use the gcore program. See the manual page of gcore(1) for more information.

Notes

[1]	To be strictly accurate, `cc` converts the source code into its own, machine-independent p-code instead of assembly language at this stage.
[2]	In case you did not know, a binary sort is an efficient way of sorting things into order and a bubble sort is not.
[3]	The reasons for this are buried in the mists of history.
[4]	Note, we did not use the `-o` flag to specify the executable name, so we will get an executable called `a.out`. Producing a debug version called `foobar` is left as an exercise for the reader!


Posted by 幽灵狼 at 2005-11-11 15:23

Make

幽灵狼 — Fri, 11 Nov 2005 07:23:00 GMT

2.5 Make

2.5.1 What is `make`?

When you are working on a simple program with only one or two source files, typing in

% cc file1.c file2.c

is not too bad, but it quickly becomes very tedious when there are several files--and it can take a while to compile, too.

One way to get around this is to use object files and only recompile the source file if the source code has changed. So we could have something like:

% cc file1.o file2.o ... file37.c ...

if we had changed file37.c, but not any of the others, since the last time we compiled. This may speed up the compilation quite a bit, but does not solve the typing problem.

Or we could write a shell script to solve the typing problem, but it would have to re-compile everything, making it very inefficient on a large project.

What happens if we have hundreds of source files lying about? What if we are working in a team with other people who forget to tell us when they have changed one of their source files that we use?

Perhaps we could put the two solutions together and write something like a shell script that would contain some kind of magic rule saying when a source file needs compiling. All we need now is a program that can understand these rules, as it is a bit too complicated for the shell.

This program is called make. It reads in a file, called a makefile, that tells it how different files depend on each other, and works out which files need to be re-compiled and which ones do not. For example, a rule could say something like “if fromboz.o is older than fromboz.c, that means someone must have changed fromboz.c, so it needs to be re-compiled.” The makefile also has rules telling make how to re-compile the source file, making it a much more powerful tool.

Makefiles are typically kept in the same directory as the source they apply to, and can be called makefile, Makefile or MAKEFILE. Most programmers use the name Makefile, as this puts it near the top of a directory listing, where it can easily be seen. [1]

2.5.2 Example of using `make`

Here is a very simple makefile:

foo: foo.c
    cc -o foo foo.c

It consists of two lines, a dependency line and a creation line.

The dependency line here consists of the name of the program (known as the target), followed by a colon, then whitespace, then the name of the source file. When make reads this line, it looks to see if foo exists; if it exists, it compares the time foo was last modified to the time foo.c was last modified. If foo does not exist, or is older than foo.c, it then looks at the creation line to find out what to do. In other words, this is the rule for working out when foo.c needs to be re-compiled.

The creation line starts with a tab (press the tab key) and then the command you would type to create foo if you were doing it at a command prompt. If foo is out of date, or does not exist, make then executes this command to create it. In other words, this is the rule which tells make how to re-compile foo.c.

So, when you type make, it will make sure that foo is up to date with respect to your latest changes to foo.c. This principle can be extended to Makefiles with hundreds of targets--in fact, on FreeBSD, it is possible to compile the entire operating system just by typing make world in the appropriate directory!

Another useful property of makefiles is that the targets do not have to be programs. For instance, we could have a make file that looks like this:

foo: foo.c
    cc -o foo foo.c

install:
    cp foo /home/me

We can tell make which target we want to make by typing:

% make target

make will then only look at that target and ignore any others. For example, if we type make foo with the makefile above, make will ignore the install target.

If we just type make on its own, make will always look at the first target and then stop without looking at any others. So if we typed make here, it will just go to the foo target, re-compile foo if necessary, and then stop without going on to the install target.

Notice that the install target does not actually depend on anything! This means that the command on the following line is always executed when we try to make that target by typing make install. In this case, it will copy foo into the user's home directory. This is often used by application makefiles, so that the application can be installed in the correct directory when it has been correctly compiled.

This is a slightly confusing subject to try to explain. If you do not quite understand how make works, the best thing to do is to write a simple program like “hello world” and a make file like the one above and experiment. Then progress to using more than one source file, or having the source file include a header file. The touch command is very useful here--it changes the date on a file without you having to edit it.

2.5.3 Make and include-files

C code often starts with a list of files to include, for example stdio.h. Some of these files are system-include files, some of them are from the project you are now working on:

#include <stdio.h>
#include "foo.h"

int main(....

To make sure that this file is recompiled the moment foo.h is changed, you have to add it in your Makefile:

foo: foo.c foo.h

As your project gets bigger and you have more and more of your own include-files to maintain, it becomes a pain to keep track of all the include-files and the files which depend on them. If you change an include-file but forget to recompile all the files which depend on it, the results will be devastating. gcc has an option to analyze your files and to produce a list of include-files and their dependencies: -MM.

If you add this to your Makefile:

depend:
    gcc -E -MM *.c > .depend

and run make depend, the file .depend will appear with a list of object-files, C-files and the include-files:

foo.o: foo.c foo.h

If you change foo.h, next time you run make all files depending on foo.h will be recompiled.

Do not forget to run make depend each time you add an include-file to one of your files.

2.5.4 FreeBSD Makefiles

Makefiles can be rather complicated to write. Fortunately, BSD-based systems like FreeBSD come with some very powerful ones as part of the system. One very good example of this is the FreeBSD ports system. Here is the essential part of a typical ports Makefile:

MASTER_SITES=   ftp://freefall.cdrom.com/pub/FreeBSD/LOCAL_PORTS/
DISTFILES=      scheme-microcode+dist-7.3-freebsd.tgz

.include <bsd.port.mk>

Now, if we go to the directory for this port and type make, the following happens:

A check is made to see if the source code for this port is already on the system.
If it is not, an FTP connection to the URL in MASTER_SITES is set up to download the source.
The checksum for the source is calculated and compared with one for a known, good, copy of the source. This is to make sure that the source was not corrupted while in transit.
Any changes required to make the source work on FreeBSD are applied--this is known as patching.
Any special configuration needed for the source is done. (Many UNIX® program distributions try to work out which version of UNIX they are being compiled on and which optional UNIX features are present--this is where they are given the information in the FreeBSD ports scenario).
The source code for the program is compiled. In effect, we change to the directory where the source was unpacked and do make--the program's own make file has the necessary information to build the program.
We now have a compiled version of the program. If we wish, we can test it now; when we feel confident about the program, we can type make install. This will cause the program and any supporting files it needs to be copied into the correct location; an entry is also made into a package database, so that the port can easily be uninstalled later if we change our mind about it.

Now I think you will agree that is rather impressive for a four line script!

The secret lies in the last line, which tells make to look in the system makefile called bsd.port.mk. It is easy to overlook this line, but this is where all the clever stuff comes from--someone has written a makefile that tells make to do all the things above (plus a couple of other things I did not mention, including handling any errors that may occur) and anyone can get access to that just by putting a single line in their own make file!

If you want to have a look at these system makefiles, they are in /usr/share/mk, but it is probably best to wait until you have had a bit of practice with makefiles, as they are very complicated (and if you do look at them, make sure you have a flask of strong coffee handy!)

2.5.5 More advanced uses of `make`

Make is a very powerful tool, and can do much more than the simple example above shows. Unfortunately, there are several different versions of make, and they all differ considerably. The best way to learn what they can do is probably to read the documentation--hopefully this introduction will have given you a base from which you can do this.

The version of make that comes with FreeBSD is the Berkeley make; there is a tutorial for it in /usr/share/doc/psd/12.make. To view it, do

% zmore paper.ascii.gz

in that directory.

Many applications in the ports use GNU make, which has a very good set of “info” pages. If you have installed any of these ports, GNU make will automatically have been installed as gmake. It is also available as a port and package in its own right.

To view the info pages for GNU make, you will have to edit the dir file in the /usr/local/info directory to add an entry for it. This involves adding a line like

 * Make: (make).                 The GNU Make utility.

to the file. Once you have done this, you can type info and then select make from the menu (or in Emacs, do C-h i).

Notes

[1]	They do not use the `MAKEFILE` form as block capitals are often used for documentation files like `README`.


Posted by 幽灵狼 at 2005-11-11 15:23
