【Linux Tutorial】Linux select

1

prototype

int select (int __nfds, fd_set *__restrict __readfds,

As you know, select is an implementation of multiple IO multiplexing. It divides the fd that needs to be monitored into three categories: read, write, and exception. It uses fd_set to indicate that when it returns, it either times out or has at least one A read, write or exception event occurs.

2

related data structures

FD_SET

FD_SET is the most important [data structure] of select, and its definition in the kernel is as follows:

typedef __kernel_fd_set     fd_set;

Let’s simplify, fd_set is a struct with only an unsigned long array consisting of 16 elements inside. This array can represent a total of 16 × 64 = 1024 bits, and each bit is used to represent an fd, which is what select is for Read, definite or abnormal each class can only have a maximum of 1024 origins of the fd limit.

3

Related macros

The following macros are defined in the kernel code fs/select.c:

  • FDS_BITPERLONG: returns how many bits each long has, usually 64bits

#define FDS_BITPERLONG    (8*sizeof(long))

  • FDS_LONGS(nr): Obtaining nr fds requires several longs to represent

#define FDS_LONGS(nr) (((nr)+FDS_BITPERLONG-1)/FDS_BITPERLONG)

  • FD_BYTES(nr): How many bytes are needed to obtain nr fds

#define FDS_BYTES(nr) (FDS_LONGS(nr)*sizeof(long))

The following macros can be found in the gcc source code:

  • FD_ZERO: Initialize a fd_set

#define __FD_ZERO(s) \

Set each element of the above-mentioned 16-element unsigned long array to 0;

  • __FD_SET(d, s): assign an fd to an fd_set

#define __FD_SET(d, s) \

In three steps:

a. __FD_ELT(d): Determine which element of the array is assigned to

#define   __FD_ELT(d) \

where #define __NFDBITS (8 * (int) sizeof (__fd_mask)) ie __NFDBITS = 64

The implementation here uses __builtin_constant_p to optimize constants. I don’t understand the difference between constants and non-constant implementations. Let’s ignore this detail for the time being to see the essence.

The essence is that an unsigned long has 64 bits, and directly __d / __NFDBITS modulo can determine which element of the array to use;

b. __FD_MASK(d): Determine which bit of an unsigned long is assigned to

#define   __FD_MASK(d)    ((__fd_mask) (1UL << ((d) % __NFDBITS)))

Directly (d) % __NFDBITS) take the remainder as the number of digits shifted to the left by 1

c. |= : You can use bitwise or assignment;

4

Implementation in the kernel

call level

System call entry location

located in fs/select.c

SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,

Entry function kern_select

static int kern_select(int n, fd_set __user *inp, fd_set __user *outp,

Do three things:

a. If a timeout is set, first prepare the timestamp timespec64;

b. Call core_sys_select, this is the specific implementation, we will focus on it below

c. poll_select_finish: The main job is to update the timeout parameter tvp passed in when the user calls select. Let me list the key codes:

ktime_get_ts64(&rts);

It can be seen that the current timestamp is obtained first, and then the difference is calculated through timespec64_sub and the incoming timestamp (the timeout is passed in, which will be converted into a timestamp when implemented), and the difference is returned to the user. , which returns the remaining timeout period. So this place is a small trap. When the user calls select, the timeout period needs to be reinitialized every time.

Implemented by core_sys_select

The main function of this function is to prepare the fd_set before implementing the real select function, that is, copy the required three types of fd_set from the user space to the kernel space. From the code below you will see that for each select system call, the required three types of fd_sets need to be copied from user space to kernel space, and there is a performance penalty here.

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

The comments in the code are very clear, let’s briefly go through it here, in five steps:

  • Normalize the first parameter n passed in by the select system call

/* max_fds can increase, so grab it once to avoid race */

This n is the maximum value of the fd values ​​included in the three different fd_sets + 1, the linux task open handle starts from 0, and if 1 is not added, it may monitor less fd.

Users can be lazy when using it, that is, set this n to FD_SETSIZE, usually 1024, which expands the scope of monitoring to the upper limit, but in fact, there are far less fds to monitor and waste resources.

The explanation in the linux man is as follows:

nfds should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each set are checked, up to this limit (but see BUGS).

  • To calculate the space of fd_set required by kernel space, three fd_sets are needed in kernel mode to accommodate the parameters passed from user mode, and three fd_sets are needed to accommodate the three fd_sets generated after the return of the select call, that is, a total of 6 fd_sets

long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];

Here’s a little trick, first allocate space from the kernel stack, and if it’s not enough, use kvmalloc to allocate it.

Calculate the number of bytes required for a single fd_set by size = FDS_BYTES(n);

Then pass alloc_size = 6 * size; to calculate the total number of bytes required.

  • Initializes two types of fd_sets used as parameters and used as return values

if ((ret = get_fd_set(n, inp, fds.in)) ||

Mainly use copy_from_user and memset to achieve.

  • The real implementation of part of do_select, we will talk about it in detail below

  • The returned result is copied back to user space

if (set_fd_set(n, inp, fds.res_in) ||

Here is another copy from kernel space to user space, and we see that the return value is also represented by the fd_set structure, which means that we also need to traverse every bit in user space processing.

The essence of do_select

A very important data structure wait queue in Linux is used here. We will not expand it for now. Let’s briefly talk about its usage. For example, when we read in the process, we often have to wait for the data to be ready. We use pseudocode to write it. Some processes:

// Read code

The 360 ​​private cloud platform (HULK platform) manages more than 90% of the business lines of the 360 ​​company. How to manage such a large number of servers? Of course a well-established set of tools is required to automate. The command system of the HULK platform can execute scripts on batch machines. The bottom layer of the command system is developed based on SaltStack. Of course, the biggest problem is that the machines are deployed in many computer rooms.

do_select source code walk

  • Get the largest fd in the current three types of fd_set

rcu_read_lock();

The n in n = retval above is the largest fd in the three types of fd_set, and it is also the upper limit of the loop body to be introduced below

  • Initialize the private data structure used as the wait entry in the wait queue

poll_initwait(&table);

  • core loop body

    We say comments are written in this code block

// The outermost is an infinite loop, which will exit only when poll reaches a valid event, or times out, or an interrupt occurs

A brief summary is as follows:

a. Loop through each monitored fd;

b. Return in one of the following cases:

An event occurs on any monitored fd

c. If the above situation is not found, set the current process to TASK_INTERRUPTIBLE, and call schedule for process switching;

d. Waiting for the socket event to occur, after the corresponding socket wakes up the current process, the current process is re-scheduled and switched back to continue running;

If you are careful, you may have found that this has a problem that affects efficiency: even if only one monitored fd has an event, the current process will be woken up, and then all monitored fds will be traversed, and vfs_poll will be called in turn to get its Effective event, so troublesome~~

5

vfs_poll explanation

  • call level

  • effect

Initialize the wait entry and add it to the waiting queue of the socket corresponding to this fd

  • When joining the waiting queue, __pollwait in fs_select.c will eventually be called

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,

6

Summarize

  • Each type of fd_set in the select call can accommodate up to 1024 fds;

  • Each call to select requires three types of fd_set to be copied from user space to kernel space;

  • Wait queue is a good thing, select will be added to the waiting queue of each monitored socket by the current process task;

  • After the select process is awakened, even if only one monitored fd has an event, it will traverse all the monitored fds once again;

  • In the process of traversing the fd, cond_resched() will be called to actively sell the CPU for process switching;

Service recommendation

Leave a Comment

Your email address will not be published. Required fields are marked *