Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Mastering Linux Kernel Development
Mastering Linux Kernel Development

Mastering Linux Kernel Development: A kernel developer's reference manual

eBook
$29.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Mastering Linux Kernel Development

Comprehending Processes, Address Space, and Threads

When kernel services are invoked in the current process context, its layout throws open the right path for exploring kernels in more detail. Our effort in this chapter is centered around comprehending processes and the underlying ecosystem the kernel provides for them. We will explore the following concepts in this chapter:

  • Program to process
  • Process layout
  • Virtual address spaces
  • Kernel and user space
  • Process APIs
  • Process descriptors
  • Kernel stack management
  • Threads
  • Linux thread API
  • Data structures
  • Namespace and cgroups

Processes

Quintessentially, computing systems are designed, developed, and often tweaked for running user applications efficiently. Every element that goes into a computing platform is intended to enable effective and efficient ways for running applications. In other words, computing systems exist to run diverse application programs. Applications can run either as firmware in dedicated devices or as a "process" in systems driven by system software (operating systems).

At its core, a process is a running instance of a program in memory. The transformation from a program to a process happens when the program (on disk) is fetched into memory for execution.

A program’s binary image carries code (with all its binary instructions) and data (with all global data), which are mapped to distinct regions of memory with appropriate access permissions (read, write, and execute). Apart from code and data, a process is assigned additional memory regions called stack (for allocation of function call frames with auto variables and function arguments) and heap for dynamic allocations at runtime.

Multiple instances of the same program can exist with their respective memory allocations. For instance, for a web browser with multiple open tabs (running simultaneous browsing sessions), each tab is considered a process instance by the kernel, with unique memory allocations.

The following figure represents the layout of processes in memory:

The illusion called address space

Modern-day computing platforms are expected to handle a plethora of processes efficiently. Operating systems thus must deal with allocating unique memory to all contending processes within the physical memory (often finite) and also ensure their reliable execution. With multiple processes contending and executing simultaneously (multi-tasking), the operating system must ensure that the memory allocation of every process is protected from accidental access by another process.

To address this issue, the kernel provides a level of abstraction between the process and the physical memory called virtual address space. Virtual address space is the process' view of memory; it is how the running program views the memory.

Virtual address space creates an illusion that every process exclusively owns the whole memory while executing. This abstracted view of memory is called virtual memory and is achieved by the kernel's memory manager in coordination with the CPU's MMU. Each process is given a contiguous 32 or 64-bit address space, bound by the architecture and unique to that process. With each process caged into its virtual address space by the MMU, any attempt by a process to access an address region outside its boundaries will trigger a hardware fault, making it possible for the memory manger to detect and terminate violating processes, thus ensuring protection.

The following figure depicts the illusion of address space created for every contending process:

Kernel and user space

Modern operating systems not only prevent one process from accessing another but also prevent processes from accidentally accessing or manipulating kernel data and services (as the kernel is shared by all the processes).

Operating systems achieve this protection by segmenting the whole memory into two logical halves, the user and kernel space. This bifurcation ensures that all processes that are assigned address spaces are mapped to the user space section of memory and kernel data and services run in kernel space. The kernel achieves this protection in coordination with the hardware. While an application process is executing instructions from its code segment, the CPU is operating in user mode. When a process intends to invoke a kernel service, it needs to switch the CPU into privileged mode (kernel mode), which is achieved through special functions called APIs (application programming interfaces). These APIs enable user processes to switch into the kernel space using special CPU instructions and then execute the required services through system calls. On completion of the requested service, the kernel executes another mode switch, this time back from kernel mode to user mode, using another set of CPU instructions.

System calls are the kernel's interfaces to expose its services to application processes; they are also called kernel entry points. As system calls are implemented in kernel space, the respective handlers are provided through APIs in the user space. API abstraction also makes it easier and convenient to invoke related system calls.

The following figure depicts a virtualized memory view:

Process context

When a process requests a kernel service through a system call, the kernel will execute on behalf of the caller process. The kernel is now said to be executing in process context. Similarly, the kernel also responds to interrupts raised by other hardware entities; here, the kernel executes in interrupt context. When in interrupt context, the kernel is not running on behalf of any process.

Process descriptors

Right from the time a process is born until it exits, it’s the kernel's process management subsystem that carries out various operations, ranging from process creation, allocating CPU time, and event notifications to destruction of the process upon termination.

Apart from the address space, a process in memory is also assigned a data structure called the process descriptor, which the kernel uses to identify, manage, and schedule the process. The following figure depicts process address spaces with their respective process descriptors in the kernel:

In Linux, a process descriptor is an instance of type struct task_struct defined in <linux/sched.h>, it is one of the central data structures, and contains all the attributes, identification details, and resource allocation entries that a process holds. Looking at struct task_struct is like a peek into the window of what the kernel sees or works with to manage and schedule a process.

Since the task structure contains a wide set of data elements, which are related to the functionality of various kernel subsystems, it would be out of context to discuss the purpose and scope of all the elements in this chapter. We shall consider a few important elements that are related to process management.

Process attributes - key elements

Process attributes define all the key and fundamental characteristics of a process. These elements contain the process's state and identifications along with other key values of importance.

state

A process right from the time it is spawned until it exits may exist in various states, referred to as process states--they define the process’s current state:

  • TASK_RUNNING (0): The task is either executing or contending for CPU in the scheduler run-queue.
  • TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it remains in wait until an awaited condition becomes true, such as the availability of mutual exclusion locks, device ready for I/O, lapse of sleep time, or an exclusive wake-up call. While in this wait state, any signals generated for the process are delivered, causing it to wake up before the wait condition is met.
  • TASK_KILLABLE: This is similar to TASK_INTERRUPTIBLE, with the exception that interruptions can only occur on fatal signals, which makes it a better alternative to TASK_INTERRUPTIBLE.
  • TASK_UNINTERRUTPIBLE (2): The task is in uninterruptible wait state similar to TASK_INTERRUPTIBLE, except that generated signals to the sleeping process do not cause wake-up. When the event occurs for which it is waiting, the process transitions to TASK_RUNNING. This process state is rarely used.
  • TASK_ STOPPED (4): The task has received a STOP signal. It will be back to running on receiving the continue signal (SIGCONT).
  • TASK_TRACED (8): A process is said to be in traced state when it is being combed, probably by a debugger.
  • EXIT_ZOMBIE (32): The process is terminated, but its resources are not yet reclaimed.
  • EXIT_DEAD (16): The child is terminated and all the resources held by it freed, after the parent collects the exit status of the child using wait.

The following figure depicts process states:

pid

This field contains a unique process identifier referred to as PID. PIDs in Linux are of the type pid_t (integer). Though a PID is an integer, the default maximum number PIDs is 32,768 specified through the /proc/sys/kernel/pid_max interface. The value in this file can be set to any value up to 222 (PID_MAX_LIMIT, approximately 4 million).

To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep track of PIDs in use and assign a unique PID for new processes. Each PID is identified by a bit in the PID bitmap; the value of a PID is determined from the position of its corresponding bit. Bits with value 1 in the bitmap indicate that the corresponding PIDs are in use, and those with value 0 indicate free PIDs. Whenever the kernel needs to assign a unique PID, it looks for the first unset bit and sets it to 1, and conversely to free a PID, it toggles the corresponding bit from 1 to 0.

tgid

This field contains the thread group id. For easy understanding, let's say when a new process is created, its PID and TGID are the same, as the process happens to be the only thread. When the process spawns a new thread, the new child gets a unique PID but inherits the TGID from the parent, as it belongs to the same thread group. The TGID is primarily used to support multi-threaded process. We will delve into further details in the threads section of this chapter.

thread info

This field holds processor-specific state information, and is a critical element of the task structure. Later sections of this chapter contain details about the importance of thread_info.

flags

The flags field records various attributes corresponding to a process. Each bit in the field corresponds to various stages in the lifetime of a process. Per-process flags are defined in <linux/sched.h>:

#define PF_EXITING           /* getting shut down */
#define PF_EXITPIDONE /* pi exit done on shut down */
#define PF_VCPU /* I'm a virtual CPU */
#define PF_WQ_WORKER /* I'm a workqueue worker */
#define PF_FORKNOEXEC /* forked but didn't exec */
#define PF_MCE_PROCESS /* process policy on mce errors */
#define PF_SUPERPRIV /* used super-user privileges */
#define PF_DUMPCORE /* dumped core */
#define PF_SIGNALED /* killed by a signal */
#define PF_MEMALLOC /* Allocating memory */
#define PF_NPROC_EXCEEDED /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC /* used async_schedule*(), used by module init */
#define PF_NOFREEZE /* this thread should not be frozen */
#define PF_FROZEN /* frozen for system suspend */
#define PF_FSTRANS /* inside a filesystem transaction */
#define PF_KSWAPD /* I am kswapd */
#define PF_MEMALLOC_NOIO0 /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE /* Throttle me less: I clean memory */
#define PF_KTHREAD /* I am a kernel thread */
#define PF_RANDOMIZE /* randomize virtual address space */
#define PF_SWAPWRITE /* Allowed to write to swap */
#define PF_NO_SETAFFINITY /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY /* Early kill for mce process policy */
#define PF_MUTEX_TESTER /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK /* this thread called freeze_processes and should not be frozen */

exit_code and exit_signal

These fields contain the exit value of the task and details of the signal that caused the termination. These fields are to be accessed by the parent process through wait() on termination of the child.

comm

This field holds the name of the binary executable used to start the process.

ptrace

This field is enabled and set when the process is put into trace mode using the ptrace() system call.

Process relations - key elements

Every process can be related to a parent process, establishing a parent-child relationship. Similarly, multiple processes spawned by the same process are called siblings. These fields establish how the current process relates to another process.

real_parent and parent

These are pointers to the parent's task structure. For a normal process, both these pointers refer to the same task_struct; they only differ for multi-thread processes, implemented using posix threads. For such cases, real_parent refers to the parent thread task structure and parent refers the process task structure to which SIGCHLD is delivered.

children

This is a pointer to a list of child task structures.

sibling

This is a pointer to a list of sibling task structures.

group_leader

This is a pointer to the task structure of the process group leader.

Scheduling attributes - key elements

All contending processes must be given fair CPU time, and this calls for scheduling based on time slices and process priorities. These attributes contain necessary information that the scheduler uses when deciding on which process gets priority when contending.

prio and static_prio

prio helps determine the priority of the process for scheduling. This field holds static priority of the process within the range 1 to 99 (as specified by sched_setscheduler()) if the process is assigned a real-time scheduling policy. For normal processes, this field holds a dynamic priority derived from the nice value.

se, rt, and dl

Every task belongs to a scheduling entity (group of tasks), as scheduling is done at a per-entity level. se is for all normal processes, rt is for real-time processes, and dl is for deadline processes. We will discuss more on these attributes in the next chapter on scheduling.

policy

This field contains information about the scheduling policy of the process, which helps in determining its priority.

cpus_allowed

This field specifies the CPU mask for the process, that is, on which CPU(s) the process is eligible to be scheduled in a multi-processor system.

rt_priority

This field specifies the priority to be applied by real-time scheduling policies. For non-real-time processes, this field is unused.

Process limits - key elements

The kernel imposes resource limits to ensure fair allocation of system resources among contending processes. These limits guarantee that a random process does not monopolize ownership of resources. There are 16 different types of resource limits, and the task structure points to an array of type struct rlimit, in which each offset holds the current and maximum values for a specific resource.

/*include/uapi/linux/resource.h*/
struct rlimit {
__kernel_ulong_t rlim_cur;
__kernel_ulong_t rlim_max;
};
These limits are specified in include/uapi/asm-generic/resource.h

#define RLIMIT_CPU 0 /* CPU time in sec */
#define RLIMIT_FSIZE 1 /* Maximum filesize */
#define RLIMIT_DATA 2 /* max data size */
#define RLIMIT_STACK 3 /* max stack size */
#define RLIMIT_CORE 4 /* max core file size */
#ifndef RLIMIT_RSS
# define RLIMIT_RSS 5 /* max resident set size */
#endif
#ifndef RLIMIT_NPROC
# define RLIMIT_NPROC 6 /* max number of processes */
#endif
#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE 7 /* max number of open files */
#endif
#ifndef RLIMIT_MEMLOCK
# define RLIMIT_MEMLOCK 8 /* max locked-in-memory
address space */
#endif
#ifndef RLIMIT_AS
# define RLIMIT_AS 9 /* address space limit */
#endif
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
#define RLIMIT_NICE 13 /* max nice prio allowed to
raise to 0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
#define RLIM_NLIMITS 16

File descriptor table - key elements

During the lifetime of a process, it may access various resource files to get its task done. This results in the process opening, closing, reading, and writing to these files. The system must keep track of these activities; file descriptor elements help the system know which files the process holds.

fs

Filesystem information is stored in this field.

files

The file descriptor table contains pointers to all the files that a process opens to perform various operations. The files field contains a pointer, which points to this file descriptor table.

Signal descriptor - key elements

For processes to handle signals, the task structure has various elements that determine how the signals must be handled.

signal

This is of type struct signal_struct, which contains information on all the signals associated with the process.

sighand

This is of type struct sighand_struct, which contains all signal handlers associated with the process.

sigset_t blocked, real_blocked

These elements identify signals that are currently masked or blocked by the process.

pending

This is of type struct sigpending, which identifies signals which are generated but not yet delivered.

sas_ss_sp

This field contains a pointer to an alternate stack, which facilitates signal handling.

sas_ss_size

This filed shows the size of the alternate stack, used for signal handling.

Kernel stack

With current-generation computing platforms powered by multi-core hardware capable of running simultaneous applications, the possibility of multiple processes concurrently initiating kernel mode switch when requesting for the same process is built in. To be able to handle such situations, kernel services are designed to be re-entrant, allowing multiple processes to step in and engage the required services. This mandated the requesting process to maintain its own private kernel stack to keep track of the kernel function call sequence, store local data of the kernel functions, and so on.

The kernel stack is directly mapped to the physical memory, mandating the arrangement to be physically in a contiguous region. The kernel stack by default is 8kb for x86-32 and most other 32-bit systems (with an option of 4k kernel stack to be configured during kernel build), and 16kb on an x86-64 system.

When kernel services are invoked in the current process context, they need to validate the process’s prerogative before it commits to any relevant operations. To perform such validations, the kernel services must gain access to the task structure of the current process and look through the relevant fields. Similarly, kernel routines might need to have access to the current task structure for modifying various resource structures such as signal handler tables, looking for pending signals, file descriptor table, and memory descriptor among others. To enable accessing the task structure at runtime, the address of the current task structure is loaded into a processor register (register chosen is architecture specific) and made available through a kernel global macro called current (defined in architecture-specific kernel header asm/current.h ):

  /* arch/ia64/include/asm/current.h */
#ifndef _ASM_IA64_CURRENT_H
#define _ASM_IA64_CURRENT_H
/*
* Modified 1998-2000
* David Mosberger-Tang <[email protected]>, Hewlett-Packard Co
*/
#include <asm/intrinsics.h>
/*
* In kernel mode, thread pointer (r13) is used to point to the
current task
* structure.
*/
#define current ((struct task_struct *) ia64_getreg(_IA64_REG_TP))
#endif /* _ASM_IA64_CURRENT_H */
/* arch/powerpc/include/asm/current.h */
#ifndef _ASM_POWERPC_CURRENT_H
#define _ASM_POWERPC_CURRENT_H
#ifdef __KERNEL__
/*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
struct task_struct;
#ifdef __powerpc64__
#include <linux/stddef.h>
#include <asm/paca.h>
static inline struct task_struct *get_current(void)
{
struct task_struct *task;

__asm__ __volatile__("ld %0,%1(13)"
: "=r" (task)
: "i" (offsetof(struct paca_struct, __current)));
return task;
}
#define current get_current()
#else
/*
* We keep `current' in r2 for speed.
*/
register struct task_struct *current asm ("r2");
#endif
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_CURRENT_H */

However, in register-constricted architectures, where there are few registers to spare, reserving a register to hold the address of the current task structure is not viable. On such platforms, the task structure of the current process is directly made available at the top of the kernel stack that it owns. This approach renders a significant advantage with respect to locating the task structure, by just masking the least significant bits of the stack pointer.

With the evolution of the kernel, the task structure grew and became too large to be contained in the kernel stack, which is already restricted in physical memory (8Kb). As a result, the task structure was moved out of the kernel stack, barring a few key fields that define the process's CPU state and other low-level processor-specific information. These fields were then wrapped in a newly created structure called struct thread_info. This structure is contained on top of the kernel stack and provides a pointer that refers to the current task structure, which can be used by kernel services.

The following code snippet shows struct thread_info for x86 architecture (kernel 3.10):

/* linux-3.10/arch/x86/include/asm/thread_info.h */

struct thread_info {
struct task_struct *task; /* main task structure */
struct exec_domain *exec_domain; /* execution domain */
__u32 flags; /* low level flags */
__u32 status; /* thread synchronous flags */
__u32 cpu; /* current CPU */
int preempt_count; /* 0 => preemptable, <0 => BUG */
mm_segment_t addr_limit;
struct restart_block restart_block;
void __user *sysenter_return;
#ifdef CONFIG_X86_32
unsigned long previous_esp; /* ESP of the previous stack in case of
nested (IRQ) stacks */
__u8 supervisor_stack[0];
#endif
unsigned int sig_on_uaccess_error:1;
unsigned int uaccess_err:1; /* uaccess failed */
};

With thread_info containing process-related information, apart from task structure, the kernel has multiple viewpoints to the current process structure: struct task_struct, an architecture-independent information block, and thread_info, an architecture-specific one. The following figure depicts thread_info and task_struct:

For architectures that engage thread_info, the current macro's implementation is modified to look into the top of kernel stack to obtain a reference to the current thread_info and through it the current task structure. The following code snippet shows the implementation of current for an x86-64 platform:

  #ifndef __ASM_GENERIC_CURRENT_H
#define __ASM_GENERIC_CURRENT_H
#include <linux/thread_info.h>

#define get_current() (current_thread_info()->task)
#define current get_current()

#endif /* __ASM_GENERIC_CURRENT_H */
/*
* how to get the current stack pointer in C
*/
register unsigned long current_stack_pointer asm ("sp");

/*
* how to get the thread information struct from C
*/
static inline struct thread_info *current_thread_info(void)
__attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
return (struct thread_info *)
(current_stack_pointer & ~(THREAD_SIZE - 1));
}

As use of PER_CPU variables has increased in recent times, the process scheduler is tuned to cache crucial current process-related information in the PER_CPU area. This change enables quick access to current process data over looking up the kernel stack. The following code snippet shows the implementation of the current macro to fetch the current task data through the PER_CPU variable:

  #ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H

#include <linux/compiler.h>
#include <asm/percpu.h>

#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_CURRENT_H */

The use of PER_CPU data led to a gradual reduction of information in thread_info. With thread_info shrinking in size, kernel developers are considering getting rid of thread_info altogether by moving it into the task structure. As this involves changes to low-level architecture code, it has only been implemented for the x86-64 architecture, with other architectures planned to follow. The following code snippet shows the current state of the thread_info structure with just one element:

/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
unsigned long flags; /* low level flags */
};

The issue of stack overflow

Unlike user mode, the kernel mode stack lives in directly mapped memory. When a process invokes a kernel service, which may internally be deeply nested, chances are that it may overrun into immediate memory range. The worst part of it is the kernel will be oblivious to such occurrences. Kernel programmers usually engage various debug options to track stack usage and detect overruns, but these methods are not handy to prevent stack breaches on production systems. Conventional protection through the use of guard pages is also ruled out here (as it wastes an actual memory page).

Kernel programmers tend to follow coding standards--minimizing the use of local data, avoiding recursion, and avoiding deep nesting among others--to cut down the probability of a stack breach. However, implementation of feature-rich and deeply layered kernel subsystems may pose various design challenges and complications, especially with the storage subsystem where filesystems, storage drivers, and networking code can be stacked up in several layers, resulting in deeply nested function calls.

The Linux kernel community has been pondering over preventing such breaches for quite long, and toward that end, the decision was made to expand the kernel stack to 16kb (x86-64, since kernel 3.15). Expansion of the kernel stack might prevent some breaches, but at the cost of engaging much of the directly mapped kernel memory for the per-process kernel stack. However, for reliable functioning of the system, it is expected of the kernel to elegantly handle stack breaches when they show up on production systems.

With the 4.9 release, the kernel has come with a new system to set up virtually mapped kernel stacks. Since virtual addresses are currently in use to map even a directly mapped page, principally the kernel stack does not actually require physically contiguous pages. The kernel reserves a separate range of addresses for virtually mapped memory, and addresses from this range are allocated when a call to vmalloc() is made. This range of memory is referred as the vmalloc range. Primarily this range is used when programs require huge chunks of memory which are virtually contiguous but physically scattered. Using this, the kernel stack can now be allotted as individual pages, mapped to the vmalloc range. Virtual mapping also enables protection from overruns as a no-access guard page can be allocated with a page table entry (without wasting an actual page). Guard pages would prompt the kernel to pop an oops message on memory overrun and initiate a kill against overrunning process.

Virtually mapped kernel stacks with guard pages are currently available only for the x86-64 architecture (support for other architectures seemingly to follow). This can be enabled by choosing the HAVE_ARCH_VMAP_STACK or CONFIG_VMAP_STACK build-time options.

Process creation

During kernel boot, a kernel thread called init is spawned, which in turn is configured to initialize the first user-mode process (with the same name). The init (pid 1) process is then configured to carry out various initialization operations specified through configuration files, creating multiple processes. Every child process further created (which may in turn create its own child process(es)) are all descendants of the init process. Processes thus created end up in a tree-like structure or a single hierarchy model. The shell, which is one such process, becomes the interface for users to create user processes, when programs are called for execution.

Fork, vfork, exec, clone, wait and exit are the core kernel interfaces for the creation and control of new process. These operations are invoked through corresponding user-mode APIs.

fork()

Fork() is one of the core "Unix thread APIs" available across *nix systems since the inception of legacy Unix releases. Aptly named, it forks a new process from a running process. When fork() succeeds, the new process is created (referred to as child) by duplicating the caller's address space and task structure. On return from fork(), both caller (parent) and new process (child) resume executing instructions from the same code segment which was duplicated under copy-on-write. Fork() is perhaps the only API that enters kernel mode in the context of caller process, and on success returns to user mode in the context of both caller and child (new process).

Most resource entries of the parent's task structure such as memory descriptor, file descriptor table, signal descriptors, and scheduling attributes are inherited by the child, except for a few attributes such as memory locks, pending signals, active timers, and file record locks (for the full list of exceptions, refer to the fork(2) man page). A child process is assigned a unique pid and will refer to its parent's pid through the ppid field of its task structure; the child’s resource utilization and processor usage entries are reset to zero.

The parent process updates itself about the child’s state using the wait() system call and normally waits for the termination of the child process. Failing to call wait(), the child may terminate and be pushed into a zombie state.

Copy-on-write (COW)

Duplication of parent process to create a child needs cloning of the user mode address space (stack, data, code, and heap segments) and task structure of the parent for the child; this would result in execution overhead that leads to un-deterministic process-creation time. To make matters worse, this process of cloning would be rendered useless if neither parent nor child did not initiate any state-change operations on cloned resources.

As per COW, when a child is created, it is allocated a unique task structure with all resource entries (including page tables) referring to the parent's task structure, with read-only access for both parent and child. Resources are truly duplicated when either of the processes initiates a state change operation, hence the name copy-on-write (write in COW implies a state change). COW does bring effectiveness and optimization to the fore, by deferring the need for duplicating process data until write, and in cases where only read happens, it avoids it altogether. This on-demand copying also reduces the number of swap pages needed, cuts down the time spent on swapping, and might help reduce demand paging.

exec

At times creating a child process might not be useful, unless it runs a new program altogether: the exec family of calls serves precisely this purpose. exec replaces the existing program in a process with a new executable binary:

#include <unistd.h>
int execve(const char *filename, char *const argv[],
char *const envp[]);

The execve is the system call that executes the program binary file, passed as the first argument to it. The second and third arguments are null-terminated arrays of arguments and environment strings, to be passed to a new program as command-line arguments. This system call can also be invoked through various glibc (library) wrappers, which are found to be more convenient and flexible:

#include <unistd.h>
extern char **environ;
int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg,
..., char * const envp[]);
int execv(const char *path, char *constargv[]);
int execvp(const char *file, char *constargv[]);
int execvpe(const char *file, char *const argv[],
char *const envp[]);

Command-line user-interface programs such as shell use the exec interface to launch user-requested program binaries.

vfork()

Unlike fork(), vfork() creates a child process and blocks the parent, which means that the child runs as a single thread and does not allow concurrency; in other words, the parent process is temporarily suspended until the child exits or call exec(). The child shares the data of the parent.

Linux support for threads

The flow of execution in a process is referred to as a thread, which implies that every process will at least have one thread of execution. Multi-threaded means the existence of multiple flows of execution contexts in a process. With modern many-core architectures, multiple flows of execution in a process can be truly concurrent, achieving fair multitasking.

Threads are normally enumerated as pure user-level entities within a process that are scheduled for execution; they share parent's virtual address space and system resources. Each thread maintains its code, stack, and thread local storage. Threads are scheduled and managed by the thread library, which uses a structure referred to as a thread object to hold a unique thread identifier, for scheduling attributes and to save the thread context. User-level thread applications are generally lighter on memory, and are the preferred model of concurrency for event-driven applications. On the flip side, such user-level thread model is not suitable for parallel computing, since they are tied onto the same processor core to which their parent process is bound.

Linux doesn’t support user-level threads directly; it instead proposes an alternate API to enumerate a special process, called light weight process (LWP), that can share a set of configured resources such as dynamic memory allocations, global data, open files, signal handlers, and other extensive resources with the parent process. Each LWP is identified by a unique PID and task structure, and is treated by the kernel as an independent execution context. In Linux, the term thread invariably refers to LWP, since each thread initialized by the thread library (Pthreads) is enumerated as an LWP by the kernel.

clone()

clone() is a Linux-specific system call to create a new process; it is considered a generic version of the fork() system call, offering finer controls to customize its functionality through the flags argument:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

It provides more than twenty different CLONE_* flags that control various aspects of the clone operation, including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. The child is created with the appropriate memory address (passed as the second argument) to be used as the stack (for storing the child's local data). The child process starts its execution with its start function (passed as the first argument to the clone call).

When a process attempts to create a thread through the pthread library, clone() is invoked with the following flags:

/*clone flags for creating threads*/
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID;

The clone() can also be used to create a regular child process that is normally spawned using fork() and vfork():

/* clone flags for forking child */
flags = SIGCHLD;
/* clone flags for vfork child */
flags = CLONE_VFORK | CLONE_VM | SIGCHLD;

Kernel threads

To augment the need for running background operations, the kernel spawns threads (similar to processes). These kernel threads are similar to regular processes, in that they are represented by a task structure and assigned a PID. Unlike user processes, they do not have any address space mapped, and run exclusively in kernel mode, which makes them non-interactive. Various kernel subsystems use kthreads to run periodic and asynchronous operations.

All kernel threads are descendants of kthreadd (pid 2), which is spawned by the kernel (pid 0) during boot. The kthreadd enumerates other kernel threads; it provides interface routines through which other kernel threads can be dynamically spawned at runtime by kernel services. Kernel threads can be viewed from the command line with the ps -ef command--they are shown in [square brackets]:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 22:43 ? 00:00:01 /sbin/init splash
root 2 0 0 22:43 ? 00:00:00 [kthreadd]
root 3 2 0 22:43 ? 00:00:00 [ksoftirqd/0]
root 4 2 0 22:43 ? 00:00:00 [kworker/0:0]
root 5 2 0 22:43 ? 00:00:00 [kworker/0:0H]
root 7 2 0 22:43 ? 00:00:01 [rcu_sched]
root 8 2 0 22:43 ? 00:00:00 [rcu_bh]
root 9 2 0 22:43 ? 00:00:00 [migration/0]
root 10 2 0 22:43 ? 00:00:00 [watchdog/0]
root 11 2 0 22:43 ? 00:00:00 [watchdog/1]
root 12 2 0 22:43 ? 00:00:00 [migration/1]
root 13 2 0 22:43 ? 00:00:00 [ksoftirqd/1]
root 15 2 0 22:43 ? 00:00:00 [kworker/1:0H]
root 16 2 0 22:43 ? 00:00:00 [watchdog/2]
root 17 2 0 22:43 ? 00:00:00 [migration/2]
root 18 2 0 22:43 ? 00:00:00 [ksoftirqd/2]
root 20 2 0 22:43 ? 00:00:00 [kworker/2:0H]
root 21 2 0 22:43 ? 00:00:00 [watchdog/3]
root 22 2 0 22:43 ? 00:00:00 [migration/3]
root 23 2 0 22:43 ? 00:00:00 [ksoftirqd/3]
root 25 2 0 22:43 ? 00:00:00 [kworker/3:0H]
root 26 2 0 22:43 ? 00:00:00 [kdevtmpfs]
/*kthreadd creation code (init/main.c) */
static noinline void __ref rest_init(void)
{
int pid;

rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
complete(&kthreadd_done);

/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
init_idle_bootup_task(current);
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
}

The previous code shows the kernel boot routine rest_init() invoking the kernel_thread() routine with appropriate arguments to spawn both the kernel_init thread (which then goes on to start the user-mode init process) and kthreadd.

The kthread is a perpetually running thread that looks into a list called kthread_create_list for data on new kthreads to be created:

/*kthreadd routine(kthread.c) */
int kthreadd(void *unused)
{
struct task_struct *tsk = current;

/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_states[N_MEMORY]);

current->flags |= PF_NOFREEZE;

for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (list_empty(&kthread_create_list))
schedule();
__set_current_state(TASK_RUNNING);

spin_lock(&kthread_create_lock);
while (!list_empty(&kthread_create_list)) {
struct kthread_create_info *create;

create = list_entry(kthread_create_list.next,
struct kthread_create_info, list);
list_del_init(&create->list);
spin_unlock(&kthread_create_lock);

create_kthread(create); /* creates kernel threads with attributes enqueued */

spin_lock(&kthread_create_lock);
}
spin_unlock(&kthread_create_lock);
}

return 0;
}

Kernel threads are created by invoking either kthread_create or through its wrapper kthread_run by passing appropriate arguments that define the kthreadd (start routine, ARG data to start routine, and name). The following code snippet shows kthread_create invoking kthread_create_on_node(), which by default creates threads on the current Numa node:

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data,
int node,
const char namefmt[], ...);

/**
* kthread_create - create a kthread on the current node
* @threadfn: the function to run in the thread
* @data: data pointer for @threadfn()
* @namefmt: printf-style format string for the thread name
* @...: arguments for @namefmt.
*
* This macro will create a kthread on the current node, leaving it in
* the stopped state. This is just a helper for
* kthread_create_on_node();
* see the documentation there for more details.
*/
#define kthread_create(threadfn, data, namefmt, arg...)
kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)


struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
void *data,
unsigned int cpu,
const char *namefmt);

/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
* @data: data ptr for @threadfn.
* @namefmt: printf-style name for the thread.
*
* Description: Convenient wrapper for kthread_create() followed by
* wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
*/
#define kthread_run(threadfn, data, namefmt, ...)
({
struct task_struct *__k
= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__);
if (!IS_ERR(__k))
wake_up_process(__k);
__k;
})

kthread_create_on_node() instantiates details (received as arguments) of kthread to be created into a structure of type kthread_create_info and queues it at the tail of kthread_create_list. It then wakes up kthreadd and waits for thread creation to complete:

/* kernel/kthread.c */
static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
struct task_struct *task;
struct kthread_create_info *create = kmalloc(sizeof(*create),
GFP_KERNEL);

if (!create)
return ERR_PTR(-ENOMEM);
create->threadfn = threadfn;
create->data = data;
create->node = node;
create->done = &done;

spin_lock(&kthread_create_lock);
list_add_tail(&create->list, &kthread_create_list);
spin_unlock(&kthread_create_lock);

wake_up_process(kthreadd_task);
/*
* Wait for completion in killable state, for I might be chosen by
* the OOM killer while kthreadd is trying to allocate memory for
* new kernel thread.
*/
if (unlikely(wait_for_completion_killable(&done))) {
/*
* If I was SIGKILLed before kthreadd (or new kernel thread)
* calls complete(), leave the cleanup of this structure to
* that thread.
*/
if (xchg(&create->done, NULL))
return ERR_PTR(-EINTR);
/*
* kthreadd (or new kernel thread) will call complete()
* shortly.
*/
wait_for_completion(&done); // wakeup on completion of thread creation.
}
...
...
...
}

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
...)
{
struct task_struct *task;
va_list args;

va_start(args, namefmt);
task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
va_end(args);

return task;
}

Recall that kthreadd invokes the create_thread() routine to start kernel threads as per data queued into the list. This routine creates the thread and signals completion:

/* kernel/kthread.c */
static void create_kthread(struct kthread_create_info *create)
{
int pid;

#ifdef CONFIG_NUMA
current->pref_node_fork = create->node;
#endif

/* We want our own signal handler (we take no signals by default). */
pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES |
SIGCHLD);

if (pid < 0) {
/* If user was SIGKILLed, I release the structure. */
struct completion *done = xchg(&create->done, NULL);

if (!done) {
kfree(create);
return;
}
create->result = ERR_PTR(pid);
complete(done); /* signal completion of thread creation */
}
}

do_fork() and copy_process()

All of the process/thread creation calls discussed so far invoke different system calls (except create_thread) to step into kernel mode. All of those system calls in turn converge into the common kernel function _do_fork(), which is invoked with distinct CLONE_* flags. do_fork() internally falls back on copy_process() to complete the task. The following figure sums up the call sequence for process creation:

/* kernel/fork.c */
/*
* Create a kernel thread.
*/ pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
(unsigned long)arg, NULL, NULL, 0);
}

/* sys_fork: create a child process by duplicating caller */
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
/* cannot support in nommu mode */
return -EINVAL;
#endif
}

/* sys_vfork: create vfork child process */
SYSCALL_DEFINE0(vfork)
{
return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
0, NULL, NULL, 0);
}

/* sys_clone: create child process as per clone flags */

#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
unsigned long, tls,
int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
int, stack_size,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#endif
{
return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
#endif

Process status and termination

During the lifetime of a process, it traverses through many states before it ultimately terminates. Users must have proper mechanisms to be updated with all that happens to a process during its lifetime. Linux provides a set of functions for this purpose.

wait

For processes and threads created by a parent, it might be functionally useful for the parent to know the execution status of the child process/thread. This can be achieved using the wait family of system calls:

#include <sys/types.h>
#include <sys/wait.h>
pid_t wait(int *status);
pid_t waitpid(pid_t pid, int *status, intoptions);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options)

These system calls update the calling process with the state change events of a child. The following state change events are notified:

  • Termination of child
  • Stopped by a signal
  • Resumed by a signal

In addition to reporting the status, these APIs allow the parent process to reap a terminated child. A process on termination is put into zombie state until the immediate parent engages the wait call to reap it.

exit

Every process must end. Process termination is done either by the process calling exit() or when the main function returns. A process may also be terminated abruptly on receiving a signal or exception that forces it to terminate, such as the KILL command, which sends a signal to kill the process, or when an exception is raised. Upon termination, the process is put into exit state until the immediate parent reaps it.

The exit calls the sys_exit system call, which internally calls the do_exit routine. The do_exit primarily performs the following tasks (do_exit sets many values and makes multiple calls to related kernel routines to complete its task):

  • Takes the exit code returned by the child to the parent.
  • Sets the PF_EXITING flag, indicating process exiting.
  • Cleans up and reclaims the resources held by the process. This includes releasing mm_struct, removal from the queue if it is waiting for an IPC semaphore, release of filesystem data and files, if any, and calling schedule() as the process is no longer executable.

After do_exit, the process remains in zombie state and the process descriptor is still intact for the parent to collect the status, after which the resources are reclaimed by the system.

Namespaces and cgroups

Users logged into a Linux system have a transparent view of various system entities such as global resources, processes, kernel, and users. For instance, a valid user can access PIDs of all running processes on the system (irrespective of the user to which they belong). Users can observe the presence of other users on the system, and they can run commands to view the state of global system global resources such as memory, filesystem mounts, and devices. Such operations are not deemed as intrusions or considered security breaches, as it is always guaranteed that one user/process can never intrude into other user/process.

However, such transparency is unwarranted on a few server platforms. For instance, consider cloud service providers offering PaaS (platform as a service). They offer an environment to host and deploy custom client applications. They manage runtime, storage, operating system, middleware, and networking services, leaving customers to manage their applications and data. PaaS services are used by various e-commerce, financial, online gaming, and other related enterprises.

For efficient and effective isolation and resource management for clients, PaaS service providers use various tools. They virtualize the system environment for each client to achieve security, reliability, and robustness. The Linux kernel provides low-level mechanisms in the form of cgroups and namespaces for building various lightweight tools that can virtualize the system environment. Docker is one such framework that builds on cgroups and namespaces.

Namespaces fundamentally are mechanisms to abstract, isolate, and limit the visibility that a group of processes has over various system entities such as process trees, network interfaces, user IDs, and filesystem mounts. Namespaces are categorized into several groups, which we will now see.

Mount namespaces

Traditionally, mount and unmount operations will change the filesystem view as seen by all processes in the system; in other words, there is one global mount namespace seen by all processes. The mount namespaces confine the set of filesystem mount points visible within a process namespace, enabling one process group in a mount namespace to have an exclusive view of the filesystem list compared to another process.

UTS namespaces

These enable isolating the system's host and domain name within a uts namespace. This makes initialization and configuration scripts able to be guided based on the respective namespaces.

IPC namespaces

These demarcate processes from using System V and POSIX message queues. This prevents one process from an ipc namespace accessing the resources of another.

PID namespaces

Traditionally, *nix kernels (including Linux) spawn the init process with PID 1 during system boot, which in turn starts other user-mode processes and is considered the root of the process tree (all the other processes start below this process in the tree). The PID namespace allows a process to spin off a new tree of processes under it with its own root process (PID 1 process). PID namespaces isolate process ID numbers, and allow duplication of PID numbers across different PID namespaces, which means that processes in different PID namespaces can have the same process ID. The process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1.

PID namespaces are used in containers (lightweight virtualization solution) to migrate a container with a process tree, onto a different host system without any changes to PIDs.

Network namespaces

This type of namespace provides abstraction and virtualization of network protocol services and interfaces. Each network namespace will have its own network device instances that can be configured with individual network addresses. Isolation is enabled for other network services: routing table, port number, and so on.

User namespaces

User namespaces allow a process to use unique user and group IDs within and outside a namespace. This means that a process can use privileged user and group IDs (zero) within a user namespace and continue with non-zero user and group IDs outside the namespace.

Cgroup namespaces

A cgroup namespace virtualizes the contents of the /proc/self/cgroup file. Processes inside a cgroup namespace are only able to view paths relative to their namespace root.

Control groups (cgroups)

Cgroups are kernel mechanisms to restrict and measure resource allocations to each process group. Using cgroups, you can allocate resources such as CPU time, network, and memory.

Similar to the process model in Linux, where each process is a child to a parent and relatively descends from the init process thus forming a single-tree like structure, cgroups are hierarchical, where child cgroups inherit the attributes of the parent, but what makes is different is that multiple cgroup hierarchies can exist within a single system, with each having distinct resource prerogatives.

Applying cgroups on namespaces results in isolation of processes into containers within a system, where resources are managed distinctly. Each container is a lightweight virtual machine, all of which run as individual entities and are oblivious of other entities within the same system.

The following are namespace APIs described in the Linux man page for namespaces:

clone(2)
The clone(2) system call creates a new process. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the child process is made a member of those namespaces.(This system call also implements a number of features unrelated to namespaces.)

setns(2)
The setns(2) system call allows the calling process to join an existing namespace. The namespace to join is specified via a file descriptor that refers to one of the /proc/[pid]/ns files described below.

unshare(2)
The unshare(2) system call moves the calling process to a new namespace. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the calling process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)
Namespace Constant Isolates
Cgroup CLONE_NEWCGROUP Cgroup root directory
IPC CLONE_NEWIPC System V IPC, POSIX message queues
Network CLONE_NEWNET Network devices, stacks, ports, etc.
Mount CLONE_NEWNS Mount points
PID CLONE_NEWPID Process IDs
User CLONE_NEWUSER User and group IDs
UTS CLONE_NEWUTS Hostname and NIS domain name

Summary

We understood one of the principal abstractions of Linux called the process, and the whole ecosystem that facilitates this abstraction. The challenge now remains in running the scores of processes by providing fair CPU time. With many-core systems imposing a multitude of processes with diverse policies and priorities, the need for deterministic scheduling is paramount.

In our next chapter, we will delve into process scheduling, another critical aspect of process management, and comprehend how the Linux scheduler is designed to handle this diversity.

Left arrow icon Right arrow icon

Key benefits

  • ? Master the design, components, and structures of core kernel subsystems
  • ? Explore kernel programming interfaces and related algorithms under the hood
  • ? Completely updated material for the 4.12.10 kernel

Description

Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design. This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation of the implementation details of the kernel’s core services and mechanisms. You will also look at the Linux kernel as well-designed software, which gives us insights into software design in general that are easily scalable yet fundamentally strong and safe. By the end of this book, you will have considerable understanding of and appreciation for the Linux kernel.

Who is this book for?

If you are a kernel programmer with a knowledge of kernel APIs and are looking to build a comprehensive understanding, and eager to explore the implementation, of kernel subsystems, this book is for you. It sets out to unravel the underlying details of kernel APIs and data structures, piercing through the complex kernel layers and gives you the edge you need to take your skills to the next level.

What you will learn

  • • Comprehend processes and files—the core abstraction mechanisms of the Linux kernel that promote effective simplification and dynamism
  • • Decipher process scheduling and understand effective capacity utilization under general and real-time dispositions
  • • Simplify and learn more about process communication techniques through signals and IPC mechanisms
  • • Capture the rudiments of memory by grasping the key concepts and principles of physical and virtual memory management
  • • Take a sharp and precise look at all the key aspects of interrupt management and the clock subsystem
  • • Understand concurrent execution on SMP platforms through kernel synchronization and locking techniques
Estimated delivery fee Deliver to Ecuador

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Oct 11, 2017
Length: 354 pages
Edition : 1st
Language : English
ISBN-13 : 9781785883057
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Ecuador

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Publication date : Oct 11, 2017
Length: 354 pages
Edition : 1st
Language : English
ISBN-13 : 9781785883057
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 158.97
Mastering Embedded Linux Programming
$54.99
Linux Device Drivers Development
$48.99
Mastering Linux Kernel Development
$54.99
Total $ 158.97 Stars icon
Banner background image

Table of Contents

11 Chapters
Comprehending Processes, Address Space, and Threads Chevron down icon Chevron up icon
Deciphering the Process Scheduler Chevron down icon Chevron up icon
Signal Management Chevron down icon Chevron up icon
Memory Management and Allocators Chevron down icon Chevron up icon
Filesystems and File I/O Chevron down icon Chevron up icon
Interprocess Communication Chevron down icon Chevron up icon
Virtual Memory Management Chevron down icon Chevron up icon
Kernel Synchronization and Locking Chevron down icon Chevron up icon
Interrupts and Deferred Work Chevron down icon Chevron up icon
Clock and Time Management Chevron down icon Chevron up icon
Module Management Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.9
(10 Ratings)
5 star 30%
4 star 10%
3 star 20%
2 star 0%
1 star 40%
Filter icon Filter
Top Reviews

Filter reviews by




Sachin Ahirkar Nov 13, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The best thing about this book is the language used. Any person having some prior knowledge of Linux can use this book for exploring kernel in a deep way. Well explained topics with their structural information and by using diagrams so that it makes easy to get a picture of everything. Big thanks to the author.
Amazon Verified review Amazon
Sarfaraz Nov 29, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
If you understand a topic then you will be able to explain it to others in a simplest way, this is true for the book”Mastering Linux Kernel Development”. The most important thing is that this book covers kernel 4.12, so this is one of the books that cover latest version of the kernel. Similarly topics like synchronization, IPC, DFC, memory management are covered properly in this book and explained with examples. This book does not cover the concept of LDD, and can be understood from the fact that it is kernel book and not a device driver book. And concept presented in this book will help in designing the LDD more effectively. All together I am satisfied with this book and recommend other to buy one.This is one of the excellent book on linux kernel.
Amazon Verified review Amazon
Stergios Papadimitriou Aug 30, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Although I have excellent books for the Linux kernel, as the "Linux kernel Development" of Robert Love,"Understanding the Linux kernel" of Daniel P. Bovet and Marco Cesati,these refer to rather old Linux kernel versions.The present book covers the new Linux kernel version, and recent improvements as e.g. the Completely Fair Schedulerare described very well.The book succeeds in covering essential material for the Linux kernel, avoiding to drift away in not so important details,the level of presentation detail is about similar to the rather classic "Linux Kernel Development" book of Robert Love.Although the style and quality of presentation is very good, there are a few grammatical and syntactical errors.However, the technical English text of the book is already very readable, and all the explanations and descriptions that the author supplies are quite understandable.Personally, although I have long experience with the Linux kernel,I gained new insights and knowledge for the Linux kernel from the book,and I continue to improve my expertise by studying it.Therefore, I think the book will be very useful and instructive and for other readers,this is way I rated it finally with 5 stars!
Amazon Verified review Amazon
Amazon Customer Feb 12, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
The book gives a nice insight of the linux kernel. The main edge of this book is that it covers 4.1 kernel which is almost the latest version
Amazon Verified review Amazon
Izzy Apr 16, 2018
Full star icon Full star icon Full star icon Empty star icon Empty star icon 3
Kindle formatting is very poor. I hope printed version is better. Returned for a refund.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela