Brush up OS

This post aims to prepare myself for the upcoming CS380L Advanced Operating Systems offered by Christopher J. Rossbach. The questions are actually from his HW1 aka. "Swapping in the state from undergraduate OS".

Brush up

Brush up

Definitions

Define the following terms:

Internal and external fragmentation. Which of them can occur in a paging system? A system with pure segmentation?

Fragmentation happens when we talk about memory allocation (e.g., user requests memory through malloc(); OS manages physical memory when using segmentation to implement virtual memory). There are two types of fragmentation:

internal fragmentation: unused memory within a unit of allocation: if an allocator hands out chunks of memory bigger than that requested, any unasked for (and thus unused) space in such a chunk is considered internal fragmentation (because the waste occurs inside the allocated unit). Metaphorically speaking, internal fragmentation happens when a party of three at a table for four.
external fragmentation: unused memory between units of allocation: the free space gets chopped into little pieces of different sizes and is thus fragmented; subsequent requests may fail because there is no single contiguous space that can satisfy the request, even though the total amount of free space exceeds the size of the request. Metaphorically speaking, external fragmentation happens when two fixed tables for two but a party of four.

In a paging system, since the free space is divided into fixed-size units (i.e., pages), no external fragmentation can happen: a page request will always get satisified as all page size are equal; we can simply return one of them. However, internal fragmentation can still happen. For example, if the page size is 4KB and we request 3KB memory, then there will be 1KB unused memory within the returned page.

In a system with pure segmentation, only extenral fragmentations can happen. Since segments are variable-sized units, external fragmentation cannot be avoided. However, the system can give out exact size of requested memory (no paging exists), then there is no internal fragmentation.

Note

OSTEP Chapter 16 and Chapter 17 offer more on segmentation, which causes fragmentation, and free-space management that handles external fragmentation issue.

Translation look-aside buffer (TLB).

TLB is part of the chip's memory-management unit (MMU). It serves as a hardware cache of popular virtual-to-physical address translations (i.e., virtual page number to physical frame number translations)¹.

Interrupt.

Interrupts are essentially requests for attention. In the same way, peripherals in a computer system can request the attention of the processor. The event that makes a microprocessor stop executing one routine to perform some other routine to service a request, is called an INTERRUPT². In other words, an interrupt is an electrical signal that causes the processor to stop the current execution and begin to execute interrupt handler code. Interrupts are asynchronous excpetions. Asynchronous here means "not related to instruction that just executed"³.

Distributed shared memory.

Distributed shared memory is a memory architecture that a cluster of nodes share the same logical address space but physical memories are from each node in the cluster and connected via the network (i.e., a form of memory architecture where physically separated memories can be addressed as one logically shared address space) ⁴

Stateful and stateless servers. Also, list two file systems, one that is stateful and one that is stateless, and explain how having or not having state affects file service.

Stateless server means the server does not keep track of anything about what is happening at each client (e.g., server does not maintain which client open what file on its side). Stateful server is the opposite of the stateless server. For example, server keeps track of what client accesses what files so that when the server's file copy is modified (e.g., by some other clients who also access the same file), server can inform clients to invalidate their cache of the same file.

A classic stateless file system is NFS with NFSv2 protocol (NFSv4 protocol makes NFS become a stateful server as well). AFS is a stateful file system.

Not having state makes crash recovery on the server side fast: server doesn't need to recover any client-side information (thanks to stateless) and on server failure, clients can simply retry the request. However, with stateless, clients need to constantly contact server to validate their cache, which impose extra load to server. Furthermore, clients need to pass extra information (e.g., file handle) when perform system calls, which impose extra overhead to clients. Having state complicates the crash recovery: server needs to initiate cache validation to the clients; clients also need to send out heartbeat message to server. However, with state, client doesn't need to constantly contact to validate its cache (wait for server's callback).

Swapping.

Since all pages cannot reside in the physical memory, to support large address space, we use part of disk (e.g., swap space) by swapping pages in and out between physical memory and swap space ⁵.

Inverted page table.

Inverted page table (along with multi-level page tables) aims to save memory space taken by the page tables. Instead of having one page table per process, we keep a single page table that has an entry for each physical page of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page ⁶. PowerPC uses this technique.

Disk sector, track, and cylinder.

A disk sector refers to 512-byte block on the disk (aside note, drive manufacturers guarantee that a single 512-byte write is atomic). Sectors of disk are organized as a set of concentric circles; each concentric circle of sectors is called a track ⁷. A cylinder is a set of tracks on different surfaces of a hard dirve that are the same distance from the center of the drive; it is called a cylinder because of its clear resemblance to the so-called geometrical shape ⁸.

Thrashing.

The situation when memory demands of the set of running processes exceeds the available physical memory. System performace degrades due to the fact that system CPU time is dominated by the activity of swaping pages in and out between physical memory and swap space on disk ⁹.

Short answer

Provide a short answer to each of the following questions:

Give a reason why virtual memory address translation is useful even if the total size of virtual memory (summed over all programs) is guaranteed to be smaller than physical memory.

Virtual memory address translation is useful in providing isolation (safety) across different processes: OS can make sure that one proacess cannot access part of physical memory that is owned by the other process. In addition, address translation helps OS to better manage the underlying physical memory: every process thinks address space starts at 0, which may not be true on the physical memory; for better memory management, OS may relocate the address space to some other physical address. Address translation enables OS to perform such memory management while makes the relocation be transparent to process ¹⁰.

Compare and contrast access control lists and capabilities.

In an access control list model of security, the authorities are bounded to the objects being secured (e.g., file A can be r/w by user Alice). By contrast, in the capabilities model, the authorities are bound to objects seeking access ¹¹ (e.g., UNIX file descriptors are capabilities: authorities are bounded to file descriptors, which reflect rights on objects like files or sockets ¹²).

As one can see, with the ACL model, each object has just one list, while with the capability model, each object has a whole set of different, separable capabilities. For example, in capability model, a process has different capabilities depending on the objects it tries to access.

This page has a nice example: file "aaa" has ACL as "Alice:R/W, Bob:R, Carol:R". File "aaa" is the object being secured. "Alice" has capabilities "aaa:R/W, bbb:R, ccc:R". "Alice" as a user is the object seeking access.

The length of the time slice is a parameter to round robin CPU scheduling. What is the main problem that occurs if this length is too long? Too short?

There are two important metrics in scheduling: turnaround time and response time. turnaround time of a job is defined as \(T_{completion} - T_{arrival}\) (the time at which the job completes minus the time at which the job arrived in the system). response time is defined as \(T_{firstrun} - T_{arrival}\) (the time from when the job arrives in a system to the first time it is scheduled). Round robin (RR) CPU scheduling is optimized towards response time. If time slice is too long, then the response time is worse. By contrast, if time slice is too short, the cost of context switching will dominate overall performance, which means less work is done during the time slice ¹³.

List an advantage and a disadvantage to increasing the virtual memory page size.

The advantages of increasing the virtual memory page size are: reduces page table size ¹⁴, improves TLB hit rate ¹⁵. The disadvantage is the internal fragmentation (i.e., waste within each page; the waster is internal to the unit of allocation) ⁶.

Virtual memory addressing

Suppose we have a machine that uses a three-level page table system. A 32-bit virtual address is divided into four fields of widths a, b, c, and d bits from left to right. The first three are indices into the three levels of page table; the fourth, d, is the offset. Express the number of virtual pages available as a function of a, b, c, and d. Give one advantage and one disadvantage to multi-level page tables.

Since we have a three-level page table system, then \(a\) corresponds to Page Directory Index (PDI) 0, \(b\) corresponds to PDI 1, \(c\) corresponds to Page Table Index. Each page contains \(2^c\) Page Table Entries (PTE). Page Directory 1 needs one entry per page and contains \(2^b\) entries. Similarly, Page Directory 0 needs one entry per page, which contains one Page Directory 1 and there are \(2^a\) entries. Thus, there are \(2^a\) Page Directory 1, and each Page Directory 1 can point to \(2^b\) pages, and each page contains \(2^c\) PTEs, and each PTE corresponds to one virtual page. Therefore, there are \(2^{a+b+c}\) virutal pages. Another way of cacluation is that the physical frame size is \(2^d\), which is equal to page size. Since the address space is 32 bit, there are \(\frac{2^{32}}{2^d}\) virtual pages ¹⁶.

The advantages of multi-level page tables are that:

The multi-level table only allocates page-table space in proportion to the amount of address space we are using (more compact to support sparse page table)
each portion of page table fits into a page makes memory management easier: we don't have to find contiguous physical memory chunk that can contain the linear page table; we can place page-table pages whereever we want in the physical memory.

The disadvantage is that additional memory accesses to look up a valid translation (e.g., for 3-level page table system, we have three additional memory accesses: access Page Directory Entry 0 (PDE0), access PDE1, access PTE)¹⁷.

Page replacement

Suppose a machine with 4 physical pages starts running a program (in other words, the physical pages are initially empty). The program references the sequence of virtual pages as follows: A B C D E D C B A E D C B A C E For each of the following paging algorithms, replicate the reference pattern and underline each reference that causes a page fault (or make references that cause a page fault uppercase, and those that don’t lowercase):

LRU.

LRU stands for Least-Recently-Used, which is one of the page replacement policies (in practice, we use clock algorithm to approximate LRU ¹⁸ to avoid heavy overhead). The trace of the memory references with LRU policy shown below:

Access	Hit/Miss	Evict	Resulting Cache State (LRU at front, MRU at tail)
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,E,D
C	Hit		B,E,D,C
B	Hit		E,D,C,B
A	Miss	E	D,C,B,A
E	Miss	D	C,B,A,E
D	Miss	C	B,A,E,D
C	Miss	B	A,E,D,C
B	Miss	A	E,D,C,B
A	Miss	E	D,C,B,A
C	Hit		D,B,A,C
E	Miss	D	B,A,C,E

From the table, we have A,B,C,D,E,d,c,b,A,E,D,C,B,A,c,E.

FIFO.

FIFO means first-in, first-out: pages are placed in a queue as the order they are brought into physical memory; when replacement occurs, the first-in page is evicted. The similar trace of memory references with FIFO policy shown below:

Access	Hit/Miss	Evict	Resulting Cache State
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,D,E
C	Hit		B,C,D,E
B	Hit		B,C,D,E
A	Miss	B	C,D,E,A
E	Hit		C,D,E,A
D	Hit		C,D,E,A
C	Hit		C,D,E,A
B	Miss	C	D,E,A,B
A	Hit		D,E,A,B
C	Miss	D	E,A,B,C
E	Hit		E,A,B,C

Thus, we have A,B,C,D,E,d,c,b,A,e,d,c,B,a,C,e.

Optimum.

Optimal policy means we want to replace the page that will be accessed furthest in the future and doing so will result in the optimal policy with fewest possible cache misses. The resulting trace with optimal policy follows:

Access	Hit/Miss	Evict	Resulting Cache State
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,D,E
C	Hit		B,C,D,E
B	Hit		B,C,D,E
A	Miss	B	C,D,E,A
E	Hit		C,D,E,A
D	Hit		C,D,E,A
C	Hit		C,D,E,A
B	Miss	D	C,E,A,B
A	Hit		C,E,A,B
C	Hit		C,E,A,B
E	Hit		C,E,A,B

We have A,B,C,D,E,d,c,b,A,e,d,c,B,a,c,e.

Multiprocessing

Suppose you have a large source program containing m files that you want to compile. You have a cluster of n “shared-nothing” workstations, where n > m, on which you may compile your files. At best you will get an m-fold speedup compared to a single processor. List at least three reasons as to why the actual speedup might be less than this.

Interconnection overhead. To perform distributed compilation, there are network communication cost and coordination cost about assigning files to different workstations, and communication across machines during the compilation (e.g., A file on one workstation may have dependencies of the files assigned to other machines. To make the file into object code file, file copies from other machines are needed).
Computation graph may have dependency that is not parallelizable (e.g., linking).
Resources limitation. We have a cluster of n workstations but we may not have fully-usage of the underlying resources (i.e., our jobs may be subject to cluster coordinator scheduling).
File size are unequal. Suppose we have 5 files to compile: first 4 need 1s compilation time while the last one needs 10 minutes. Then, whether we perform distributed compilation or not doesn't make difference: the last file will make become bottleneck and we cannot get 5-fold speedup.

Achieving fast file reads

(a) Give at least three strategies that a file system can employ to reduce the time a program spends waiting for data reads to complete. (b) For each strategy you listed, describe a read pattern for which the strategy would do well, and one for which the strategy would do poorly

Use cache (e.g., buffer cache, page cache of the virtual memory) to pre-read the requested data block and several subsequent data blocks at the same time on open(). Then, data can be read from cache in memory instead of disk to avoid expensive disk I/O.
- Sequential read performs well
- Random read performs poorly
We can compress the data when we store them at the first place. By reducing the size of file, we can reduce the amount of disk I/O done for reading the file and spend more CPU time to uncompress the data, which is naturally faster than doing disk I/O instead.
- Big read performs well
- Small read performs poorly
We can use RAID Level 0 (i.e., striping) to spread the data block across the disks in a round-robin fashion. When we read the data, we can utilize multiple disks in parallel.
- Sequential read performs well
- Random read performs poorly
We can place file's inode close to file's data blocks so that during data read, we can reduce seek and rotational delay costs of HDD.
- Sequential read performs well
- Random read performs poorly
Prefetching: we can try to predict what data/files the user will read next and buffer the file content into memory.
- Sequential read performs well
- Random read performs poorly
Build index to the most visited files (i.e., direct pointer to the files) so that we don't have to traverse internal data structure of file system (i.e., similar to TLB).
- Read hot data performs well
- Read cold data performs poorly

Synchronization

Your OS has a set of queues, each of which is protected by a lock. To enqueue or dequeue an item, a thread must hold the lock associated with the queue. You need to implement an atomic transfer routine that dequeues an item from one queue and enqueues it on another. The transfer must appear to occur atomically. This is your first attempt:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
    queue1->lock.Release();
}

You may assume that queue1 and queue2 never refer to the same queue. Also, assume that you have a function Queue::Address() which takes a queue and returns, as an unsigned integer, its address.

(a) Explain how using this implementation of transfer() can lead to deadlock

One possible scenario of deadlock using transfer() is illustrated below. Note that queue1 and queue2 in transfer() are replaced with queues that they are pointing to (time increases with the row's number):

Thread1 (`transfer(q1,q2)`)	Thread2 (`transfer(q2,q1)`)
`q1->lock.Acquire();`
`thing = q1->Dequeue();`
`if (thing != NULL)`
interrupt: switch to Thread2
	`q2->lock.Acquire();`
	`thing = q2->Dequeue();`
	`if (thing != NULL)`
	`q1->lock.Acquire(); // block & wait on q1's lock`
	interrupt: switch to Thread1
`q2->lock.Acquire(); // block & wait on q2's lock`

(b) Write a modified version of transfer() that avoids deadlock and does the transfer atomically.

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    pthread_mutex_lock(&m);
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
    queue1->lock.Release();
    pthread_mutex_unlock(&m);
}

Alternatively, we can do:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    pthread_mutex_lock(&m);
    while (queue2->lock.available())
    {
      queue2->lock.Acquire();
    }
    queue1->lock.Acquire();
    pthread_mutex_unlock(&m);
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
    queue1->lock.Release();
}

(c) If the transfer does not need to be atomic, how might you change your solution to achieve a higher degree of concurrency? Justify why your modification increases concurrency.

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock1.Acquire();  /* lock1 for Dequeue() */
    thing = queue1->Dequeue();
    queue1->lock1.Release(); 
    if(thing != NULL) {
        queue2->lock2.Acquire(); /* lock2 for Enqueue() */
        queue2->Enqueue(thing);
        queue2->lock2.Release();
    }
}

It's more fine-grained than the previous approach because compared to previous approach which only one thread can touch one queue, two threads can touch on the same queue: one for enqueue and one for dequeue.

Alternatively, we can do:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    queue1->lock.Release();    
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
}

The length of time we hold lock is shorter (i.e., at any time, the function is holding only one lock).

Networking

You are developing a network protocol for the reliable delivery of fixed-sized messages over unreliable networks. You are using a sequence number in each message to allow the receiver to eliminate duplicates, but you still have three design alternatives to consider. The design alternatives are: (1) the sender must receive an acknowledgement for the previously sent message before it can send the next message in the sequence, (2) the sender can transmit up to n unacknowledged messages, but the receiver will discard any messages that are received out of sequence (in other words, it will only acknowledge a message if it is received in sequence), and (3) the sender can transmit up to n unacknowledged messages, and the receiver will acknowledge each on receipt, even if they arrive out of order. For each alternative, answer the following questions

(a) Explain what state the receiver must keep around to implement each of the three alternatives (remember, the receiver must be able to detect and discard duplicates).

For 1), we keep track of expected sequence # (or last received sequence #)
For 2), we keep track of expected sequence # (or last received sequence #)
For 3), we keep track of a buffer with size \(n\) and make sure there is no duplicates by sorting them. If there is a duplicate, the ack will be sent with the missing sequence # back to sender. The receiver only increments the lower bound sequence # of buffer when \(n\) messages within the buffer satisfying the requirement (invariant: all the sequence number smaller than lower bound is well-ordered and no missing).

(b) Suppose two machines are communicating across a bi-directional network link with fixed-size messages. Acknowledgement packets are also fixed-size. One machine, the sender, is transmitting a very large file to the other machine, the receiver. That is, the sender is transmitting a large number of messages that make up this file. Discuss briefly how each alternative’s data bandwidth is expected to vary given different underlying network characteristics.

If there is no loss of the packets during the transit, (3) has the highest data bandwidth as there are maximum amount of packets in the flight and receiver will acknowledge even the packets are out of order. By contrast, (1) has the lowest data bandwidth due to sender can only send out a packet when an ack is received. If loss is seldom, (3) has highest data bandwidth. If there is loss, the more servere the loss, the more bandwidth (1) can achieve compared to others. (3) is worst in this case because (3) performs lots of useless work (send out n packets but get dropped anyway; leads to congestion).

When latency is small, (3) is the most effective; (1) is the worst. When latency is large, (1) is the most effective; (2) is the worst because of possible congestions and the order of packets is very likely be out-of-order, and then discarded by receiver, and sender will need to resend again.

When everything is good, (2) and (3) are similar.

Serving multiple clients

There are two main approaches to organizing a server daemon, such as a web server: a) Create a new kernel thread for each client (for each web browser connection); b) Use a single process responding to all clients, usually based on the select() system call. Compare and contrast these two approaches.

Multi-threaded architecture ¹⁹:

A pool of worker threads handle incoming requests
One worker thread usually corresponds to one connection
A dispatcher thread blocks socket on connection and once establised, the connection is passed to a queue of connections where worker threads can take connections from the queue and handle the request
Queue of connections is bounded: awaiting connections to be handled is limited; extra connections will be rejected
Predicatable latency and prevents total overload
Can utilize multiple CPU cores (*)
A synchronous, blocking I/O architecture

Event-based architecture:

asyncronous, non-blocking
one process (thread) handles multiple connections
Events are queued and will be dequeued by the event loop (single thread)
Each event has corresponding event handler
Usually the event is handled in a cascade of callbacks
Depends on the implementation, the event can be handled in the same thread as the event loop (i.e., event loop will be blocked when handle the event) or a pool of threads will be used for event handler
Compared to multi-threaded architecture, number of threads used is reduced (consequently, get rid of excessive context switching overhead, no thread stack for each connection); event-based architecture scales with increasing throughput; latency of requests increases linearly when overload

The key difference between these two architectures is who do the scheduling: in multi-threaded architecutre, OS performs scheduling by performing context-switch of different worker threads. However, for event-based architecture, event loop thread essentially works as a scheduler: multiplexing multiple connections to a single flow of execution. In addition, for event-based architecture, we need to implement preemptive interrupt on event handler thread pool as well.

In addition, the kernel threads approach can have easily lead to system crash due to all the work is in kernel-space. In addition, kernel thread has higher context-switch overhead compared to threads within single process of the second approach ¹². However, kernel threads can scale up with number of CPU cores while single process approach cannot benefit from multi-core.

taken from OSTEP Chapter 19 ↩
taken from Phil Storrs PC Hardware book ↩
Alison's CS439 Lec 02 and CSAPP (2nd edition) 8.1.2 ↩
See Protic et al., 1996 and Wikipedia ↩
See OSTEP Chapter 21 ↩
taken from OSTEP Chapter 20. More details found in Alison's CS439 Lec 12 "Virtual Memory: Mechanisms" ↩↩
See OSTEP Chapter 37 ↩
See OSTEP Figure in Chapter 41.3 ↩
The term is originally defined in Denning, 1968. But I find this article's explanation is more intuitive. ↩
See OSTEP Chapter 15 for more details. ↩
See Capabilities and ACLs ↩
See Shapiro et al., 1999 and Watson et al., 2010 ↩↩
See OSTEP Chapter 7. ↩
One cacluation example can be found from OSTEP Chapter 20.1. ↩
See OSTEP Chapter 19.2 example. ↩
OSTEP Chapter 20, Chapter 18, and this page may help with cacluation. ↩
See OSTEP Chapter Figure 20.6. ↩
See OSTEP Chapter 22 ↩
See Concurrent Programming for Scalable Web Architectures ↩
See CSE451 ↩

Peter J. Denning. Thrashing: its causes and prevention. In Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, AFIPS '68 (Fall, part I), 915–922. New York, NY, USA, 1968. ACM. URL: http://doi.acm.org.ezproxy.lib.utexas.edu/10.1145/1476589.1476705, doi:10.1145/1476589.1476705. ↩

Jelica Protic, Milo Tomasevic, and Veljko Milutinovic. Distributed shared memory: concepts and systems. IEEE Parallel Distrib. Technol., 4(2):63–79, June 1996. URL: http://dx.doi.org/10.1109/88.494605, doi:10.1109/88.494605. ↩

Jonathan S Shapiro, Jonathan M Smith, and David J Farber. EROS: a fast capability system. Volume 33. ACM, 1999. ↩

Robert NM Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. Capsicum: practical capabilities for unix. In USENIX Security Symposium, volume 46, 2. 2010. ↩