Zeyuan Hu's page

Looking for Collaborators

2024-03-24T20:11:00+08:00

Motivation
What I'm looking for

Motivation

Quite often, I stumble upon personal websites from academia with a sentence like "I'm always in search of collaborators". I figured I might try writing a sentence like that somewhere on my page as well. The most important reason for me is to have a chance to share experience and make friends along the way ¹; to people in academia, what's more fun than talking about your research and working on some exciting ideas together? One question I have whenever I read a sentence like the one above is: how many people actually reach out? Does it work at all? I'm curious to find out myself.

What I'm looking for

"Collaborator" may sound too serious. It's great if we have something to work on together, but I certainly don't want to turn away people who may not know what "collaborator" means or who are too afraid to reach out and see what might happen. So, I plan to go the extra mile here and spell out what I mean by the sentence "I'm always in search of collaborators".

I'm looking for someone I can talk about research with. The topic can range from the metadata of research (research process, research stories, etc.) to concrete substance (actual research problems, details of some papers, or technical matters). For metadata, I don't have any preference; we can talk about anything you want to share or anything you want me to share. Effectively, we're in the same therapy or support group. For the concrete substance, I do have a preference because my knowledge is limited to a few areas; sure, I'm okay with general computer science background but, to go into depth, I'm more comfortable talking about databases (query processing, query optimization) and mechanism design (especially without money).
If you are serious about collaborating on a research project, I hope you mean it. I usually work on at most two projects: one is my ongoing project at school and the other is the one we work on together. Once I decide to investigate something, I'm usually very serious about it: I spend lots of time thinking and investigating until I reach some form of closure (hopefully, my other posts make the point). I expect the same from you. It's generally okay if you don't have anything concrete to work on, as long as you are accountable and truly want to do something together.
Please be patient. I hope you understand that I'm still learning to do research. If you're looking for someone who can guide you and crank out a paper in a short period of time, I'm not your guy. If you already have something and for some reason want my help, please be patient as I'm not a quick-reaction person; I need time to understand the problem thoroughly. But don't worry, you won't feel like you are flying in the dark because I like to communicate very much; that's the key point of me looking for collaborators. I used to lead a reading group for undergraduates to give them some research experience. I quickly found that they usually lacked patience because the project seemed to be going nowhere and the feedback loop was less obvious than finishing homework with clearly marked grades. So, if you feel lost, don't worry; we'll learn something along the way. Please let me know explicitly when you want to stop; being accountable, remember?
You can check out my hobbies on the about page if you think those are good indicators of whether we are a good match.

If the idea looks promising to you, please send me an email or send me a request on Discord (my username is .zazk). We can go from there.

You can think of it as a way to overcome loneliness in graduate school. ↩

Release of sphinxcontrib-pseudocode

2021-12-21T02:11:00+08:00

To celebrate the release of sphinxcontrib-pseudocode, the first sphinx-doc extension I have ever written, I document some implementation details behind this extension.

Introduction
Background
Implementation Details
Development Environment Setup
Conclusion

Introduction

sphinxcontrib-pseudocode allows one to write $\LaTeX$ algorithms (using the algpseudocode package, specifically) within sphinx-doc. The horsepower comes from pseudocode.js. The extension itself simply streamlines all the setup steps required by pseudocode.js and allows users to directly type $\LaTeX$ algorithms within a sphinx-doc directive, pcode, which is introduced by this extension. The pseudocode.js-specific html block and javascript (js) rendering code are handled automatically by the extension. The rest of the html generation steps are, as usual, done by the sphinx-doc rendering engine.

There aren't many choices when it comes to writing algorithms within sphinx-doc. To the best of my knowledge, sphinx-proof is the only extension I was able to find that offers this capability. However, I think sphinx-proof is best suited to writing simple algorithms at a very high level (e.g., I haven't been able to find a way to write an explicit for loop nicely using this extension). Thus, I think being able to write algorithms using $\LaTeX$ is a nice add-on in this niche territory.

Background

I built this extension based on sphinxcontrib-mermaid. Thus, plenty of boilerplate steps had already been done (e.g., properly setting up the python package). The major development effort was spent on modifying the python files under the sphinxcontrib directory; in particular, pseudocode.py.

To understand the implementation details behind sphinxcontrib-pseudocode, we first need to understand how to use pseudocode.js. The README of pseudocode.js provides detailed steps. Here, I highlight the steps that are related to the implementation detailed later:

After including the necessary js dependencies, we can embed our $\LaTeX$ algorithm within an html block:
```
  <pre id="quicksort" style="display:hidden;">
  
  </pre>
```
To render the algorithm, we need to specify the id of the algorithm html block in the following js render code. In this example, the id is quicksort. We then need to put this render code at the end of the document. As shown in the pseudocode.js example, this piece of code sits right before we close the <body> tag. If there are multiple algorithm blocks within a page, we need to assign each block a unique id and add a corresponding renderElement function call to the <script> block below.
```
  <script>
  pseudocode.renderElement(document.getElementById("quicksort"));
  </script>
```
pseudocode.js can take options that allow users to tweak rendering behavior. As an example, we can pass lineNumber to a js render function to indicate that we want line numbers on the algorithm that the render function is associated with:
```
  pseudocode.renderElement(document.getElementById("quicksort"),
                           { lineNumber: true });
```

The usage of pseudocode.js exposes a few requirements we need to handle in our extension:

We need to assign a unique id to each algorithm block.
Each algorithm block has to be associated with a renderElement function call, which itself may take extra options that are supported by pseudocode.js (e.g., lineNumber).
The <script> block that contains the render function calls needs to be placed at the bottom of the html page.

Implementation Details

Now, we detail the implementation idea behind sphinxcontrib-pseudocode and introduce a few concepts around writing extensions to the sphinx-doc rendering engine.

Generating Unique IDs

To generate a unique id for each algorithm block, I use uuid.uuid4().

Note

add_node() is used to add a new Docutils node class to the sphinx-doc build system. During this function call, we specify the visitor function that is used to render html code during Phase 4 (Writing) of the sphinx-doc build phases ¹.

The ids created with uuid are stored in the Docutils nodes, which can then be referenced during the sphinx-doc build phases. Specifically, we can read the id attribute of a node and fill it into the html template so that sphinx-doc produces the algorithm html block during html generation.
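As a rough illustration of this idea (a sketch, not the extension's actual code), the snippet below generates an id with uuid.uuid4() and fills it into an html template. The template string and the function name render_algorithm_block are hypothetical; the `<pre>` markup simply mirrors the pseudocode.js example shown earlier.

```python
import html
import uuid
from typing import Tuple

# Hypothetical template; the real extension may emit different markup.
PRE_TEMPLATE = '<pre id="{block_id}" style="display:hidden;">\n{body}\n</pre>'


def render_algorithm_block(latex_source: str) -> Tuple[str, str]:
    """Return (block_id, html) for one algorithm block."""
    # uuid4 is random, so two blocks practically never share an id.
    block_id = str(uuid.uuid4())
    # Escape <, > and & so the LaTeX source survives inside the <pre> element.
    body = html.escape(latex_source, quote=False)
    return block_id, PRE_TEMPLATE.format(block_id=block_id, body=body)


first_id, first_html = render_algorithm_block(r"\begin{algorithm} \end{algorithm}")
second_id, second_html = render_algorithm_block(r"\begin{algorithm} \end{algorithm}")
```

Each call produces a fresh id, so two pcode blocks on the same page never collide even if their contents are identical.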

Handling JS scripts and code

There are two parts where the extension needs to deal with js:

We need to install pseudocode.js and its dependencies at the very beginning of the phase in which sphinx-doc converts the parsed document (i.e., a tree of Docutils nodes) into an output format (i.e., html). This is because we need sphinx-doc to include the necessary js scripts at the beginning of the html document being produced.
We also want to create the corresponding pseudocode.js render function calls so that all the algorithm html blocks can be rendered properly. That means we need to:
- fill those function calls with the ids we just created
- create exactly as many render function calls as there are algorithm blocks
- pass in specific options if a pcode directive specifies them
- put those js function calls at the end of the html document

Let's talk about each part in more detail. The core sphinx-doc concepts for our implementation are events and their associated event handlers. Essentially, sphinx-doc emits different events during its build phases and our extension can register event handlers to perform certain tasks when certain events happen.

The first important event is builder-inited. This is the time when we need to supply pseudocode.js and its dependencies. The builder is the object that takes care of converting the parsed document into an output format. Specifically, we use add_js_file to add pseudocode.js. We also add css files ².

Note

When the builder-inited event happens, the document has not been parsed yet. This means sphinx-doc hasn't encountered our pcode directives yet. Thus, the number of algorithm blocks is unknown and we cannot generate render functions at this time.

There is one limitation of sphinx-doc that shapes how we handle part 2. That is, to the best of my knowledge, sphinx-doc doesn't provide a way to insert <script> right before the <body> tag closes (as stated in the pseudocode.js usage guide) during the build. This means all js-related elements (dependent scripts, js function calls) have to appear at the very beginning of the document. In my experiments with pseudocode.js, directly following the example usage (i.e., a <script> block with renderElement calls) in that position fails to render the algorithm blocks. I'm no javascript expert but, as suggested by multiple sources (1, 2), I can use DOMContentLoaded to make the pseudocode.js render functions run after the whole html document is loaded. As a result, I can put all the js elements at the beginning of the html document.

My implementation for part 2 follows sphinxcontrib-katex closely. The very first thing to do is to gather all the ids we have created so that we know how many renderElement function calls we need. We do so when the doctree-resolved event happens. At that point, the parse tree of Docutils nodes (i.e., the doctree) has been created. Thus, we can access all pcode nodes. I follow the TODO extension example to traverse the doctree and collect ids. In addition to ids, we also collect any options that each pcode directive specifies.

Once we have collected ids and options, we can generate a js script called katex_autorenderer.js, which contains all renderElement function calls. As an example, it looks like

document.addEventListener("DOMContentLoaded", function() {
  pseudocode.renderElement(document.getElementById("37667c0e-b9e7-489f-b48e-d64117042cd2"), {lineNumber: true});
  pseudocode.renderElement(document.getElementById("37c7acbe-a36a-4260-a464-9fd6bff71a3c"), {lineNumber: true});
  pseudocode.renderElement(document.getElementById("2bace0ba-4113-4766-b226-13c7a6456925"));
  pseudocode.renderElement(document.getElementById("bb8f4069-52bd-48d0-b782-9d4b0038f2ec"));
  pseudocode.renderElement(document.getElementById("6f8008db-5388-46c6-938e-837c763d7ed9"));
});

Note

We cannot register katex_autorenderer.js at the doctree-resolved event, but we can do so at html-page-context instead. That's where install_js2_part2 comes from: we have to split js generation and js registration into two phases.
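To make the generation step concrete, here is a minimal sketch of how a script like the one above could be assembled from the collected ids and options. The function name build_render_script and the list-of-tuples input are my own choices for illustration; the extension's real code may be organized differently.

```python
import json
from typing import Dict, List, Optional, Tuple


def build_render_script(blocks: List[Tuple[str, Optional[Dict[str, object]]]]) -> str:
    """Build js that renders every collected algorithm block after the page loads."""
    lines = ['document.addEventListener("DOMContentLoaded", function() {']
    for block_id, options in blocks:
        target = "document.getElementById({})".format(json.dumps(block_id))
        if options:
            # json.dumps yields a valid js object literal, e.g. {"lineNumber": true}
            call = "pseudocode.renderElement({}, {});".format(target, json.dumps(options))
        else:
            call = "pseudocode.renderElement({});".format(target)
        lines.append("  " + call)
    lines.append("});")
    return "\n".join(lines)


script = build_render_script([
    ("37667c0e-b9e7-489f-b48e-d64117042cd2", {"lineNumber": True}),
    ("2bace0ba-4113-4766-b226-13c7a6456925", None),
])
print(script)
```

Generating one call per collected id keeps the number of render calls equal to the number of algorithm blocks, and wrapping them in a DOMContentLoaded listener lets the script sit at the top of the page.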

Support References

Supported since 0.5.0

One important feature is to let one easily reference an algorithm written in a pcode directive. As documented in this issue, there are two options for implementing this feature:

Like the recipe extension example, we can add a special reference role like :recipe:ref: to reference any particular algorithm.
Alternatively, just like figures, tables, or sections, we can use the numref role to reference pcode directives as well.

I use the 2nd option. To implement this feature, we need to leverage the add_enumerable_node API. To understand how to use it, let's take a closer look at how :numref: works with figure. Suppose, in an rst document, I have

.. _l7-fig2.3:
.. figure:: /_static/linear-programming/l7-fig2.3.png

   A toy example of LP

This code is rendered as

<figure class="align-default" id="id13">
 <span id="l7-fig2-3"></span>
 <img alt="../_images/l7-fig2.3.png" src="../_images/l7-fig2.3.png">
 <figcaption>
  <p>
  <span class="caption-number">Fig. 97 </span>
  <span class="caption-text">A toy example of LP</span>
  <a class="headerlink" href="#id13" title="Permalink to this image">¶</a>
  </p>
 </figcaption>
</figure>

id13 is the id whose number is automatically incremented by the add_enumerable_node API. l7-fig2-3 is the reference label, which has to be unique. The structure of the HTML code is as follows: <figure> corresponds to the figure:: directive and <img> is the actual content (i.e., image) of the directive. The caption, indicated by <figcaption>, immediately follows the content. Thus, in practice, we use three docutils nodes corresponding to these three components (directive, content, caption). The best way to learn how to use add_enumerable_node is to study the stuffCounter code repository.

In fact, as shown by this code, there is a much simpler way to use the add_enumerable_node API if the directive has a required caption option: we can directly pass a caption getter function to the API. The reason we need the three-node implementation style is that we want the $\LaTeX$ algorithm numbering to be the same as the pcode reference numbering. As an example, suppose we have the code

.. _dummy-algorithm:
.. pcode::

    \begin{algorithm}
    \caption{Dummy Algorithm}
    \begin{algorithmic}
    \PRINT \texttt{'hello world'}
    \end{algorithmic}
    \end{algorithm}

Then, it will be rendered as

The algorithm is numbered 1. Then, when we reference the algorithm via :numref:`dummy-algorithm`, we want "Algorithm 1" to be rendered in HTML as well. To do so, we extract the numeric part from the id of the directive (e.g., 13 in id13 in the figure HTML example) and use the captionCount option offered by pseudocode.js to manually set the $\LaTeX$ numbering.
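The extraction step can be sketched as follows. The helper names are hypothetical, and passing the number to captionCount unchanged is an assumption on my part: depending on how pseudocode.js counts captions, the real extension may need to adjust the value.

```python
import re


def number_from_node_id(node_id: str) -> int:
    """Extract the trailing number from an id such as "id13"."""
    match = re.search(r"(\d+)$", node_id)
    if match is None:
        raise ValueError("no trailing number in id: {!r}".format(node_id))
    return int(match.group(1))


def render_options(node_id: str) -> dict:
    """Build the options passed to pseudocode.renderElement for one block."""
    # Assumption: the extracted number is handed to captionCount unchanged.
    # The real extension may need to adjust it (for example by an offset of one).
    return {"captionCount": number_from_node_id(node_id)}
```

The resulting dictionary would then be serialized into the renderElement call for that block, alongside any other options such as lineNumber.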

Development Environment Setup

To conclude this post, I describe the setup I have for extension development. I referenced this post and code for my setup. Here is my configuration in PyCharm

The central idea is to run/debug the demo docs using sphinx-build. Since the whole project is organized as a python package, we need to modify conf.py so that the demo docs automatically find the extension source code without installing it as a python package.
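A common way to do this is to put the repository root on sys.path from conf.py. The fragment below is a sketch under two assumptions that are not stated in this post: that the demo docs directory sits one level below the repository root, and that the extension is importable as sphinxcontrib.pseudocode.

```python
# conf.py of the demo docs (sketch)
import os
import sys

# Assumption: the demo docs live one directory below the repository root,
# which contains the sphinxcontrib/ package directory.
sys.path.insert(0, os.path.abspath(".."))

# Assumption: the extension is importable under this module name.
extensions = ["sphinxcontrib.pseudocode"]
```

With this in place, running sphinx-build under a debugger lets breakpoints inside the extension source trigger directly.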

Conclusion

That's all I have to say about implementing sphinxcontrib-pseudocode. I hope this post is useful when you build your own sphinx-doc extension.

You can also see the explanation in the "TODO" extension provided by sphinx-doc. ↩
I also reference how sphinx-doc installs MathJax.js for this part. ↩

Build My First PC

2021-09-29T02:11:00+08:00

Requirement
Background
Tools
My Choice
Reference

This page contains my notes on how to build a PC.

Requirement

I use this PC mostly for research. The requirements are listed below:

Large memory. At least 32GiB, with support for expanding to 64GiB. This requirement is for building large software projects fast enough. In addition, large memory allows me to work on in-memory/distributed database development and benchmarking.
Support for video editing. Streaming support is optional but good to have.
Gaming support is not a top priority.

Background

I haven't built a PC before. This section documents the background knowledge that helped me decide which parts to pick.

Note

Components I don't need to research or purchase this time: monitor, keyboard, mouse.

Motherboard

The motherboard connects the graphics card, RAM, CPU, power supply, and storage drives and allows them to interact with each other.

Criteria to consider when choosing a motherboard:

Compatibility with other parts, most notably with the CPU (CPU socket type), and overclocking support
Number of DIMM (Dual In-Line Memory Module) slots (i.e., memory slots, RAM slots)
M.2 compatibility
PCIe generation
PCIe slot size
Motherboard form factor

Detailed notes:

CPU compatibility varies greatly between motherboards, since different generations and brands of processors feature different chipsets and socket types
- chipset: term used to describe a specific hardware configuration of a motherboard. All motherboards with the same chipset have a certain group of features in common, such as the same CPU socket, PCIe generation supported, RAM generation supported, and overclockability. For example, within Intel’s motherboard 500-series there are currently four different chipsets: H510, B560, H570, and Z590. These all have different features, such as different numbers of PCIe lanes, different USB and SATA port counts, and different overclockability.
- socket type: refers to the CPU socket type. The CPU socket houses the CPU, and all major components are connected to it. For example, Intel's 400 and 500-series motherboards have an LGA 1200 socket, which supports Intel's 10th and 11th generation CPUs.

Note

AM4 has been the standard socket type for AMD since the introduction of their 1000 series (Zen 1) CPUs in early 2017. Additionally, AMD 500-series motherboards are not compatible with 1000 or 2000 series CPUs, despite having the same socket type. Intel's 8th and 9th generation CPUs are only compatible with 300-series motherboards (whose socket is known as LGA 1151 Revision 2), while 6th and 7th generation CPUs will only work with 100 and 200-series motherboards.

Ensure the motherboard has M.2 slots ¹ if you decide to use an M.2 storage drive. If a motherboard supports PCIe gen 4, it will usually have at least one M.2 slot that also supports gen 4. This allows for extremely fast data transfer speeds with a gen 4 SSD. Also, make sure to check the bus type supported by the slot (e.g., SATA bus or PCIe bus).
PCIe generation: Newer versions of PCIe have higher bandwidth, which translates to better performance. For example, PCIe 4th generation (the newest revision) has twice the bandwidth of PCIe gen 3. However, besides the motherboard's PCIe generation support, we also need to check the CPU's PCIe generation support. Let’s say we have a PCIe gen 4 SSD. For it to transfer data at gen 4 speeds, both the CPU and the motherboard must support 4th-generation PCIe as well. Otherwise the storage drive will only run at gen 3 speeds (PCIe is backwards compatible).
PCIe slots have different sizes: typically, x1, x4, x8, and x16. For example, "PCIe 3.0 x4" refers to a Gen 3 expansion card or slot with a four-lane configuration. The bigger the size, the more lanes the slot supports. CPUs support only a limited number of PCIe lanes, and the quantity varies between models. A PCIe x16 slot, for example, uses 16 of these lanes, while a PCIe x4 slot uses only four. Different components use different numbers of lanes. For instance, discrete graphics cards use 16, while PCIe SSDs use four apiece. Also, since the number of PCIe lanes that the CPU + motherboard support is fixed, some slots share lanes whereas other slots may have dedicated lanes. See this reddit post as an example.
Check whether the motherboard's BIOS supports PCI Express NVMe drives, so that the drive can act as a bootable device.
There are three types of motherboard:
- ATX: This is the full-sized motherboard, and comes with multiple PCIe slots and four RAM slots. It costs more and takes up more space than the other two sizes, but in exchange allows for greater customizability; you can have more RAM and run dual graphics cards if desired.
- Micro-ATX: This is the mid-sized motherboard. It is generally cheaper than its smaller counterpart, and comes complete with two to four RAM slots and one PCIe slot. This should be enough for just about any PC build, since theoretically you could have 128GB of RAM and just about any graphics card.
- Mini-ITX: This is the smallest motherboard, but it’s typically more expensive than a Micro-ATX. It is usually only used when you need to fit your PC in a very tight compartment.
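The PCIe notes above boil down to two small checks, sketched here in Python. Both helpers are illustrative only: the function names are mine, the 20-lane CPU is a made-up example, and real boards also route some slots through chipset lanes, which this sketch ignores.

```python
def effective_pcie_generation(drive_gen: int, cpu_gen: int, motherboard_gen: int) -> int:
    """A link runs at the oldest generation among the parts involved."""
    return min(drive_gen, cpu_gen, motherboard_gen)


def lanes_left(cpu_lanes: int, devices: dict) -> int:
    """Lanes remaining after each device takes its share (negative means oversubscribed)."""
    return cpu_lanes - sum(devices.values())


# A gen 4 SSD on a gen 4 board with a gen 3 CPU falls back to gen 3 speeds.
print(effective_pcie_generation(drive_gen=4, cpu_gen=3, motherboard_gen=4))

# Hypothetical CPU with 20 usable lanes: one x16 graphics card plus one x4 SSD.
print(lanes_left(20, {"graphics card": 16, "nvme ssd": 4}))
```

A negative result from lanes_left is the situation in which slots end up sharing lanes.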

CPU

The CPU controls every other component and sends instructions to the rest of the system.

Criteria to consider when choosing a CPU:

Without a separate graphics card, the CPU needs to have integrated graphics.
CPU tiers
CPU designations

Detailed notes:

There are two main brands to choose from: Intel and AMD. There are many generations of CPUs (Intel is on its 11th generation, and AMD’s Ryzen processors are on their 5th). However, the most important factor is the CPU tier. There are four main tiers to choose from:
- The 3-tier: Intel's Core i3 and AMD's Ryzen 3 processors. These are the cheapest processors, but also the least powerful. These are typically the best choice if you’re looking to build a PC for basic office-type functions. They’re also great for budget gaming; recent i3 and Ryzen 3 processors like the i3-10100 and Ryzen 3 3100 can easily run AAA games at upwards of 144FPS, making them ideal for budget gaming rigs.
- The 5-tier: Intel's Core i5 and AMD's Ryzen 5 processors. These are considered by many to be the ideal gaming processors; they’re not as expensive as a 7-tier or 9-tier processor, but get similar in-game performance. An i5 or Ryzen 5 is good if you’re looking to build a high-mid tier gaming rig, or if you need to run office-type software (like word processors, Internet browsers, or spreadsheets) at maximum speed.
- The 7-tier: Intel's Core i7 and AMD's Ryzen 7 processors. These are extremely powerful processors and are capable of running games at very high framerates when paired with a good graphics card, and capable of running many programs at the same time with no trouble. Suitable for doing things like video editing that will require more cores. They may also be the best option for you if you plan on having a very large number of programs running simultaneously.
- The 9-tier: Intel's Core i9 and AMD’s Ryzen 9 processors. These are the most powerful CPUs available, and consequently the most expensive.
CPU Designations: some CPUs have a letter tacked on at the end (e.g., i7-11700K, Ryzen 3 3200G). These letters at the end of the model denote certain traits. The most common processor suffixes are listed below:
- (intel) K: means that a specific processor is unlocked, and able to be overclocked. Intel processors are locked by default, and are not overclockable unless they are a “K” model.
- (intel) F: means that a processor requires discrete graphics to operate. Intel processors come with integrated graphics by default, but F-designated processors lack this feature, and need a graphics card to generate an image.
- (intel) T: These processors are power-optimized, and generate much less heat than their standard counterparts. They’re also less powerful.
- (ryzen) G: means that the processor has integrated graphics. These are also known as APUs, or Accelerated Processing Units.
- (ryzen) X: means that the processor has a slightly faster factory clock speed than its non-X counterpart. For example, the Ryzen 5 3600X is slightly faster than the Ryzen 5 3600. In other words, it’s overclocked out of the box.
Some pointers for performance comparison: Tom's Hardware CPU Benchmarks and Performance Hierarchy Charts, AMD vs. Intel: CPU Value Comparison (2021)
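The suffix list above is essentially a lookup table, which can be written down as a small Python helper. The table only covers the five suffixes listed here, and the brand detection (anything without "ryzen" in the name is treated as Intel) is a simplification for this sketch.

```python
import re

# Suffix meanings taken from the list above, keyed by (brand, letter).
SUFFIX_MEANINGS = {
    ("intel", "K"): "unlocked, can be overclocked",
    ("intel", "F"): "no integrated graphics, needs a graphics card",
    ("intel", "T"): "power-optimized, runs cooler but slower",
    ("ryzen", "G"): "has integrated graphics (APU)",
    ("ryzen", "X"): "slightly higher factory clock speed",
}


def describe_suffix(model: str) -> str:
    """Return the meaning of a model's trailing letter, if it is in the table."""
    brand = "ryzen" if "ryzen" in model.lower() else "intel"
    # A suffix is a single letter directly after the model number.
    match = re.search(r"\d([A-Za-z])$", model.strip())
    if match is None:
        return "no suffix"
    return SUFFIX_MEANINGS.get((brand, match.group(1).upper()), "unknown suffix")


print(describe_suffix("i7-11700K"))
print(describe_suffix("Ryzen 3 3200G"))
```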

CPU Cooler

A CPU generates a massive amount of heat, and without proper cooling it will throttle and shut down. A CPU cooler helps avoid this by drawing heat away from the surface of the CPU and blowing it somewhere else, typically into the case. From there the air is blown out by case fans, keeping component temperatures low. Thermal paste applied between the cooler and the CPU helps the cooler remove heat from the CPU and keeps it from overheating.

Two types of cooling systems:

Air coolers: They typically use a combination of a metal heatsink and fan(s) to move heat away from the CPU. All Intel and AMD stock coolers (coolers that come with the CPU) are air coolers.
Water coolers: formally called all-in-one (AIO) water coolers. These systems use a constantly cycling loop of water to cool the CPU. Cool water runs by the processor, and heats up as the CPU transfers heat to it. It continues around the loop and is cooled by a radiator, and the cycle continues.

Criteria to consider when choosing a CPU Cooler:

The cooler's Thermal Design Power (TDP) rating

Detailed notes:

Unless you plan to overclock your CPU (boost the speed it runs at to get better performance from it), the default cooler that comes with your processor should be more than adequate. Most high-end processors don’t come with one though.
Cooler’s TDP rating: TDP actually refers to the amount of heat that a component puts out, but coolers are rated based on how much heat they are able to disperse. A cooler with a TDP of 250W, then, should be able to keep a CPU with a TDP of 250W cool. For reference, the large majority of CPUs are 125W or lower, so a 250W cooler is very powerful. Thus, look up your CPU’s TDP and buy a cooler that can handle that amount of heat. This is only if you bought a high-end processor, or if you’re looking to overclock yours.

Graphics Card

A graphics card, also called a GPU (Graphics Processing Unit) or video card, renders the images on the screen.

Criteria to consider when choosing a graphics card:

Synergy between CPU and graphics card to prevent bottlenecking

Detailed notes:

You won’t need a graphics card for basic office utilities, but for things like gaming, 3D rendering, and high-resolution video editing it’s essential to have a graphics card.
The best way to decide on your ideal graphics card is to look up benchmarks and figure out what card will meet your needs.
Designations: Graphics cards sometimes have suffixes added to the end of them, which denote better performance than the standard version. For Nvidia, these include "Ti" and "Super", while Radeon cards use "XT." For example, a 1660 Ti or 1660 Super is better than a regular 1660, and a Radeon 5700XT is better than a 5700. In general, "Ti" cards tend to be marginally better than their "Super" counterparts, but the difference is typically negligible.
Bottlenecking: If you buy a CPU that is significantly more powerful than your GPU, or vice versa, this can result in a CPU or GPU bottleneck. What this means is that one piece of hardware is maxing out while the other is not using close to its full potential. You’d be better off buying a more powerful CPU and spending a little less on your graphics card, since this way you will get more frames per second at the same cost.
Performance benchmarks: Tom's Hardware GPU Benchmarks and Hierarchy 2021: Graphics Cards Ranked
Tools: Compare PC Processor + Graphics Card benchmarks

Storage Drive

The storage drive(s) determine the computer's storage capacity. There are two main types of drive: HDD and SSD. Furthermore, the SSD category is broken into two types of its own: NVMe and SATA.

HDD: HDD stands for Hard Disk Drive.
SSD: SSD stands for Solid State Drive. An SSD retrieves information much faster than an HDD, but costs more.
- SATA: SATA stands for Serial Advanced Technology Attachment, and refers to the motherboard port that SATA SSDs plug into. These SSDs are slower than their NVMe counterparts, but still much faster than traditional hard drives.
- NVMe: NVMe stands for Non-Volatile Memory Express. More than 5 times faster than most SATA SSDs.

Criteria to consider when choosing storage drives:

storage capacity
SSD vs. HDD
SATA vs. NVMe
Compatibility with motherboard

Detailed notes:

NVMe drive: check read/write speeds, PCIe generation, and compatibility with the motherboard's slots (e.g., an M.2 slot with PCIe gen 4 works best with a PCIe gen 4 NVMe drive).
M.2 slots may carry a specification such as "2242/2260/2280/22110 M-key". "2242" indicates the size of SSD that the M.2 slot supports: 22 millimeters wide and 42 millimeters long. M-key means the M.2 slot supports M.2 SSDs with an M key. An M.2 SSD with an M key doesn't fit in a motherboard slot with only a B key and vice versa. The key is also relevant to speed: SSDs with an M key can handle higher speeds than versions with a B key. A more detailed explanation is in the Dell KB.
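As a small worked example of reading such a specification string, the sketch below splits it into (width, length) pairs and a key type, then checks whether a given drive matches. The function names are hypothetical, and the exact-match rule on the key follows the simple description above; it does not model drives that are notched for both keys.

```python
from typing import List, Tuple


def parse_m2_sizes(spec: str) -> Tuple[List[Tuple[int, int]], str]:
    """Split a spec like "2242/2260/2280/22110 M-key" into sizes and key type."""
    sizes_part, key_part = spec.split()
    sizes = []
    for code in sizes_part.split("/"):
        # The first two digits are the width in mm; the rest is the length in mm.
        sizes.append((int(code[:2]), int(code[2:])))
    return sizes, key_part


def slot_accepts(spec: str, drive_code: str, drive_key: str) -> bool:
    """Check whether a drive's size code and key match what the slot lists."""
    sizes, slot_key = parse_m2_sizes(spec)
    drive_size = (int(drive_code[:2]), int(drive_code[2:]))
    return drive_size in sizes and drive_key == slot_key


SLOT_SPEC = "2242/2260/2280/22110 M-key"
print(parse_m2_sizes(SLOT_SPEC))
print(slot_accepts(SLOT_SPEC, "2280", "M-key"))
```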

RAM

RAM stands for Random Access Memory, and the amount of RAM you buy will determine how much temporary data you can store for near-instant access.

Criteria to consider when choosing RAM:

DDR Type
Capacity
Speed
Latency

Detailed notes:

DDR4 is the latest RAM technology and is about twice as fast as its predecessor DDR3. DDR3 can no longer be used with modern motherboards.
All RAM modules come with an advertised clock speed. 3200MHz (strictly speaking, 3200 million transfers per second) is the sweet spot in price to performance. We recommend this for the majority of users. If you opt for a Ryzen processor you may benefit from faster RAM. 3600MHz is probably your best bet in this scenario.
Generally in the RAM’s product description or product title, we can see a number that reflects the CAS latency. It will usually be written as "Cxx" or "CLxx". C16 or less is ideal; you’ll find that the majority of 3200MHz RAM is C16. Realistically, you won’t notice the difference in any RAM below 20, but don’t buy memory with a latency higher than that.
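Speed and CAS latency interact: the CL number counts clock cycles, so the same CL at a higher speed means a shorter delay in absolute terms. The sketch below converts a CL value to nanoseconds using the standard relation for DDR memory (the clock runs at half the advertised transfer rate); the function name is mine.

```python
def cas_latency_ns(cas_latency: int, data_rate_mts: int) -> float:
    """Convert a CAS latency in clock cycles to nanoseconds."""
    # DDR memory transfers data twice per clock, so the clock runs at half
    # the advertised rate (e.g. 1600 MHz for a "3200MHz" kit).
    clock_mhz = data_rate_mts / 2
    # One cycle at clock_mhz lasts 1000 / clock_mhz nanoseconds.
    return cas_latency / clock_mhz * 1000


print(cas_latency_ns(16, 3200))  # C16 at 3200
print(cas_latency_ns(16, 3600))  # C16 at 3600, shorter delay for the same CL
```

This is why a C16 kit at 3600 is slightly quicker to respond than a C16 kit at 3200, even though both are labeled with the same latency number.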

Power Supply Unit (PSU)

PSU is computer’s power source. It directs electricity from a wall outlet to a computer’s motherboard, where it can be distributed to all of the components as needed.

Criteria to consider when choosing a PSU:

Wattage
Modularity
Brand
Efficiency

Detailed notes:

Wattage: when choosing a power supply, the primary consideration should be its wattage. The safest method is to use Newegg’s Power Supply Calculator or a similar tool to see how much power your system will draw. Multiply this number by 1.3 and then round up to the next multiple of 50. That’s the power supply wattage you want. For example, if Newegg's calculator estimates that the total system wattage will be 600W, then we buy an 800W PSU ($600 * 1.3 = 780$, rounded up to the next multiple of 50). This is a good rule of thumb to use in order to account for any sudden spikes in energy usage that may occur. Most calculators estimate based on a component’s power draw at low usage, rather than at max load. Ten times out of ten, you’ll be better off having too much power rather than too little.
Modularity: Modularity means, essentially, the customizability of a power supply.
- Fully modular means that every single power cable can be removed, allowing you to only use cables that are needed.
- Non-modular power supplies have all of the cables built-in, and you are unable to remove them. This means that, with non-modular power supplies, you will probably end up having excess cables that aren’t connected to anything that are still taking up space in your case. The only benefit is that non-modular PSU's are cheaper.
- Semi-modular (a hybrid between non-modular and fully modular) is recommended and most practical: The essential cables, such as the ATX cable ² that powers your motherboard and your CPU cable, are built in. Other cables like the 8-pin used for most graphics cards are modular, so you can use them if needed, but not have extra unused cables in your case.
Brand: EVGA or Corsair PSUs are recommended. A cheaply built power supply has the potential to ruin the rest of your components, start a fire, and generally wreak havoc on your system.
Efficiency is measured by the "80-Plus" rating. When a PSU sends power to your computer, some percentage of the power from your outlet never reaches the computer, and is instead released as heat. The more heat is released, the less power reaches the computer and the less efficient the PSU is. If 18% of the total wattage coming from the wall is lost in transit to your PC, your PSU is 82% efficient, and thus would earn a Bronze rating. See the 80-Plus rating chart. Normally, stick to the Bronze rating unless the marginal efficiency improvement (efficiency gained / cost difference) is significant. For example, if a Titanium-rated PSU offers a 9% jump in efficiency at full load but costs $75 more, it would take nearly 10,000 hours, or over a year, of your computer running at full load to save the extra $75 you spent on it (assuming a $.12/kWh electricity cost).
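Both the wattage rule of thumb and the efficiency break-even estimate are simple arithmetic, sketched below. The function names are mine. For the break-even example, the 500W load and the 82% and 91% efficiency figures are assumed values chosen to roughly match the scenario in the notes; they are not taken from any rating chart.

```python
def recommended_psu_watts(estimated_watts: int) -> int:
    """Apply the rule of thumb: add 30% headroom, then round up to a multiple of 50."""
    scaled = estimated_watts * 13  # ten times the target, kept as an integer
    return -(-scaled // 500) * 50  # ceiling division by 500, then back to watts


def break_even_hours(extra_cost, load_watts, eff_low, eff_high, price_per_kwh):
    """Hours at the given load before the more efficient PSU pays for itself."""
    # Power drawn from the wall is the delivered load divided by efficiency.
    saved_kw = (load_watts / eff_low - load_watts / eff_high) / 1000
    return extra_cost / (saved_kw * price_per_kwh)


print(recommended_psu_watts(600))  # 600 * 1.3 = 780, rounded up to 800

# Assumed scenario: 500W load, 82% vs 91% efficient, $75 extra, $0.12 per kWh.
print(round(break_even_hours(75, 500, 0.82, 0.91, 0.12)))
```

Under these assumed numbers the break-even point lands a little above 10,000 hours, which is in line with the estimate quoted in the notes.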

PC Chassis

Chassis (also called case or tower) holds all of the components inside, and is the hub to which the user connects almost all external cables. These include display cables like HDMI and DisplayPort, USB connectors, Ethernet cables, audio jacks, and more. Computer cases come in three main sizes: Full tower, Mid tower, and Mini tower.

Criteria to consider when choosing a case:

Need
Case with fans (=> good airflow)
Compatibility between case and motherboard
Cable management (less important)

Detailed notes:

Unless you need to store your computer in a very small compartment, Mid or Full towers are the best option. If you get a Mini tower you’ll probably need to get a smaller form-factor ITX motherboard, which will cost you more money than a standard one.
If you need a case with a CD/DVD tray, SD card reader, or anything else specific, make sure that the one you order has those features.
Look for a case that comes with fans installed, as this will help keep your entire system cool and allow for good airflow. Ideally, you’ll have at least one in both the front and back. Alternatively, you can buy extra system fans if your case is compatible.
Make sure to get a case with good airflow. This requires actually taking a detailed look at the case or its pictures.
Cases will have a set list of motherboard types they support so, for example, a smaller case probably won’t support a full-sized ATX motherboard.

Tools

PC Part Picker to perform compatibility check on parts.

My Choice

CPU: AMD Ryzen 9 5900X 3.7 GHz 12-Core Processor
CPU Cooler: Noctua NH-D15 chromax.black 82.52 CFM CPU Cooler
Motherboard: MSI MPG B550 GAMING EDGE WIFI ATX AM4
Memory: Crucial Ballistix 64 GB (2 x 32 GB) DDR4-3600 CL16 Memory
Storage: SK hynix Gold P31 1 TB M.2-2280 NVME Solid State Drive
Case: Fractal Design Meshify C ATX Mid Tower Case
Power Supply: Super Flower Leadex III Gold 850 W 80+ Gold Certified Fully Modular ATX Power Supply
Graphics Card: A friend gives me a used graphics card

Reference

What Are the Parts in a Computer? Basic Components Overview
How to Pick your PC Parts: Component selection basics
Motherboard Anatomy: Connections and Components of the PC Motherboard
Buying a Solid-State Drive: 20 Terms You Need to Know
The Best PCI Express NVMe Solid State Drives (SSDs) for 2021: despite its title, the article contains many points you need to consider when picking an SSD, as well as slot support from the motherboard.
What should I keep in mind when buying a M.2 SSD?: explains the B-key and M-key difference and the M.2 slot specification.
What are PCIe x1 Slots Used For?: introduces general concepts on PCIe slots and lanes, with an answer about the PCIe x1 slot.
Exactly How to Choose a CPU: Complete Guide
How to Check PCIe M.2 NVMe SSDs Compatibility with your PC or Motherboard

For M.2 storage drives or WiFi/Bluetooth expansion cards. Learn more here. ↩
Corresponds to Motherboard Power Connector (i.e., ATX Power Connector) on the motherboard. ↩

"Synthesizing Data Structure Transformations from Input-Output Examples"

2021-01-28T04:20:00+08:00

The paper presents a method to synthesize functional programs that transform recursive data structures (e.g., lists, trees)
- Examples: see Figure 6 (e.g., join, cprod)
- essentially shows how one can orchestrate a list of operators to generate the desired program
Three techniques
- type-aware inductive generalization
  - purpose: create hypotheses that represent some or all properties of the target program
    - can be open -> contains "holes"
    - can be closed -> if is consistent with examples, this is the program
  - input: inferred types from examples
    - use types -> prune the search space
  - output: a stream of hypotheses (closed and open) that match types
  - how:
    - For open hypothesis, application of a higher-order combinator (Figure 3) to set of (known or unknown) arguments (use inferred types of examples to guide the selection of combinator)
    - For closed hypothesis, recursive procedure like enumerative search
- deduction
  - purpose: solve unknown functions in hypotheses
  - Two techniques:
    - Refutation: use counter-example to reject hypotheses
    - Example inference: uses properties of combinators to infer new examples on unknown functions
- best-first enumerative search
  - weighted BFS idea -> use priority queue
  - weight created from cost model: simple is better (avoid degenerate case: prog packed with if-blocks)
Main algorithm
- a priority queue $Q$ of subtasks $(e, f, \mathcal{E})$
  - $e$: a hypothesis
  - $f$: a hole in hypothesis
  - $\mathcal{E}$: a set of examples
- pop the head of queue and obtain a subtask to work on
  - if $e$ is closed, check against input examples
    - ok -> got a solution; discard otherwise
  - if $e$ is open
    - synthesize $f$ in $e$
      - infer type of $f$ from $\mathcal{E}$
      - use inductive generalization to create a stream of hypotheses $H$ for $f$
      - replace $f$ with each $h \in H$ to obtain a set of new hypotheses. Say one of them is $e'$
        - if $e'$ is closed -> put $(e', \perp, \emptyset)$ as a subtask to queue
        - if $e'$ is open
          - new subtask for each hole $f^*$ in $e'$
          - uses deduction on $f^*$ to create new examples $\mathcal{E}^*$ or refute
          - add $(e', f^*, \mathcal{E}^*)$ as a subtask if not rejected
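
The queue-driven loop above can be illustrated with a toy best-first search. This is only a sketch under loud assumptions: the "language" is a made-up arithmetic grammar over x and 1 (not the paper's combinators), the cost model is simply the token count, and both type-aware generalization and deduction are left out, so every hole is filled by blind enumeration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy hypothesis: an arithmetic expression in prefix notation over the tokens
// 'x', '1', '+', '*', where '?' marks a hole (an open hypothesis).
typedef std::string Hypothesis;
typedef std::pair<long, long> Example;  // (input x, expected output)

// Evaluate a closed hypothesis from position pos, advancing pos past it.
long eval_at(const Hypothesis& e, std::size_t& pos, long x) {
  char tok = e[pos++];
  if (tok == 'x') return x;
  if (tok == '1') return 1;
  long lhs = eval_at(e, pos, x);
  long rhs = eval_at(e, pos, x);
  return tok == '+' ? lhs + rhs : lhs * rhs;
}

long run(const Hypothesis& e, long x) {
  assert(!e.empty() && e.find('?') == std::string::npos);  // closed only
  std::size_t pos = 0;
  return eval_at(e, pos, x);
}

bool consistent(const Hypothesis& e, const std::vector<Example>& examples) {
  for (std::size_t i = 0; i < examples.size(); ++i)
    if (run(e, examples[i].first) != examples[i].second) return false;
  return true;
}

// Cost model: fewer tokens is better, so the cheapest hypothesis is popped first.
struct Costlier {
  bool operator()(const Hypothesis& a, const Hypothesis& b) const {
    return a.size() > b.size();
  }
};

// Returns the first closed hypothesis consistent with the examples,
// or an empty string if none exists within max_cost tokens.
Hypothesis synthesize(const std::vector<Example>& examples, std::size_t max_cost = 9) {
  const char* fillers[] = {"x", "1", "+??", "*??"};
  std::priority_queue<Hypothesis, std::vector<Hypothesis>, Costlier> queue;
  queue.push("?");
  while (!queue.empty()) {
    Hypothesis e = queue.top();
    queue.pop();
    std::size_t hole = e.find('?');
    if (hole == std::string::npos) {  // closed: check against the examples
      if (consistent(e, examples)) return e;
      continue;                       // inconsistent: discard
    }
    for (std::size_t i = 0; i < 4; ++i) {  // open: fill the first hole
      Hypothesis next = e;
      next.replace(hole, 1, fillers[i]);
      if (next.size() <= max_cost) queue.push(next);
    }
  }
  return "";
}
```

Given the examples 1 -> 2, 2 -> 5, and 3 -> 10, the search returns a five-token program equivalent to x * x + 1. Because filling a hole never lowers the cost, the first consistent closed hypothesis popped is also a cheapest one, which is the "simple is better" bias the cost model is meant to provide.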

Graph Data Models

2021-01-14T23:20:00+08:00

It has been a very long time since my last post. As you might know, I left AWS and started to work on my PhD. Research takes almost all my time. I have written a lot for my research, which consumes all my blogging energy. However, I decided to try out a new format on blogging - blogging topics around my research via short videos. Below is my first one. Please let me know what you think in the comment section below.

Here is a short summary on the content:

In the video above, I talked about two commonly-seen data models used by graph databases

Edge-labelled graph (basis of Triplestore)
Property graph (used by Neo4j, TigerGraph, etc)

In addition, I discussed the difference between graph models and the relational model by showing how we can model the same piece of data from The Office under different models. I also listed the advantages that graph models can offer from the user's perspective.

Note

For Chinese viewers, you can find the video on BiliBili.

Status Update

2020-03-04T12:20:00+08:00

As you all probably have noticed, I have been quiet since October last year. Here is a status update on my side: September last year, I was laid off due to the strategic change of my last employer and I was busy with job hunting. Luckily, I'm able to join AWS in January this year to work on a new database service that will be launched later this year. In my spare time, I'm working on a project that is not readily available to the public and that steals lots of blogging time. However, the good news is that I'm still writing: I have written a set of notes that will be made public once the time is right. So, stay tuned!

Semaphore

2019-10-20T17:10:00+08:00

Concurrency is a big topic that I'm planning to write more about in the upcoming days. In this post, I'll cover the concept of "semaphore", a very important concept when we talk about synchronization. I'll walk through the concept and offer an implementation of semaphore in C++ with a working example. Lastly, I'll apply golang's channel concept to help us better understand semaphore.

Why Semaphore?
Concept and Implementation
Example
Glimpse from Golang
Summary

Why Semaphore?

Lock (e.g., mutex), condition variable, and semaphore are three tightly-coupled concepts that everyone learns about in their undergraduate OS course. Some textbooks present those concepts in the order shown above and talk about the missions that each concurrency primitive can achieve. In practice, however, I always find it challenging to choose when multiple options are available for a specific usage scenario. The following are some rules of thumb I like to use when working with those concepts in practice:

Semaphore replaces "everything": I can use a semaphore both as a lock to protect shared data across multiple threads (i.e., threads can access data but only one at a time) and as a condition variable ¹ to order events (e.g., ordering the access of shared data; ordering the execution of threads).
Mutex is just an easier tool to use in the lock scenario. In other words, we can use a semaphore as the lock, but implementing the semaphore itself requires more LOC than using a mutex (i.e., if we only need a lock, use a mutex).
Use condition variable and mutex to implement semaphore, and use semaphore as the abstraction for the rest of the code. Certainly, one can say they are using condition variable + mutex for thread ordering but, essentially, they are using a semaphore (by constructing it with condition variable + mutex first) without explicitly stating it.

Note

I find that how we understand those three concepts is implicitly ill-shaped by C, which is the default language used in undergraduate OS courses. For example, C POSIX has a nice "semaphore.h" interface that allows users to use semaphores directly. The existence of such an interface leads one to use semaphore.h for semaphores and pthread.h for condition variables and mutexes. The consequence is that one can easily think that if they are using condition variable + mutex, they are using a condition variable instead of a semaphore on the conceptual level. This is wrong! What they are really doing is first constructing a semaphore using condition variable + mutex and then using the semaphore to achieve the desired goal. The only difference between their usage and using a semaphore directly is the lack of the semaphore as an object abstraction. From this angle, I think C++ with the LLVM compiler on Mac does a much better job by deprecating the POSIX semaphore.h C library. This forces me to use the standard mutex and condition variable to implement a semaphore first before using it (which also makes the code much more portable), and it helps me discover this subtle relationship among the three concepts that is masked in the OS and C world. However, I'm not stating that every condition variable + mutex usage is equivalent to a semaphore, but a semaphore can achieve the same purpose as a condition variable with cleaner encapsulation.

Hopefully, by now, I have convinced you why we need semaphore: it is such an indispensable tool in the pocket when we want to deal with commonly-seen concurrency situations: used as a lock to protect a shared variable; used as an ordering mechanism to control thread execution order and concurrent event ordering.

Concept and Implementation

Before we jump into semaphore, let's revisit the mutex and condition variable concepts first because we will leverage those two concepts in our C++ implementation. A mutex is used to allow many threads to access the same variable but only one at a time. It is a useful tool to avoid a data race: a situation where two threads access the same variable concurrently and at least one of the accesses is a write (e.g., Alice deposits $200 into a bank account whose initial balance is 0; Bob deposits $100 into the same account, and the final balance can end up as $200 instead of $300 ²). Whenever a thread wants to modify a shared variable, it needs to acquire the mutex first and release the mutex after the modification. A condition variable is used to put a thread to sleep until the condition the thread is waiting for becomes true. A condition variable is an explicit queue that threads can put themselves on when some state of execution (i.e., some condition) is not as desired (by waiting on the condition); some other thread, when it changes said state, can then wake one (or more) of those waiting threads and thus allow them to continue (by signaling on the condition).

A semaphore is an object with an integer value that we can manipulate with an increment-value method (let's denote such method as post()) and a decrement-value method (let's denote such method as wait()). Then, the semantics of semaphore is defined by the functionality of post() and wait().

Note

In the following, when we talk about the value of semaphore, we really mean the integer value contained inside semaphore.

Here is the pseudo code for the semaphore implementation:

void post() {
  // increment the semaphore value by one
  // if there are one or more threads waiting on the semaphore, wake one
}

void wait() {
  // decrement the semaphore value by one
  // wait if the resulting semaphore value is negative
}

Now, let's implement semaphore in C++ with the semantics stated above.

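#include <condition_variable>
#include <cstddef>
#include <mutex>

class Semaphore
{
private:
  std::size_t avail;           // the semaphore value
  mutable std::mutex m;        // protects avail; mutable so available() can lock it
  std::condition_variable cv;  // threads blocked in wait() sleep here

public:
  // only one thread can call this; by default, we construct a binary semaphore
  explicit Semaphore(std::size_t avail_ = 1) : avail(avail_) {}

  void wait()
  {
    std::unique_lock<std::mutex> lk(m);
    // sleep until the value is positive, then take one unit
    cv.wait(lk, [this] { return avail > 0; });
    avail--;
  }

  void post()
  {
    std::unique_lock<std::mutex> lk(m);
    avail++;
    cv.notify_one();  // wake one waiting thread, if any
  }

  std::size_t available() const
  {
    // take the lock so the read does not race with wait() and post()
    std::unique_lock<std::mutex> lk(m);
    return avail;
  }
};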

Since we modify the integer value of the semaphore (avail), which can be updated by multiple threads concurrently, we need to use a mutex (m) to ensure only one thread does the update at a time. In addition, since threads need to wake up or wait depending on the situation, we need to use a condition variable (cv). The implementation itself follows some standard practice when working with a condition variable.

Note

Here, we describe the procedure in the C++ context, but the steps generally apply in other languages. Also, the standard practice is normally carried out by the thread itself. However, since the thread simply calls post() and wait(), the practice is now presented in the post() and wait() implementation.

For thread calling post() (i.e., the thread that intends to modify the value of semaphore), post() has to:

acquire a std::mutex (usually via std::lock_guard)
perform the modification while the lock is held (e.g., avail++)
execute notify_one or notify_all on the std::condition_variable ³

For thread calling wait() (e.g., any thread that intends to wait on the condition variable in the semaphore), wait() has to:

acquire a std::unique_lock on the same mutex as used to protect the shared variable
execute wait ⁴, wait_for, or wait_until. The wait operations atomically release the mutex and suspend the execution of the thread.
when the condition variable is notified, a timeout expires, or a spurious wakeup occurs, the thread is awakened, and the mutex is atomically reacquired. The thread should then check the condition and resume waiting if the wake up was spurious ⁵.

One might notice that there is no explicit unlock, nor a while loop to check the wait condition (avail > 0), inside the implementation. Actually, those operations happen but are hidden by the C++ library implementation:

We acquire the mutex lock via std::unique_lock. Since std::unique_lock guarantees that the mutex is unlocked on destruction, and the std::unique_lock destructor is automatically invoked when wait() or post() exits, the mutex will be unlocked on function exit.
std::condition_variable::wait has the function signature template< class Predicate > void wait( std::unique_lock<std::mutex>& lock, Predicate pred ); (e.g., cv.wait()) and inside the function, it does

while (!pred()) {
    wait(lock);
}

Thus, cv.wait(lk, [this] { return avail > 0; }); can be expanded as

while (avail <= 0) {
    wait(lk);
}

An intuitive reading of cv.wait(lk, [this] { return avail > 0; }); is that the thread will wait until avail > 0, the condition contained in the lambda function argument. This can probably save some brain power otherwise spent on the expansion shown above.

To make the implementation easier to understand, I borrow the reference implementation in C from OSTEP. Hopefully, it will make the C++ implementation clearer:

typedef struct __Zem_t {
  int value;
  pthread_cond_t cond;
  pthread_mutex_t lock;
} Zem_t;

// only one thread can call this
void Zem_init(Zem_t*s, int value) {
  s->value = value;
  Cond_init(&s->cond);
  Mutex_init(&s->lock);
}

void Zem_wait(Zem_t*s) {
  Mutex_lock(&s->lock);
  while (s->value <= 0)
    Cond_wait(&s->cond, &s->lock);
  s->value--;
  Mutex_unlock(&s->lock);
}

void Zem_post(Zem_t*s) {
  Mutex_lock(&s->lock);
  s->value++;
  Cond_signal(&s->cond);
  Mutex_unlock(&s->lock);
}

Example

As one can probably see, semaphore doesn't offer much freedom when it comes to how much we can customize it. The only thing that can be set by the user is the initial integer value of the semaphore (e.g., avail). In fact, that's the beauty of the semaphore: we can achieve various purposes with semaphore by initializing it with different values. For example, we can initialize the semaphore with value 1 to make the semaphore work like a lock (such a semaphore is called a binary semaphore):

Semaphore lk(1);
int balance = 0;
// deposit called by multiple threads to update balance
void deposit() {
  lk.wait();
  balance += 100;
  lk.post();
}

Here is an intuitive description to understand semaphore based on this deposit example: Suppose there are five people who want to deposit money at the bank. The semaphore works like a key hanging on the bank door. Since the semaphore value is initialized with one, there is only one key hanging on the door. The first person who arrives at the bank is able to grab the key, open the door, and deposit money. The remaining four people have to wait in line for the key. Once the first person finishes depositing, he puts the key back on the door, pats the next person's shoulder, and tells her the key is available and she can go in. Now, the second person checks that the key is indeed hanging on the door and grabs it to make her deposit. The whole process repeats for the remaining three people waiting in line. This describes the situation of five threads with one semaphore.
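
The five-people-one-key picture can be tried out with a small driver. This is a minimal sketch: run_deposits and its parameters are hypothetical names for illustration, and the Semaphore class is repeated (with an int value for brevity) only so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Same wait()/post() semantics as the Semaphore class above.
class Semaphore {
  int avail;
  std::mutex m;
  std::condition_variable cv;

public:
  explicit Semaphore(int avail_ = 1) : avail(avail_) { assert(avail_ >= 0); }
  void wait() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return avail > 0; });
    avail--;
  }
  void post() {
    std::unique_lock<std::mutex> lk(m);
    avail++;
    cv.notify_one();
  }
};

// Hypothetical driver: `people` threads each deposit 100, `rounds` times.
int run_deposits(int people, int rounds) {
  Semaphore key(1);  // one key hanging on the bank door
  int balance = 0;
  std::vector<std::thread> threads;
  for (int i = 0; i < people; ++i) {
    threads.emplace_back([&key, &balance, rounds] {
      for (int r = 0; r < rounds; ++r) {
        key.wait();      // grab the key (or wait in line for it)
        balance += 100;  // critical section: only the key holder is inside
        key.post();      // hang the key back and wake the next person
      }
    });
  }
  for (auto& t : threads) t.join();
  return balance;
}
```

Because every update to balance happens while the key is held, run_deposits(5, 1000) always returns 500000. Removing the wait()/post() pair turns the update into a data race, and the total may come out lower.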

We can also set the semaphore value properly to order thread execution. Consider the Leetcode Problem: 1114. Print In Order, where we pass the same object Foo into three threads and let them call the appropriate print methods to print "first", "second", and "third" in that order, without restricting which thread is assigned to print which word. For a detailed example, see the problem description. The following implementation shows how we can use the semaphore implementation shown above to achieve such ordering: always print "first", "second", and "third" in order:

class Foo
{
  Semaphore firstJobDone;
  Semaphore secondJobDone;
public:
  Foo(): firstJobDone(0), secondJobDone(0) {}

  void first(function<void()> printFirst)
  {
    printFirst();
    firstJobDone.post();
  }

  void second(function<void()> printSecond)
  {
    firstJobDone.wait();
    printSecond();
    secondJobDone.post();
  }

  void third(function<void()> printThird)
  {
    secondJobDone.wait();
    printThird();
  }
};

Both firstJobDone and secondJobDone have a semaphore value of 0 when the Foo class is constructed. Both second() and third() will call the wait() method on their respective semaphores; unless post() is called before wait(), the value of the semaphore will be 0, which makes the threads calling second() and third() pause, waiting on their respective semaphores. Since first() doesn't call wait(), first() will be executed first. Then, firstJobDone.post() will be called, which brings the firstJobDone semaphore value to 1 and wakes up the thread that waits on the semaphore: in this case, it's the thread calling second(). Since the secondJobDone semaphore value is still 0, the thread calling third() will continue to pause until the thread with second() finishes its work and increments the semaphore value. The complete implementation with test can be seen here. Note that the semaphore implementation used is abstracted in mysemaphore.h with implementation here. Note that for this specific question, we initialize semaphores in the Foo class. This is ok in this situation as there will be only one Foo object. However, a more common case is to initialize semaphores globally.

Note

One may have noticed that the semantics of the semaphore implementation is slightly different from the semantics of wait() and post() we defined above with the pseudo code comments. In fact, semaphore originally has the invariant that the value of the semaphore, when negative, is equal to the number of waiting threads ⁶. We use the invariant in the semantics definition, instead of the actual implementation, to help remember the semaphore functionality. We certainly can modify the implementation to make it follow the invariant: we switch the type of avail from size_t to int and change the wait condition from avail > 0 to avail >= 0. The complete modified implementation can be seen here. Note that to use this version of the implementation in the "print in order" example above, we need to initialize both semaphores to -1 instead of 0.

Glimpse from Golang

Recently, I learned Golang for work. The idea of using channel, a concept based on communicating sequential processes (CSP), as the fundamental concurrency primitive really opens my mind (I've seen the idea in Python and Rust but Golang abuses the concept heavily). Such a novel construction naturally brings a new angle to re-examine the old concept: semaphore. The following examples are taken from The Go Programming Language, pages 241, 250, and 262.

The purpose of a semaphore can be seen in limiting parallelism: for example, allowing no more than 20 concurrent calls to open a file. In the golang world, we can use a buffered channel of capacity $n$ to construct the semaphore (or counting semaphore if we want to be specific). Recall that a buffered channel has a queue of elements with the maximum queue size $n$ (i.e., capacity). A send operation on a buffered channel inserts an element at the back of the queue, and a receive operation removes an element from the front. Then the goroutine is blocked under the following conditions:

If the channel is full, the goroutine with send operation will be blocked until space is available in the queue due to a receive by another goroutine.
If the channel is empty, the goroutine with the receive operation is blocked until a value is sent to the queue by another goroutine.

Then, conceptually, to implement semaphore, each of the $n$ vacant slots in the channel buffer (i.e., queue) represents a token entitling the holder to proceed. Token is acquired by sending a value to the channel and is released by receiving a value from the channel. Receiving a value creates a new vacant slot that potentially allows other goroutine to send the value (i.e., acquire token). As one can probably see, with this setting, we can allow at most $n$ sends without an intervening receive.

Implementing a semaphore in this way is surprisingly simple: all we need is a buffered channel and to perform send or receive appropriately. The book provides an example of crawling the web (the book is from Google engineers; no wonder):

// tokens is a counting semaphore used to
// enforce a limit of 20 concurrent requests.
var tokens = make(chan struct{}, 20)

func crawl(url string) []string {
    fmt.Println(url)
    tokens <- struct{}{} // acquire a token
    list, err := links.Extract(url)
    <-tokens // release the token

    if err != nil {
        log.Print(err)
    }
    return list
}

If we compare this Go implementation with our C++ one, we can see that tokens <- struct{}{} works like wait(): if we cannot send a value to the channel, we wait until the channel has some vacant spot. <-tokens works like post(): we're done and thus release the token so that another goroutine can do the work. To generalize this implementation, we can normally write:

// 20 is the initial semaphore value
var semaphore = make(chan struct{}, 20)

func some_work() {
    semaphore <- struct{}{}        // acquire token
    defer func() { <-semaphore }() // release token
    // do the work ...
}

To see how a semaphore works as a lock (mutex) in Go, we can revisit the previous concurrent balance update situation and make both depositing and checking the balance concurrent-safe:

var (
    semaphore = make(chan struct{}, 1) // a binary semaphore guarding balance
    balance int
)

func Deposit(amount int) {
    semaphore <- struct{}{} // acquire token
    balance = balance + amount
    <-semaphore // release token
}

func Balance() int {
    semaphore <- struct{}{} // acquire token
    b := balance
    <-semaphore // release token
    return b
}

Summary

Semaphore is a really powerful concurrency construct that can work like a lock, perform ordering, and limit parallelism. I always like to reach for semaphore first and use it exclusively without scratching my head to write a specific condition variable + mutex combo for each kind of situation. Certainly, there will be cases where using condition variable + mutex is more straightforward, but I find them very rare.

The similarity of semaphore and condition variable is suggested by the origin of condition variable idea: Dijkstra's use of "private semaphore". ↩
Think about why this can happen. If it is not clear, see The Go Programming Language book p.259. ↩
"Always hold the lock while signaling" is one important tip for concurrency correctness mentioned in OSTEP. There is a 4th step, releasing the lock, that is omitted here because it is automatically taken care of by the destructor of std::lock_guard. ↩
Inside condition_variable::wait, unlocking the lock, blocking the current executing thread, and adding it to the list of threads waiting on *this are executed atomically. ↩
That's why we always use while instead of if when deciding whether to wait on the condition. In classroom, we normally learned the practice from the difference between Mesa semantics and Hoare semantics. See OSTEP for more details. ↩
The invariant is stated in one of Dijkstra's papers. ↩

Secure Connection with MariaDB: A Conceptual Approach

2019-09-14T23:20:00+08:00

In this post, I discuss how we can have a secure connection with MariaDB by first understanding the computer science fundamentals behind the steps. Once we have built the concept model, the steps linked above are self-explanatory. In addition, this post applies the same concept to understand the SSL certificate.

Concepts
Examples
- Secure Connection with MariaDB
- SSL Certificate

Concepts

The key concept behind secure connection is certificate, which relies on digital signature. The following picture shows the key idea of digital signature:

Suppose PayPal wants to ask the end user for personal information. To allow the user to verify that PayPal indeed sent the message and that the message (e.g., "Please enter your personal info") is not modified, PayPal will use its private key to sign the message and distribute its public key to the end user. If the user can decode that message with PayPal's public key, then the end user can confirm two pieces of information:

The sender is PayPal assuming the public key at hand truly belongs to PayPal (Authentication)
The message is not modified in transit (Integrity)
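
To make the sign-then-verify idea concrete, here is a toy sketch using textbook RSA with tiny numbers. Everything in it is an assumption for illustration only: the key values are the classic classroom example (p = 61, q = 53), the message is represented as a small number standing in for a digest, and no hashing or padding is done, so it is in no way usable for real security.

```cpp
#include <cassert>
#include <cstdint>

// Toy RSA key pair built from the textbook primes p = 61 and q = 53.
const std::uint64_t kModulus = 3233;     // n = p * q, part of both keys
const std::uint64_t kPublicExp = 17;     // e, handed out to everyone
const std::uint64_t kPrivateExp = 2753;  // d, kept secret by the signer

// Modular exponentiation by repeated squaring: base^exp mod mod.
std::uint64_t mod_pow(std::uint64_t base, std::uint64_t exp, std::uint64_t mod) {
  std::uint64_t result = 1;
  base %= mod;
  while (exp > 0) {
    if (exp & 1) result = result * base % mod;
    base = base * base % mod;
    exp >>= 1;
  }
  return result;
}

// The signer (e.g., PayPal) signs a message digest with its private key.
std::uint64_t sign(std::uint64_t digest) {
  assert(digest < kModulus);
  return mod_pow(digest, kPrivateExp, kModulus);
}

// Anyone holding the public key checks that the signature decodes to the digest.
bool verify(std::uint64_t digest, std::uint64_t signature) {
  return mod_pow(signature, kPublicExp, kModulus) == digest;
}
```

If the message is changed after signing, or the signature was not produced with the matching private key, verify() returns false. That single check is what gives the end user both authentication and integrity, under the assumption that the public key truly belongs to the claimed sender.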

Now, let's examine Authentication further. In the above scheme, the end user obtains the public key from PayPal in order to decode the message that PayPal sends. This can be problematic: how do we know the public key the end user obtained is truly sent by PayPal? In other words, if a server owned by a malicious attacker sends its public key to the user as if the server were owned by PayPal, then the end user's personal information will be in danger: of course the end user can use the public key from the malicious server to verify the authenticity and integrity of the message sent by the malicious server; then the personal information sent by the user will end up on the malicious server instead of on a server owned by PayPal.

To patch the loophole mentioned above, we rely on a trusted third party to help the user verify that the public key is truly owned by the claimed party (e.g., the public key we obtained from "PayPal" is truly owned by PayPal). Such a trusted third party is called a Certificate Authority (CA). A CA verifies the ownership of a public key by issuing digital certificates, each of which is essentially a digital signature signed by the CA on a specific message sent by the public key owner. The following picture demonstrates this concept:

The most commonly seen example of digital certificate is the TLS/SSL server certificate, which is a certificate that server needs to present to the client during initial connection setup when establishing a secure connection required by TLS/SSL Protocol. The following picture shows the digital certificates used by this site, which is certified by COMODO (one of CA providers) and issued to Cloudflare, the domain name server provider:

A special message sent by the public key owner (e.g., PayPal) to obtain CA digital certificate is called Certificate Signing Request (CSR). For example, to create a private key (e.g., server-key.pem) along with CSR (e.g., server-req.pem) on Mac, one can do:

$ openssl req -newkey rsa:2048 -days 365000 -nodes -keyout server-key.pem -out server-req.pem
Generating a 2048 bit RSA private key
..............+++
.......................................+++
writing new private key to 'server-key.pem'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) []:
State or Province Name (full name) []:
Locality Name (eg, city) []:
Organization Name (eg, company) []:
Organizational Unit Name (eg, section) []:
Common Name (eg, fully qualified host name) []: Zeyuan's laptop
Email Address []:

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:

At this stage, an avid reader like you might notice that there can still exist a potential loophole: how do we know the public key the user obtained from the "CA" is truly owned by the CA? Here, the CA makes an exception: the CA issues its own digital certificate by signing it with its own private key (such a digital certificate owned by the CA is called a CA certificate). Then we rely on regulation and industry auditing to ensure CA providers can act as a trusted third party.

One concept that needs clarification is that CAs are not resources controlled only by several public providers: large organizations usually have their own CAs as part of their own public key infrastructure. For example, if a company decides to build its own private cloud, it may have its own CAs to help with internal network traffic encryption. To show a concrete example, we can generate our own CA certificate like below:

# Generate CA private key
$ openssl genrsa 2048 > ca-key.pem
Generating RSA private key, 2048 bit long modulus
........................+++
........................................................................................................................+++
e is 65537 (0x10001)

# Self-signed CA certificate
$ openssl req -new -x509 -nodes -days 365000 -key ca-key.pem -out ca-cert.pem
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) []:
State or Province Name (full name) []:
Locality Name (eg, city) []:
Organization Name (eg, company) []:
Organizational Unit Name (eg, section) []:
Common Name (eg, fully qualified host name) []:Zeyuan's laptop
Email Address []:

Normally, in an organization setting (e.g., build an internal private cloud), several concepts emerge with CA certificate: root certificate, intermediate certificate, and chain of trust. The following picture demonstrates this concept:

As one can see in the picture, the root CA certificate is issued by the root CA itself, and such a certificate is called a "root certificate". The name reflects that this CA is the root of the tree structure, i.e., the point of ultimate trust for the CA hierarchy. The CA in the middle of the picture holds an "intermediate certificate", which is a certificate signed by another intermediate CA or by a root CA. An intermediate CA can sign CSRs as if it were the root CA. A requester's (e.g., PayPal's) certificate obtained this way can be trusted because each CA in the hierarchy is validated by its parent, which can be traced all the way back to the root CA. Thus, the requester's certificate can be treated as if it were obtained directly from the root CA, the ultimately trusted party.

Taking a private cloud as an example, one usage scenario of a CA hierarchy is to establish a secure connection between an API server and a database server. The following picture shows such a setup:

There are two things to note in the picture. First, the dotted line between the CA and the API server indicates that this step is optional: the TLS/SSL protocol does not require the client (e.g., the API server) to present a certificate (i.e., one-way TLS). However, if the database (e.g., MariaDB) is configured to require two-way TLS, then the client must obtain a certificate as well. Second, authentication and encryption are separate concerns: with one-way TLS, we only care about encrypting the API server's access to the database, and we can choose a different method for authentication (e.g., user name and password).

Examples

Secure Connection with MariaDB

Now, with the conceptual model we have built so far, we can easily understand the listed steps for setting up a secure connection with MariaDB, which place both the CA and MariaDB on the same server and use two-way TLS for both authentication and encryption of database access.

SSL Certificate

For a secure connection, a web server needs to obtain an SSL certificate by sending a CSR to a CA. In addition, the web server usually also installs the intermediate certificate to establish the credibility of the SSL certificate by tying it to the CA’s root certificate. The following picture shows why this works:

With the intermediate CA's public key, the client can verify that the web server's public key and information were signed by the intermediate CA. With the root CA's public key, the client can verify that the intermediate CA's certificate was signed by the root CA. Thus, by the chain of trust, the client knows the web server's public key and information are vouched for by the root CA, and the web server's SSL certificate can be trusted.
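
To make the chain of trust concrete, here is a minimal sketch of the verification walk described above. It uses toy textbook RSA (no padding, small Mersenne primes) purely for illustration; the `Cert` layout and the names `issue`, `verify`, and `verify_chain` are my own assumptions and not a real X.509 or OpenSSL API.

```python
import hashlib
from dataclasses import dataclass

# Toy textbook RSA built from Mersenne primes. Illustration only: no padding,
# tiny keys, and nothing here should be used for real security.
def make_keypair(p, q, e=65537):
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))   # modular inverse (Python 3.8+)
    return (n, e), (n, d)               # (public key, private key)

def digest(data, n):
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

@dataclass
class Cert:
    subject: str
    issuer: str
    public_key: tuple
    signature: int = 0

    def body(self):
        # The signed content: who the subject is, who issued it, and its key.
        return f"{self.subject}|{self.issuer}|{self.public_key}".encode()

def issue(subject, subject_pub, issuer, issuer_priv):
    """The issuer signs the certificate body with its PRIVATE key."""
    cert = Cert(subject, issuer, subject_pub)
    n, d = issuer_priv
    cert.signature = pow(digest(cert.body(), n), d, n)
    return cert

def verify(cert, issuer_pub):
    """Anyone can check the signature with the issuer's PUBLIC key."""
    n, e = issuer_pub
    return pow(cert.signature, e, n) == digest(cert.body(), n)

def verify_chain(chain, trusted_root):
    """chain is ordered leaf first, root last."""
    for cert, parent in zip(chain, chain[1:]):
        if cert.issuer != parent.subject or not verify(cert, parent.public_key):
            return False
    root = chain[-1]
    # The root is self-signed; it is trusted only because we already hold it.
    return root == trusted_root and verify(root, root.public_key)

root_pub, root_priv = make_keypair(2**61 - 1, 2**89 - 1)
mid_pub, mid_priv = make_keypair(2**89 - 1, 2**107 - 1)
web_pub, _ = make_keypair(2**107 - 1, 2**127 - 1)

root_cert = issue("Root CA", root_pub, "Root CA", root_priv)        # self-signed
mid_cert = issue("Intermediate CA", mid_pub, "Root CA", root_priv)
web_cert = issue("web server", web_pub, "Intermediate CA", mid_priv)

print(verify_chain([web_cert, mid_cert, root_cert], root_cert))
```

Note that only public keys are needed on the verifying side; private keys appear only in `issue`.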

"Ceph: A Scalable, High-Performance Distributed File System"

2019-07-01T16:20:00+08:00

Problem
Background
System Design
Additional Reading

Problem

How can we design a distributed file system that:

is scalable (e.g., supports hundreds of petabytes and beyond; extreme workload case)
is flexible enough to adapt to different workloads

while maintaining good performance?

Background

Object-based storage (an abstraction layer between application and hard disks)
- Instead of hard disks, use intelligent object storage devices (OSD) (= CPU + network interface + local cache with an underlying disk or RAID)
- OSDs allow clients to read or write byte ranges to much larger (variably sized) named objects (no block-level interface)
- Distribute low-level block allocation decisions to the devices themselves
Traditional architecture
- Contact metadata server (MDS) for metadata ops + contact OSDs to perform file I/O
- Problems:
  - Single MDS is bottleneck
  - Traditional FS interface becomes legacy: allocation list, inode tables
  - OSDs can do more than just storing data

System Design

Design assumptions
- Large systems are built incrementally
- Node failures are normal
- Quality and character of workloads are shifting over time
High-level design
- Replace MDS with MDS cluster with dynamic metadata workload distribution
- Replace file allocation tables with data distribution function (e.g., CRUSH)
- Use OSDs for management of object replication, cluster expansion, failure detection, and recovery besides just data storage
Architecture
- Three components:
  - Client
  - MDS cluster: manages namespace (file names and directories); coordinate security, consistency, and coherence
  - OSDs cluster: stores data + metadata
Client
- Runs on each host executing application code
- Expose a file system interface to applications
- Can be linked directly by application or mounted as a FUSE-based file system
- File I/O:
  - Client sends a request to MDS on file open; MDS returns file info + striping strategy (i.e., how the file is mapped into a sequence of objects) + capability (i.e., permitted operations by clients)
- Synchronization:
  - Client I/O for the same file access has to be synchronized (i.e., blocked until acked by OSDs)
  - For performance-focused scenarios, allow applications to relax consistency by providing POSIX I/O interface extensions
Note

POSIX semantics require: 1. reads reflect any data previously written 2. writes are atomic (i.e., the result of overlapping, concurrent writes will reflect a particular order of occurrence)
- Namespace operations:
  - Read operations (readdir, stat) and updates (unlink, chmod) are synchronized by MDS
  - Optimize common metadata access pattern (readdir followed by stat) (trade coherence for performance)
  - Offer POSIX interface extension (statlite) for applications that don't need coherent behavior
  - Extend existing interface for performance (stat example)
Metadata management
- No file allocation metadata: object names = file inum + stripe number
- Objects distributed to OSDs using CRUSH
- Metadata storage
  - Use journals for MDSs to stream updated metadata to the OSD cluster and for MDS failure recovery
  - Inodes are embedded within directories, allowing the MDS to prefetch entire directories with a single OSD read request
  - Use anchor table to keep the rare inode with multiple hard links globally addressable by inum (avoid large but sparse inode table)
- Dynamic Subtree Partitioning
  - Adjusts to dynamic workloads (vs. static subtree partitioning) and maintains metadata locality and opportunities for metadata prefetching and storage (vs. hashing)
  - How it works:
    - Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay
    - Any operation increments the counter on the affected inode and all of its ancestors up to the root directory
    - MDS load values (i.e., counters) are periodically compared, and appropriately-sized subtrees of the directory hierarchy are migrated to maintain load balancing
- Traffic control
  - The contents of heavily read directories (e.g., many opens) are selectively replicated across multiple nodes
  - Directories that are particularly large or experiencing a heavy write workload (e.g., many file creations) have their contents hashed by file name across the cluster
  - Clients can contact MDS server directly for rare metadata and are provided different MDS node for accessing popular metadata
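
The popularity counters used by dynamic subtree partitioning can be sketched roughly as follows. This is a minimal illustration, not Ceph's implementation: the `DecayCounter` and `Node` classes, the half-life of 5 seconds, and passing time in explicitly are all assumptions made for the example.

```python
class DecayCounter:
    """Counter whose value halves every `half_life` seconds (assumed parameter)."""

    def __init__(self, half_life=5.0):
        self.half_life = half_life
        self.value = 0.0
        self.last = 0.0

    def _decay(self, now):
        elapsed = now - self.last
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.last = now

    def hit(self, now, amount=1.0):
        self._decay(now)
        self.value += amount

    def get(self, now):
        self._decay(now)
        return self.value


class Node:
    """An inode in the directory hierarchy (hypothetical toy structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.popularity = DecayCounter()


def record_op(node, now):
    """Charge one operation to the inode and to every ancestor up to the root."""
    while node is not None:
        node.popularity.hit(now)
        node = node.parent


root = Node("/")
home = Node("home", root)
paper = Node("paper.tex", home)

record_op(paper, now=0.0)   # touches paper.tex, home, and /
record_op(home, now=0.0)    # touches home and /
print(root.popularity.get(0.0), home.popularity.get(0.0), paper.popularity.get(0.0))
```

Because every operation is charged to all ancestors, the counter on a directory reflects the load of its whole subtree, which is the quantity an MDS would compare when deciding which subtree to migrate.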
Distributed object storage
- Logically as a single logical object store and namespace: Reliable Autonomic Distributed Object Store (RADOS)
- Data distribution with CRUSH
  - Distributes new data randomly; migrates a random subsample of existing data to new devices; uniformly redistributes data from removed devices
  - How it works:
    - Ceph maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs
    - PGs are assigned to OSDs using CRUSH (Controlled Replication Under Scalable Hashing): a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSDs upon which to store object replicas
      - To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster
      - Cluster map incorporates the cluster's physical or logical composition and potential sources of failure
- Replication
  - Using a variant of primary-copy replication
  - How it works:
    - Data is replicated in terms of PGs, each of which is mapped to an ordered list of $n$ OSDs (for $n$-way replication)
    - Clients send all writes to the first non-failed OSD in an object's PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs
    - After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client
    - Reads are directed at the primary
- Data safety
  - Two requirements:
    1. Low-latency updates (updates should be visible to other clients asap)
    2. Data is safely replicated after writes
  - How it works:
    - The primary forwards the update to replicas, and replies with an ack after it is applied to all OSDs' in-memory buffer caches, allowing synchronous POSIX calls on the client to return (satisfy requirement 1)
    - A final commit is sent when data is safely committed to disk (satisfy requirement 2)
- Failure detection
  - Each OSD monitors those peers with which it shares PGs (existing replication traffic as liveness signal); an explicit ping is sent if an OSD has not heard from a peer recently
  - An unresponsive OSD has its responsibility (update serialization, replication) temporarily passed to the next OSD in each of its PGs
  - An OSD that cannot be recovered is marked out of the data distribution, and another OSD joins each of its PGs to re-replicate its contents
  - A small cluster of monitors collects failure reports and filters transient or systemic problems centrally (to ensure the correctness and availability of the cluster map)
- Recovery and cluster updates
  - OSDs maintain a version number for each object and a log of recent changes for each PG
  - On cluster updates, each OSD checks its local PGs and adjusts to the new PG membership
  - Version number is used to determine the latest PG version number
  - Log is used to determine the correct PG contents
  - Once OSD has the correct PG membership, each OSD independently updates its data by contacting peers
- Object storage with EBOFS
  - OSD manages its local object storage with EBOFS (Extent and B-tree based Object File System)
  - Why new file system (instead of using ext3)?
    - POSIX interface doesn't support atomic data and metadata update transactions
    - High latency for journaling and synchronous writes
  - EBOFS
    - User-space file system
    - Update serialization (for synchronization) is different from on-disk commits (for safety)
    - Supports atomic transactions (writes and attribute updates on multiple objects)
    - update function returns when in-memory cache updated with async callbacks on commit
    - Use B-tree to locate objects, manage block allocation, and index collections (PGs)
    - Use extents (instead of block lists) for block allocation (to keep metadata compact)
    - Free block extents are binned by size and sorted by location to improve locality and avoid fragmentation
    - Metadata (except block allocation info) is kept in memory
    - Use copy-on-write
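
The two-step placement described above (object to PG via a hash and bit mask, then PG to an ordered list of OSDs) can be sketched as follows. Note the substitution: real CRUSH walks a weighted, hierarchical cluster map, whereas this sketch uses plain rendezvous (highest-random-weight) hashing as a stand-in. The object name format and all function names are hypothetical.

```python
import hashlib

def h(*parts):
    """Stable 64-bit hash of the given parts."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_name(inum, stripe_no):
    # Assumed format; the notes only say "file inum + stripe number".
    return f"{inum:x}.{stripe_no:08x}"

def object_to_pg(name, pg_bits):
    """Hash the object name, then keep the low `pg_bits` bits (the bit mask)."""
    mask = (1 << pg_bits) - 1
    return h(name) & mask

def pg_to_osds(pg, osds, replicas=3):
    """Stand-in for CRUSH: rank OSDs by a per-(PG, OSD) hash, take the top n.

    The first OSD in the returned list plays the role of the primary.
    """
    ranked = sorted(osds, key=lambda osd: h(pg, osd), reverse=True)
    return ranked[:replicas]

osds = [f"osd{i}" for i in range(10)]
name = object_name(0x1234, 7)
pg = object_to_pg(name, pg_bits=6)      # 2**6 = 64 placement groups
print(name, pg, pg_to_osds(pg, osds))
```

Any client holding the same list of OSDs computes the same placement without consulting a lookup table, which is the property the notes attribute to CRUSH. Removing an OSD that is not part of a PG's replica list leaves that PG's placement unchanged.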

Additional Reading

Paper review by Murat

"Mnemosyne: Lightweight Persistent Memory"

2019-05-07T12:20:00+08:00

Problem
Background
System Design

Problem

How can we design a programming interface for persistent memory (i.e., storage-class memory)?

Background

Storage-class memory (SCM) provides interface of memory (load and store instructions) but the persistence of disks

System Design

Expose SCM as a persistent memory abstraction to provide direct access to the durability of SCM technologies
Goals
1. User-mode access to persistence: simple for a programmer to declare data as persistent
2. Consistent updates: support consistent modifications of data structures
3. Conventional hardware: compatible with existing commodity processors
Design assumptions
- Assumptions about SCM
  - Support an atomic write of at least 64 bits
  - Possible to stall execution until a write has made it all the way to SCM (similar to fsync in FS)
- Failure models
  - Data resident in SCM survives
  - In-flight memory operations may fail
  - Atomic updates either complete or do not modify memory
Persistent regions
- Achieve 1st goal: User-mode access to persistence
- A segment of memory that is created and virtualized by the kernel but can be accessed directly from user mode
- Virtualize regions by
  - Recording the virtual-physical mapping of persistent regions in SCM
  - Swapping SCM pages to backing files that it allocates when creating a region
    - Prevent memory leaks
  - Requires persistent pointer to receive memory
  - Virtualizes persistent memory by swapping to files (leak does not reduce availability of persistent memory to other programs)
Consistent updates
- Primary mechanism to ensure consistency: ordering writes
- Four methods
  - Single variable update (atomically writing to a single variable)
  - Append updates (log)
  - Shadow updates (copy-on-write)
  - In-place updates (in-place updates B-tree)
Persistent Primitives
- Achieve 2nd goal: Consistent updates
- Low-level operations that enable programmers to implement four consistency methods
- Hardware primitives for persistent writes and write ordering (single variable update)
- Log facility (append-only updates)
- Persistent heap for allocating small blocks of memory (shadow updates)
Durable memory transactions
- Achieve 2nd goal: Consistent updates
- Support in-place updates
- Use compiler to convert C/C++ code into transactions to ensure atomicity, durability, and isolation (=> transactions allow concurrent update to different data structures)
Hardware Primitives
- Achieve 3rd goal: Conventional hardware
- Use three hardware primitives from processors
  - write-through stores: write data directly to memory rather than to the cache
  - fences: prevent subsequent writes from completing before preceding writes
  - flushes: write a cache line out to memory
Architecture

Implementation highlights
- Use a torn bit in the raw word log (RAWL) so that only one fence is needed to solve the "torn write" problem (better performance than the two-fence or checksum approaches)
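
A simplified sketch of the torn-bit idea, as I understand it: each 64-bit log word carries 63 bits of payload plus one torn bit, and the expected torn-bit value flips every time the log is reused, so stale words from the previous pass are recognizable without a separate commit record. The class and function names are hypothetical, the log is a Python list rather than SCM, and recovery here works at word granularity only.

```python
PAYLOAD_BITS = 63                       # 64-bit word = 63 payload bits + 1 torn bit
PAYLOAD_MAX = (1 << PAYLOAD_BITS) - 1

class TornBitLog:
    def __init__(self, capacity):
        self.words = [0] * capacity     # stands in for a log buffer living in SCM
        self.torn = 1                   # torn-bit value marking words of the current pass
        self.tail = 0

    def append(self, payload):
        if not 0 <= payload <= PAYLOAD_MAX:
            raise ValueError("payload must fit in 63 bits")
        if self.tail == len(self.words):
            raise BufferError("log full; truncate first")
        self.words[self.tail] = (payload << 1) | self.torn
        self.tail += 1

    def truncate(self):
        """Reuse the log from the start; old words now carry the 'wrong' bit."""
        self.tail = 0
        self.torn ^= 1

def recover(words, expected_torn):
    """Scan from the head and stop at the first word whose torn bit mismatches.

    In a real system the expected bit would be kept in persistent log metadata.
    """
    valid = []
    for word in words:
        if (word & 1) != expected_torn:
            break                        # word was never written in this pass
        valid.append(word >> 1)
    return valid

log = TornBitLog(capacity=4)
for value in (10, 11, 12):
    log.append(value)
first_pass = recover(log.words, log.torn)

log.truncate()
log.append(7)                            # a "crash" here leaves 11 and 12 as stale words
second_pass = recover(log.words, log.torn)
print(first_pass, second_pass)
```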

"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"

2019-05-06T20:24:00+08:00

Problem
System Design
- Goal
- Resilient Distributed Datasets (RDD)

Problem

How to design a system that supports in-memory computation (with real-time interaction support) in a large cluster efficiently and with fault tolerance?

Prior systems lack an abstraction for leveraging distributed memory (reusing intermediate results is a problem: they have to be saved to and then read back from the storage system)
Prior systems for data reuse do not offer an abstraction for general use (i.e., only limited computation patterns are supported) and do not support real-time interaction (e.g., load data sets into memory and query them interactively)

System Design

Goal

Provide distributed memory abstractions for clusters to support apps with working sets
Retain the attractive properties of MapReduce:
- Fault tolerance (for crashes & stragglers)
- Data locality
- Scalability

Resilient Distributed Datasets (RDD)

Generality conjecture: Spark's data flow + RDDs unifies many proposed cluster programming models (MapReduce, Dryad, SQL, Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing)
An RDD is a read-only, partitioned, logical collection of records
- Need not be materialized, but rather contains information to rebuild a dataset from stable storage
Coarse-grained transformations (e.g., map, filter, join)
- efficiently fault-tolerant: logging the transformations used to build a dataset (its lineage) rather than the actual data
- vs. actions: operations that return a value to the application or export data to a storage system (e.g., count, collect, save)
Created through transformations on 1) data in storage 2) other RDDs
User can specify which RDD to reuse and how to store it
User can partition RDD across machines on a key
vs. distributed shared memory (DSM)

Can use the persist method to indicate which RDDs the user wants to reuse in future operations and to specify persistence priority (e.g., which in-memory data should be spilled to disk first) and persistence strategy (e.g., store RDD on disk instead of memory; replication)
Target application: iterative algorithms and iterative data mining tools
Limitation: not suitable for applications that make asynchronous fine-grained updates to shared state (e.g., storage system for a web app or an incremental web crawler)
Representing RDD by exposing information:
- partitions of datasets
- dependencies on parent RDDs (narrow vs. wide) (see figure below)
- partition-based function for computing the dataset
- metadata on partitioning scheme and data placement
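
The split between lazy, coarse-grained transformations and eager actions, plus recovery through lineage, can be illustrated with a small single-machine toy. This is not Spark's API: `ToyRDD` is a hypothetical class with a single partition, and "stable storage" is modeled as a callable that can be re-read at any time.

```python
class ToyRDD:
    """Read-only dataset described by how to compute it, not by its contents."""

    def __init__(self, source=None, parent=None, fn=None, op="source"):
        self.source = source    # callable that re-reads data from "stable storage"
        self.parent = parent    # the RDD this one was derived from
        self.fn = fn            # coarse-grained function applied to the parent's records
        self.op = op            # name of the transformation, kept for the lineage

    # Transformations: lazy, they only extend the lineage.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda it: (f(x) for x in it), op="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)), op="filter")

    def _compute(self):
        if self.parent is None:
            return iter(self.source())
        return self.fn(self.parent._compute())

    # Actions: trigger the computation and return a value to the application.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def lineage(self):
        ops, node = [], self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

numbers = ToyRDD(source=lambda: range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.lineage())
print(even_squares.collect())
```

Nothing is materialized until `collect` or `count` runs, and calling an action twice simply replays the lineage from the source, which is the same mechanism that would rebuild a lost partition.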

"GPUfs: Integrating a File System with GPUs"

2019-05-06T12:20:00+08:00

Problem
Background
System Design

Problem

GPU data access needs explicit management by each individual GPU program. How can we provide an illusion, similar to what virtual memory provides over physical pages, such that all data access happens automatically (i.e., without stating explicitly how data should be fetched from host storage)? Clearly, a file system is a good abstraction, but how can we adapt such an interface to the needs of GPU applications while maintaining good performance?

Background

Motivation
- GPUs have no direct access to files on the host OS file system; developers manage data movement explicitly between CPU main memory and GPU local memory
- Offloading computations to GPU (common application) under GPU-as-coprocessor programming model introduces overhead due to the GPU kernel start/termination
About GPU
- GPUs are multicore processors
- Each core, called a multiprocessor (MP), features a wide SIMD vector unit, which a hardware scheduler multiplexes between multiple execution contexts
- Thread: a GPU's basic sequential unit of execution (higher execution units: warps, threadblock, kernel)

System Design

Application point of view
Architecture:
- GPU programs can access the host's file system via a GPUfs library linked into the application's GPU code
- GPUfs library works with the host OS on the CPU to coordinate the file system's namespace and data
Design principles:
- Data parallelism
  - All application threads in a warp must invoke the same GPUfs call, with the same arguments, at the same point in application code
  - Minimize per-open file state (e.g., remove seek pointers)
  - Separate sync from close (i.e., close call will not trigger sync)
  - Constrain mmap semantics to avoid the need for complex memory management on critical data parallel paths
- Access locality
  - Implements a weak consistency model similar to the private workspace model in AFS (i.e., sync-to-open semantics)
  - Local file modifications propagate to main CPU memory only when the application explicitly syncs the file with storage
  - Modifications become visible to other GPUs when they re-open the file
  - Allow concurrent non-overlapping writes to the same file (i.e., undefined results for overlapping writes)
API design highlights
- Open:
  - All GPUfs threads opening the same file obtain a single shared file descriptor (increments reference counts on file descriptor)
- Read and write:
  - File descriptors have no seek pointers
- Close and sync:
  - gclose does not propagate locally written data back to the CPU, or to other GPUs, until the application explicitly synchronizes file data by calling sync
- File mapping:
  - No guarantee to map the entire requested file region
  - No guarantee on mapping at a particular address (MAP_FIXED is unsupported)
  - No guarantee that the mapping has exactly the requested permissions (asking for read-only may return a read/write mapping)
  - Benefits: allow GPUfs to give the application pointers directly into GPU-local buffer cache pages, residing in the same address space as the application's GPU code
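
The sync-to-open semantics above can be mimicked with a small toy model. This is only a host-side Python sketch of the visibility rules, not GPU code and not the GPUfs API: `Host`, `ToyGpu`, and their method names are hypothetical.

```python
class Host:
    """Host-side file contents (stands in for the CPU buffer cache / storage)."""

    def __init__(self):
        self.files = {}


class ToyGpu:
    def __init__(self, host):
        self.host = host
        self.cache = {}      # path -> bytearray, the GPU-local buffer cache
        self.dirty = set()   # paths with local writes that were not synced yet

    def open(self, path):
        # Opening is the point where changes synced by others become visible.
        if path not in self.dirty:
            self.cache[path] = bytearray(self.host.files.get(path, b""))
        return path

    def write(self, path, offset, data):
        # Explicit offset: there is no seek pointer in the file state.
        buf = self.cache[path]
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))
        buf[offset:end] = data
        self.dirty.add(path)

    def read(self, path, offset, size):
        return bytes(self.cache[path][offset:offset + size])

    def sync(self, path):
        # Only an explicit sync pushes local modifications back to the host.
        self.host.files[path] = bytes(self.cache[path])
        self.dirty.discard(path)

    def close(self, path):
        pass                 # deliberately does NOT sync


host = Host()
gpu1, gpu2 = ToyGpu(host), ToyGpu(host)
f = "data.bin"

gpu1.open(f)
gpu2.open(f)
gpu1.write(f, 0, b"hi")
gpu1.close(f)
gpu2.open(f)
before_sync = gpu2.read(f, 0, 2)    # gpu1 closed but never synced

gpu1.sync(f)
stale = gpu2.read(f, 0, 2)          # synced, but gpu2 has not re-opened yet
gpu2.open(f)
after_reopen = gpu2.read(f, 0, 2)
print(before_sync, stale, after_reopen)
```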
Implementation
- Top layer:
  - Runs in the context of the application's GPU kernels and maintains its data structures in GPU memory
  - Implements the GPUfs API, tracks open file state, and implements buffer cache and paging
- Communication layer:
  - Manages GPU-CPU communications
  - Data structures shared between the GPU and CPU are stored in write-shared CPU memory accessible to both devices
  - Implements a GPU-CPU RPC infrastructure (needs hardware support: GPU-CPU memory fences; GPU cache bypass)
- Consistency layer:
  - OS kernel module running on the host CPU, which manages consistency between the host OS's CPU buffer cache and the GPU buffer caches
On-demand data transfer
Pros
- Simple CPU code
- Handle data that is greater than GPU memory (use buffer cache)
- Enable long-running kernels
- Pay-as-you-go performance
Cons
- Memory overhead
- Register pressure
- Idiosyncratic API

"Dandelion: a Compiler and Runtime for Heterogeneous Systems"

2019-05-06T10:00:00+08:00

Problem
System Design
- Overview
- Architecture
  - Dandelion Compilers
  - Dandelion Runtime

Problem

How to design a system that provides programmability for heterogeneous distributed systems?

Challenges are
- Heterogeneous: different programming models, architecture expertise
- Distributed resources: data movement, scheduling
- Concurrency: synchronization, consistency

System Design

Overview

Goals: make it simple for programmers to write high-performance applications for heterogeneous systems on a small cluster with GPUs and leverage its available resources
- Single programming interface for clusters with CPUs, GPUs, FPGAs, etc
- "Single machine" programming model: programmer writes sequential code
- Runtime: take sequential program and do the following whenever possible
  - Parallelize computation
  - Partition data
  - Runs on all available resources
  - Maps computation to best architecture
Workflow
1. Given User program & partitioned data files as input
2. Compile to a mix of CPU and GPU code and run on the cluster (Dandelion role)
3. Output result as partitioned data files
Challenges
- Simple programming model
- Integrate multiple runtimes efficiently to enable high performance
Dandelion innovation by levels
- Programming languages
  - Adopt language integration approach (LINQ): extends with Dandelion specific operators
  - Constraints: UDF must be side-effect free; execute .NET function with dynamic memory allocation on CPUs only
- Compilers
  - Automatically compiles a data-parallel program to run on distributed heterogeneous systems
  - Translation of .NET byte-code to multiple backends (e.g., GPU, FPGA)
- Distributed and parallel runtime
  - Treat a heterogeneous system as the composition of a collection of dataflow engines
  - Three dataflow engines: cluster, multi-core CPU, and GPU (e.g., use PTask)

Architecture

Two main components
- Dandelion compiler generates the execution plans and the worker code to be run on the CPUs and GPUs of cluster machines
- Dandelion runtime uses the execution plans to manage the computation on the cluster (e.g., scheduling, distribution)

Dandelion Compilers

Dandelion compiler generates CUDA code, and three levels of dataflow graphs to orchestrate the execution
Relies on a library of generic primitives (GPU primitive library) to construct execution plans
- GPU primitive library: for GPU dataflow graph
  - primitives include low level building blocks (e.g., parallel scan and hash tables), high-level primitives for relational operators (e.g., groupby and join)
Compiling C# to GPU code
- Translation performed at .NET byte-code level
  - Map C# types to CUDA structs
  - Translate C# methods into CUDA kernel functions
  - Generate C# code for CPU-GPU serialization/transfer
- Main constraint: dynamic memory allocation
  - Convert to stack allocation if object size can be inferred
  - Fail parallelization, fallback to host otherwise
- Use Common Compiler Infrastructure (CCI) framework for the intermediate AST

Dandelion Runtime

Dataflow graphs
- Vertex: a fragment of the computation
- Edge: communication channels
- Three levels:
  - cluster level (what machine to compute what): cluster execution engine assigns vertices to available machines and distributes code and graphs, orchestrating the computation
  - machine level (how the computation is done on each machine): executes its own dataflow graph, managing input/output and execution threads
  - GPU level (use PTask as GPU dataflow engine)
Three dataflow graphs (shown above) form the Dandelion runtime, and the composition of those graphs forms the global dataflow graph for the entire computation
Cluster dataflow engine ("Moxie")
- Allows the entire computation to stay in memory when the aggregate cluster memory is sufficient (assume work on a small cluster with powerful machines with GPUs)
- Holds intermediate data in memory and can checkpoint them to disk (like Spark)
- Aggressively caches in-memory datasets (including intermediate data)
- Uses async. checkpoints to support coarse-grained fault tolerance
Machine dataflow engine
- Vertex: a unit of computation can be executed on CPU or GPU
- For CPU: parallelize the computation on multi-core
- For GPU: dispatch computation to GPU dataflow engine
- Async channels are created to transfer data between the CPU and GPU memory spaces
GPU dataflow engine
- Use PTask execution engine
  - a token model dataflow system
  - tasks: computation or nodes in the graph
  - ports: inputs and outputs of the tasks
  - Channels: connects ports
  - datablocks: data deserialized into chunks that are moved through channels
  - Computation is done by pushing and pulling datablocks to and from channels (Future)
  - A task is ready for execution when all of its input ports have available datablocks and all of its output ports have capacity
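
The firing rule of a token-model dataflow engine (a task runs only when every input has a datablock and every output has room) can be sketched in a few lines. This is a generic single-threaded toy, not the PTask API; `Channel`, `Task`, and `run` are hypothetical names, and ports are folded into the channels for brevity.

```python
from collections import deque

class Channel:
    """Bounded queue of datablocks connecting an output port to an input port."""

    def __init__(self, capacity=1):
        self.queue = deque()
        self.capacity = capacity

    def has_data(self):
        return len(self.queue) > 0

    def has_room(self):
        return len(self.queue) < self.capacity

    def push(self, block):
        self.queue.append(block)

    def pull(self):
        return self.queue.popleft()


class Task:
    def __init__(self, name, fn, inputs, outputs):
        self.name, self.fn = name, fn
        self.inputs, self.outputs = inputs, outputs

    def ready(self):
        # Ready only if all inputs hold a datablock and all outputs have capacity.
        return (all(c.has_data() for c in self.inputs)
                and all(c.has_room() for c in self.outputs))

    def fire(self):
        args = [c.pull() for c in self.inputs]
        result = self.fn(*args)
        for c in self.outputs:
            c.push(result)


def run(tasks):
    """Keep firing ready tasks until no task can make progress."""
    progress = True
    while progress:
        progress = False
        for task in tasks:
            if task.ready():
                task.fire()
                progress = True


def build(out_capacity):
    a, b = Channel(4), Channel(4)
    mid, out = Channel(1), Channel(out_capacity)
    for x, y in ((1, 10), (2, 20), (3, 30)):
        a.push(x)
        b.push(y)
    tasks = [Task("add", lambda x, y: x + y, [a, b], [mid]),
             Task("double", lambda x: 2 * x, [mid], [out])]
    return tasks, mid, out

tasks, mid, out = build(out_capacity=4)
run(tasks)
print(list(out.queue))
```

With a smaller output capacity the downstream task stops firing once its output is full, and the upstream result stays parked in the intermediate channel; this backpressure falls out of the readiness rule alone.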

"Operating System Transactions"

2019-04-27T12:20:00+08:00

Problem
Background
System Design

Problem

How to implement transactions with ACID guarantees in the OS to provide concurrency control?

Background

Problems that can be solved by using transactions:
- Security vulnerabilities in the file system that are caused by time-of-check-to-time-of-use (TOCTTOU) race conditions
- Unsuccessful software installation is hard to roll back without disturbing concurrent, independent updates to the file system
- Consistency problem in managing local user accounts
- Have to use a database for concurrency management and crash consistency, which leads to more complex application code and system administration

System Design

Overview

System transactions allow programmers to group accesses to system resources into logical units that are executed with ACID guarantee
Use sys_xbegin() and sys_xend() to wrap around code regions with consistency constraints
Use sys_xabort() to abort in-progress transaction
Accesses and modifications to system resources are kept isolated until commit time

System transactions

System transactions provide ACID semantics for updates to OS resources (e.g., files, pipes, and signals)
OS is responsible for ensuring that transactional & non-transactional access to system state are correctly serialized and contention is arbitrated fairly
To ensure isolation, kernel enforces the invariant that a kernel object may only have one writer at a time (except containers)
- Concurrent system transactions that modify the same kernel object cannot commit (i.e., one of them has to abort)
- Non-transactional updates to objects read or written by an active system transaction are also forbidden
Durability is optional (for performance concern)
Provides strong isolation: serialization of transactional and non-transactional updates to the same resources
OS always chooses the same transaction to restart when a conflict involves the same set of transactions (to prevent livelock)
ACID is guaranteed for system state, not application state
Communication outside of a transaction violates isolation and is thus not supported (e.g., IPC between a thread executing a system transaction and a non-transactional thread)
Communication among threads within the same transaction is unrestricted

TxOS

Design

High-level overview:
- Application updates usually go to OS buffer first; copy-on-write these buffers for transactions to support isolation
- Transactional updates to kernel data structures are supported through object-based software transactional memory systems
- Isolation mechanism is optimistic: assume conflicts are rare
Interoperability and fairness
- TxOS provides strong isolation inside the kernel by requiring all system calls to follow the same locking discipline and by requiring transactions to annotate the kernel objects they access (all threads, whether transactional or not, need to check for conflicts on first access)
- The scheduler can maintain fairness between non-transactional and transactional threads because conflicts are detected before a thread enters a critical region, so suspending a non-transactional thread is possible
Managing transactional state
- Lazy version management: transactions operate on private copies of a data structure (vs. eager version management: in-place data update + undo log)
- TxOS holds a lock only to make a private copy of the data structure (enforcing a global ordering of kernel locks avoids deadlock)
- Commit latency is a challenge for lazy version management: TxOS minimizes this overhead by splitting objects, turning a memcpy of the entire object into a pointer copy
TxOS provides transaction to system states; User-level transactional memory (TM) system provides transaction to application states (integrate together to provide a complete transactional programming model)

Implementation

Versioning data:
- TxOS maintains multiple versions of kernel data structures to isolate the effects of system calls until transactions commit and to make aborts easy
- Split objects into header and data so that TxOS maintains the invariant: single writer to any object (i.e., concurrent writes must be on disjoint objects) (see Figure 2 below)
  - Object header contains a pointer to object's data; transactions commit changes to an object by replacing this pointer in the header to a modified copy of data object
  - Header itself is never replaced by a transaction (eliminates the need to update pointers in other objects)
  - Pros:
    - No need to recursively update pointers in other objects
    - Avoid restarting active transactions triggered by the kernel garbage collection thread
    - Ensure that transactional code always has a speculative object
- TxOS decomposes an object into multiple data payloads when it houses data that can be accessed disjointly
- No copies are made for kernel objects that are read-only in transactions
  - Writer creates a new object copy when the transactional reader reference count is non-zero and installs it as the new stable version
  - Old copy is collected via read-copy update (RCU)
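
The header/data split and lazy version management can be sketched as follows. This is a user-space toy under strong simplifications: no locking, no reader tracking, conflicts simply raise an exception instead of being arbitrated, and the names `Header`, `Transaction`, and `Conflict` are hypothetical rather than TxOS identifiers.

```python
import copy

class Conflict(Exception):
    pass


class Header:
    """Stable header: other objects point here, and it is never replaced."""

    def __init__(self, data):
        self.data = data       # pointer to the current stable data object
        self.writer = None     # invariant: at most one transactional writer


class Transaction:
    def __init__(self):
        self.workset = {}      # header -> private (shadow) copy of its data

    def write(self, header):
        if header.writer is not None and header.writer is not self:
            raise Conflict("object already has a transactional writer")
        header.writer = self
        if header not in self.workset:
            # Lazy versioning: work on a private copy, leave stable data alone.
            self.workset[header] = copy.deepcopy(header.data)
        return self.workset[header]

    def read(self, header):
        return self.workset.get(header, header.data)

    def commit(self):
        for header, shadow in self.workset.items():
            header.data = shadow   # pointer swap instead of copying the object
            header.writer = None
        self.workset.clear()

    def abort(self):
        for header in self.workset:
            header.writer = None   # shadow copies are simply dropped
        self.workset.clear()


inode = Header({"size": 0})
tx = Transaction()
tx.write(inode)["size"] = 42
visible_before_commit = inode.data["size"]   # still the stable version
tx.commit()
visible_after_commit = inode.data["size"]
print(visible_before_commit, visible_after_commit)
```

Anything that held a reference to `inode` before the commit still holds a valid reference afterwards, since only the pointer inside the header changed.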

Conflict detection and resolution
- tx_data object (part of transaction-supported kernel objects) contains a pointer to a transactional writer list and a pointer to a transactional reader list (non-null indicates there is an active transactional writer/reader)
- Use locks and tests on the transactional writer and reader fields to detect transactional and asymmetric conflicts (conflicts that involve a transactional and a non-transactional thread)
- Selects the process with the higher scheduling priority as the winner of a conflict (if priorities are equal, the older transaction wins)
- Non-transactional thread is preempted when an asymmetric conflict is detected and rescheduled when the conflicting transaction is committed
- Table 3 shows the transactional list state that minimizes the conflicts involving updating linked list data structure
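
The conflict resolution rule above is small enough to write down directly. A minimal sketch, assuming that a larger `priority` number means a higher scheduling priority and a smaller `start_time` means an older transaction; the `Tx` record and `conflict_winner` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Tx:
    name: str
    priority: int        # assumption: larger number = higher scheduling priority
    start_time: float    # assumption: smaller value = older transaction

def conflict_winner(a, b):
    """Higher priority wins; on a tie, the older transaction wins."""
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    return a if a.start_time <= b.start_time else b

installer = Tx("installer", priority=5, start_time=10.0)
backup = Tx("backup", priority=5, start_time=3.0)
editor = Tx("editor", priority=9, start_time=20.0)

print(conflict_winner(installer, backup).name)
print(conflict_winner(backup, editor).name)
```

Because the rule is deterministic, the same pair of transactions always yields the same winner, which matches the livelock-avoidance point made earlier.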

Managing transaction state
- Use transaction objects to store metadata and statistics for a transaction; the kernel thread's control block points to it
- Transaction status (status) is checked and updated atomically
- Abort is invoked when conflict is detected during transactional system call:
  - Transaction stores register state on the stack at the beginning of the current system call
  - On abort, register state is restored and execution jumps back to the top of the kernel stack
- Need to keep track of deferred operations (postponed until commit time) (e.g., free memory, deliver signals, file system monitoring events)
- A workset maintains references to all kernel objects that the transaction has made a copy of

Commit protocol
1. Transaction acquires locks for all items in its workset
2. TxOS iterates over the objects in the workset twice: 1) acquire the blocking locks 2) acquire the non-blocking locks
3. Transaction checks its status to see if it can commit
4. Transaction copies its updates to the stable objects
5. Release spinlocks
6. Perform deferred operations
7. Release mutex
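
To make the order of the steps concrete, here is a minimal sketch of the commit path. It is a simplification under stated assumptions: `KObject` and `Transaction` are hypothetical stand-ins, a single `threading.Lock` per object replaces the separate blocking and non-blocking lock classes, and locks are taken in name order as one possible way to avoid deadlock between two committers.

```python
import threading


class KObject:
    """Hypothetical stand-in for a kernel object with a stable value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.status = "active"  # a conflicting thread may set this to "aborted"
        self.workset = {}       # KObject -> shadow (speculative) value
        self.deferred = []      # callables run only if the commit succeeds

    def write(self, obj, value):
        self.workset[obj] = value

    def commit(self):
        # Lock every object in the workset in one global order.
        objs = sorted(self.workset, key=lambda o: o.name)
        for o in objs:
            o.lock.acquire()
        try:
            if self.status != "active":
                return False  # lost a conflict; the caller runs the abort path
            for o in objs:
                o.value = self.workset[o]  # publish the shadow copy as stable
            self.status = "committed"
        finally:
            for o in reversed(objs):
                o.lock.release()
        for op in self.deferred:  # e.g., free memory, deliver signals
            op()
        return True
```

Deferred operations run only after the updates are published, which is why they must be tracked separately from the workset.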

Abort protocol
1. Release any kernel locks
2. Iterate over the workset and lock each object to remove any references to the transaction from the object's transactional state, then unlock the object
3. Transaction frees its shadow objects and decrements the reference count on their stable counterparts
4. Release any resources used in transaction (e.g., memory allocation)
User-level transactions
- Commit protocol
  1. The user prepares a Transaction
  2. The user requests that the system commit the transaction through sys_xend()
  3. System commits or aborts
  4. System communicates the result to user as sys_xend() return code
  5. User commits or aborts accordingly
Details on multi-process transactions and signal delivery can be found in the paper

"Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism"

2019-04-13T12:20:00+08:00

Problem
Background
System Design
Reference

Problem

User-level library
- Management in application's address space
- High performance and very flexible
- Lack functionality
Operating system kernel
- Poor performance (compared with user-level threads)
- Poor flexibility
- High functionality

How to design a parallelism mechanism (e.g., kernel interface + user-level thread package) that combines the functionality of kernel threads with the performance and flexibility of user-level threads?

Background

Theme for supporting concurrent and parallel programming

Conform to application semantics
Respect priorities of applications
No unnecessary blocking
Fast context switch
High processor utilization

"Heavyweight" Process Model

Pros:
- Simple, uni-threaded model
- Security provided by address space boundaries
Cons:
- High cost for context switch
- Coarse granularity limits degree of concurrency

"Lightweight" User-level Threads

Pros:
- Thread semantics defined by application: different applications can be linked with different user-level thread libraries
- Fast context switch time: within an order of magnitude of procedure call time
- No kernel intervention (i.e., high performance)
- Good scheduling policy flexibility: done by the thread lib
Cons:
- Unnecessary blocking: A blocking system call, I/O, page faults, and multiprogramming block all threads (i.e., lack of functionality)
- System scheduler unaware of user thread priorities
- Processor under-utilization (i.e., hard to run as many threads as CPUs) because the thread library doesn't know how many CPUs there are or when a thread blocks

Kernel Threads

Pros:
- No system integration problems (system calls can be blocking calls) (i.e., high functionality)
  - Handle blocking system calls/page faults well
- Threads are seen and scheduled only by the kernel
Cons:
- Adds many user/kernel crossings (i.e., low performance) (e.g., thread switch, create, exit, lock, signal, wait, ...)
  - Typically 10x-30x slower than user threads
- Every thread-related call traps: extra kernel trap, plus copying and checking of all parameters on all thread operations
- Single, general purpose scheduling algorithm (i.e., lack of flexibility)
- Thread semantics defined by system
- Context switch time better than process switch time by an order of magnitude, but an order of magnitude worse than user-level threads
- System scheduler unaware of user thread state (e.g., in critical section) leading to blocking and lower processor utilization

User-level threads multiplexed on kernel threads

Question: Can we accomplish system integration by implementing user-level threads on top of kernel threads?
No:
- Different apps have different needs (thread priorities, etc)
- Insufficient visibility between the kernel and user thread library
  - Kernel doesn't know best thread to run
  - Kernel doesn't know about user-level locks, priority inversion (preempt while in critical section): too much info changing too quickly to notify kernel
  - kernel events (preemption, I/O) invisible to user library (user thread blocks, then kernel thread serving it also blocks)
- Kernel threads are scheduled obliviously w.r.t user-level thread library
- Hard to keep same number of kthreads as CPUs
  - Neither kernel nor user knows how many runnable threads
  - User doesn't even know number of CPUs available
- Can have deadlock

System Design

Key problem:
- Application has knowledge of the user-level thread state but has little knowledge of or influence over critical kernel-level events (this is by design to achieve the virtual machine abstraction)
- Kernel has inadequate knowledge of user-level thread state to make optimal scheduling decisions
  - Kernel may deschedule a thread at a bad time (e.g., while holding a lock)
  - Application may need more or less computing
Solution:
- A mechanism that facilitates exchange of information between user-level and kernel-level mechanisms

Note

A general system design problem: communicating information and control across layer boundaries while preserving the inherent advantages of layering, abstraction, and virtualization.

What is a SA?
- Execution context for running user-level threads
- Notifies the user-level thread system of kernel event
- Provides space for the kernel to save processor context

Note

A scheduler activation is the execution context for vectoring control from the kernel to the address space on a kernel event. The address space thread scheduler uses this context to handle the event, e.g., to modify user-level thread data structures, to execute user-level threads, and to make requests of the kernel.

Scheduler Activations (SA) structure

SA basics
- Multi-threaded programs given an address space (as usual)
- Facilitate flow of key information between user space and kernel space
- Kernel explicitly "vectors" kernel events to the user-level thread via SA
- Extended kernel interface for processor allocation-related events
  - User-level thread library notifies the kernel about events
  - Kernel uses the SA itself to do the same
- SA has two execution stacks
  - The kernel stack - used by the user-level thread when running in the kernel mode (e.g., system call)
  - The user stack - used by the user-level thread scheduler
- Each user-level thread is given a separate stack so that the thread scheduler can resume running if a user-level thread blocks
- The kernel-level SA communicates with the user-level library by upcalls
- When must kernel call into user-space? (Table II)
  - New processor available
  - Processor had been preempted
  - Thread has blocked
  - Thread has unblocked
- When must user call into kernel? (Table III)
  - Need more CPUs
  - CPU is idle
  - Preempt a thread on another CPU (for a higher priority thread)
  - Return unused SA for recycling (after user-level thread system has extracted necessary state)
- SA role: there is one running SA for each processor assigned to the user process
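
The four kernel-to-user upcalls listed above can be sketched as a toy user-level scheduler. This is an illustration only: the class, the event names, and the `upcall` signature are my own, and the model is simplified (for example, the preemption upcall is handled as if delivered for the lost processor itself, whereas the real system delivers it on another processor).

```python
from collections import deque


class UserScheduler:
    """Toy user-level thread scheduler driven by kernel upcalls."""

    def __init__(self, threads):
        self.ready = deque(threads)  # runnable user-level threads
        self.running = {}            # processor id -> thread name
        self.blocked = set()

    def _dispatch(self, cpu):
        # Run the next ready thread on cpu, or leave the cpu idle.
        if self.ready:
            self.running[cpu] = self.ready.popleft()
        else:
            self.running.pop(cpu, None)

    def upcall(self, event, cpu, thread=None):
        if event == "processor_added":
            self._dispatch(cpu)
        elif event == "thread_blocked":
            # The thread on `cpu` blocked in the kernel; reuse the processor.
            self.blocked.add(self.running.pop(cpu))
            self._dispatch(cpu)
        elif event == "thread_unblocked":
            # The upcall interrupted whatever ran on `cpu`, so both threads
            # return to the ready list and the scheduler chooses again.
            self.blocked.discard(thread)
            self.ready.append(thread)
            if cpu in self.running:
                self.ready.append(self.running.pop(cpu))
            self._dispatch(cpu)
        elif event == "processor_preempted":
            # `cpu` was taken away; its thread becomes runnable elsewhere.
            self.ready.append(self.running.pop(cpu))
        else:
            raise ValueError(event)
```

The point of the sketch is that every scheduling decision stays in user space; the kernel only reports what happened to processors.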

SA lifecycle
- On program start:
  - New SA is created, assigned a processor and "upcalled" (fixed entry point)
  - User-level thread scheduler initializes and runs on this SA
- Kernel uses SA to notify the user-level about important events: preemption, I/O, page faults
Avoiding effects of blocking

Resume blocked thread

I/O request/completion

Responsibility division between kernel and application address space:
- Processor allocation (the allocation of processors to address space) is done by the kernel.
- Thread scheduling (the assignment of an address space's threads to its processors) is done by each address space.
- The kernel notifies the address space thread scheduler of every event affecting the address space.
- The address space notifies the kernel of the subset of user-level events that can affect processor allocation decisions.
SA vs. kernel threads key differences
- Preempted threads never resumed by the kernel directly (rather, indirectly through an SA)
- A traditional kernel:
  - Directly resumes the kernel thread
  - Does NOT notify the user-thread about preemption
  - Does NOT notify the user-thread about resumption
Critical section
- Problem: threads preempted while holding a lock, which can lead to deadlock
  - User-level thread holds lock on the program's ready list
  - Gets preempted
  - Thread scheduler tries to put that thread in the ready list on SA upcall
- Solution: recovery (detect the problem and recover when it occurs)
  - Thread scheduler checks to see if the thread was executing in a critical section
  - If so, the thread is continued temporarily via a user-level context switch. When the continued thread exits the critical section, it relinquishes control back to the original upcall, again via a user-level context switch.
Others
- Page faults must be handled carefully: the kernel must notify the program of a page fault only when the page fault is serviced
- User-level thread after blocking might still execute in kernel (e.g., I/O completing): the kernel notifies the user-level only after the user thread is in a "safe" point
- Every SA needs a processor to do the up-call on: at time T3 of I/O image, when I/O completes, kernel must notify the user-level thread of the event via SA, and this notification requires a processor
- Application free to build any concurrency model on SAs

Reference

"Jitsu: Just-In-Time Summoning of Unikernels"

2019-03-13T21:20:00+08:00

Problem
Background
System Design

Problem

How to build a system that is able to securely manage multi-tenant networked applications on embedded infrastructure?

Goals:
- High density/scalability
- Fast boot
- Lightweight
- VM-level isolation
- "Embedded cloud"

Background

OS are traditionally designed to run on a wide range of hardware, and support a variety of applications. But no longer true!
- Hypervisors in the cloud provide virtual hardware abstractions
- Many modern applications are single purpose microservices
Container:
- Think of as a lightweight VM
- Separate process space, network interface
- Setuid/root access possible
- Share kernel with host (thus, no I/O emulation, VM overheads)
- chroot, cgroups
- Pros:
  - Achieves much of VM charter
  - Separation of concerns: Dev (inside container), Ops (outside)
  - Lightweight, good deployment unit
- Cons:
  - Limited compatibility
  - Limited isolation
- Container vs. VM:

Problem with layers in existing solution:
- Complex configuration management
- Duplication leads to inefficiency
- Image size leads to long boot time
- All the layers lead to a large attack surface

Unikernels
- Pros:
  - Lightweight (fast, IoT-amenable)
  - High consolidation ratios
  - Small attack surface
  - Type safety (safety in general)
  - Minimize multi-RM pathologies
  - Small binaries (host in git)
- Cons:
  - Increased pressure on (cloud) scheduler
  - Threading
  - Cross-domain communication
  - Compatibility
- Mirage Unikernel:
  - OS is a collection of modules (libs) with types (API)
  - Written in OCaml
  - Compact enough to boot/respond to network traffic in real-time

System Design

Jitsu: Unikernels on demand
- Capture system dependencies in code/compile them away
- Swap system libraries to target different platforms
- Dev/Test on UNIX, deploy specializes to Xen

Jitsu Architecture:

"Rethinking the Library OS from the Top Down"

2019-03-13T20:20:00+08:00

Problem
Background
System Design

Problem

How can we refactor a commercial OS to follow the libOS architecture and achieve better performance than the VMM approach?

Background

Three categories of services in OS implementations:

Why does a guest OS need a kernel? Because the host interface is virtual hardware

System Design

Goals:
- Compatibility: Runs applications you use
- Lightweight: <1% of Windows library code
- Performance: 10x to 20x lower overheads than a VM
- Security: Secure isolation comparable to VM
- Mobility: Migrate running applications
- Generality: Independent evolution of host OS
- Manageability: Smaller "servicing" area
Hypothesis: it's possible to design a software ABI with the same properties as hardware ABI:
- Clearly specified, clean separation of concerns (No undocumented dependencies)
- Minimally stateful:
  - Registers, etc. are visible to guest OS/application
  - State can be programmatically recreated
  - Analogous to a stateless network protocol
Guest/Host ABI:
- Private virtual memory
- Threads, synchronization
- I/O streams
- Thread/process exit
- Time, random bits, handle reference management, checkpoint/restore
Refactoring the Desktop:
- Host OS manages hardware
- Application services in library
- Desktop manager
  - Trusted host process
  - Remote Desktop protocol
  - Stateless
  - Shell can be remote

Drawbridge Architecture:

Limitations:
- Incomplete port of windows API
  - Printer support
  - Accelerated graphics
- Support for multi-process applications (e.g., Outlook with Word as an editor, sharing state through windows subsystem)
- Administrative tools will not work by design
  - Need more low-level system access
End results:
- Refactored Windows 7 as a Library OS (80MB)
- Functional benefits of VMs:
  - Robust to changes in host system software
  - Security isolation
  - Migration
- Drastically better scalability
- Run rich desktop applications

"Arrakis: The Operating System is the Control Plane"

2019-03-13T17:20:00+08:00

Problem
Background
System Design
Remarks

Problem

How can we design an OS for I/O intensive applications such that most I/O operations do not need kernel mediation?

Background

The authors make a classic efficiency argument: servers usually perform conceptually simple operations, but in practice this results in too much operating system overhead.

System Design

The new secret sauce is that hardware device virtualization allows user-level programs to get efficient access to I/O without compromising protection.
I/O centric design:
- Bypass kernel
- Abstractions: user-space device access
- Single-Root I/O Virtualization (SR-IOV) is the hardware secret sauce. It allows software to setup flexible hardware multiplexing.
Kernel functionality re-divide in Arrakis:

Architecture:

Hardware Model:
- NICs (Multiplexing, Protection, Scheduling)
- Storage
  - VSIC (Virtual Storage Interface Controller): each with queues, etc
  - VSA (Virtual Storage Areas): mapped to physical devices; associated with VSICs; VSA & VSIC: many-to-many mapping
Control Plane Interface:
- VIC (Virtual Interface Card): Apps can create/delete VICs, associate them to doorbells
- doorbells associated with events on VICs
- filter creation (e.g., create_filter(rx,*,tcp.port == 80))
- Features:
  - Access control: enforced by filters; infrequently invoked (during setup, etc)
  - Resource limiting: send commands to hardware I/O schedulers
  - Naming: VFS in kernel; actual storage implemented in apps
Network Data Interface:
- Apps send/receive directly through sets of queues
- Filters applied for multiplexing
- Doorbell used for asynchronous notification (e.g., packet arrival)
- Both native (w/ zero-copy) and POSIX are implemented
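
As a rough mental model of how filters and doorbells fit together, here is a software sketch of the demultiplexing that the NIC is meant to do in hardware. Everything here is hypothetical (`VirtualNIC`, the predicate-style filters, dictionary packets); it is not the Arrakis API, only an illustration of "filters decide which queue a packet lands in".

```python
from collections import deque


class VirtualNIC:
    """Toy model of demultiplexing packets into per-application queues."""

    def __init__(self):
        self.filters = []  # (predicate, queue, doorbell) in creation order

    def create_filter(self, predicate, doorbell=None):
        # Control-plane operation: done rarely, at setup time.
        queue = deque()
        self.filters.append((predicate, queue, doorbell))
        return queue

    def receive(self, packet):
        # Data-plane operation: the first matching filter gets the packet.
        for predicate, queue, doorbell in self.filters:
            if predicate(packet):
                queue.append(packet)
                if doorbell is not None:
                    doorbell(packet)  # stands in for an async notification
                return True
        return False  # no filter matched: drop the packet
```

Once the filter exists, the kernel is no longer on the receive path: the application just reads its own queue.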
Storage Data Interface:
- VSA supports read, write, flush
- persistent data structure (log, queue)
  - operations immediately persistent on disk
  - eliminate marshaling (layout in memory = in disk)
  - data structure specific caching & early allocation

Remarks

This paper exploits a wide variety of hardware features. Some of them do not even exist yet or are quite limited (e.g., VSIC support). Personally, I'm not a huge fan of this type of research as the main goal seems to be "let's delegate OS work to the hardware". But it does lay out a vision of what an OS can be if the hardware is powerful enough and the applications on top of the OS are not too complex and do not need much kernel involvement (e.g., scheduling).
The presentation of motivation through rigorous measurement (Table 1 and Table 2) is something I really enjoy. In addition, the detailed analysis of the overhead involving the kernel (Background section) contains much useful information for me to study further.
Pros: much better raw performance (for I/O intensive Data Center apps)
- Redis: up to 9x throughput and 81% speedup
- Memcached: scales up to 3x throughput
Cons:
- Some features require hardware functionality that is not yet available
- require modification of applications
- not clear about storage abstractions
- not easy to track behaviors inside the hardware

"Memory Resource Management in VMware ESX Server"

2019-03-11T23:24:00+08:00

Problem
Background
System Design
Remarks

Problem

How to design a memory management system inside the VMM to manage memory allocated to each guest OS (i.e., VM)? This is challenging because each guest OS also has its own resource manager, and the VMM must still ensure performance isolation across all VMs.

Background

VMWare ESX Server runs on bare hardware (in contrast to a VMM that runs on a control OS, like KVM). In the "control OS" approach, here is the workflow of reading a file from an application in a VM:
- an application read from a file (in VM)
- OS translates request into a read from block device (in VM)
- VMM translates that to a read from a file from host OS (in VMM)
- host OS translates request to a read from the machine's block device (in control OS)
Problems with previous memory management approach:
- VMM page replacement algorithm can pick a page important to guest OS. Causes performance anomalies.
- Double paging problem: if the VMM pages out a page first and the guest OS then decides to page out the same page, the VMM must fault the page back in from the system paging device just so the guest can write it out to the virtual paging device.
- Constraint: OS does not have a facility for changing amount of physical memory at runtime

System Design

terms:
- machine address: actual hardware memory
- "physical" address: software abstraction used to provide illusion of hardware memory to a virtual machine
- shadow page table: virtual-to-machine page mappings
- overcommitment: the total size configured for all running virtual machines exceeds the total amount of actual machine memory

Ballooning

Force guest OS to use its own page replacement algorithm

Download VMWare balloon module into guest OS as a pseudo-device driver or kernel service
- no external interface within the guest
- communicate with ESX server via a private channel
If ESX wants to reclaim memory, it instructs the driver to "inflate" by allocating pinned physical pages within the VM; the resulting memory pressure forces the guest OS to free memory
Pinned physical pages used by the balloon are reported to ESX so that ESX can reclaim the corresponding machine pages (i.e., pages allocated to the balloon have their entry in the pmap marked and can be reclaimed by the VMM)
Deflate the balloon to let the guest OS use more memory
If the guest touches a balloon page, allocate a new page (the machine page stolen from the guest OS has to be returned)
Problems:
- Might not be able to reclaim memory fast enough
- Guest OS might refuse memory allocation request or limit the driver's ability to allocate memory
Can always resort to paging. Uses a randomized algorithm to avoid pathologically bad cases of paging out exactly what the guest OS needs

Able to share a page without modification to guest OS (compared to Disco's approach)

Hash every page, store hashes in a hash table
On collision, check if pages are identical. If they are, share copy-on-write
With no collision, store hash as hint. On future collision check if hint is still valid (page content has not changed). If it is, share page.
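
The hash-then-compare-then-share flow can be sketched as follows. This is a simplified model under my own assumptions: `PageSharer` is a hypothetical class, SHA-1 is chosen only for illustration (not necessarily the hash ESX uses), and the hint mechanism for unshared pages is left out.

```python
import hashlib


class PageSharer:
    def __init__(self):
        self.frames = {}   # machine frame number -> [content, refcount]
        self.by_hash = {}  # content hash -> machine frame number
        self.next_frame = 0

    def map_page(self, content):
        """Return the machine frame backing a guest page with this content."""
        h = hashlib.sha1(content).hexdigest()
        frame = self.by_hash.get(h)
        # A hash match is only a candidate: compare full contents before sharing.
        if frame is not None and self.frames[frame][0] == content:
            self.frames[frame][1] += 1
            return frame
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = [content, 1]
        self.by_hash[h] = frame
        return frame

    def write_page(self, frame, new_content):
        """Copy-on-write: the writer ends up on a frame holding new_content."""
        content, refs = self.frames[frame]
        if refs == 1:
            # Sole owner: drop the old frame and its hash entry.
            del self.frames[frame]
            h = hashlib.sha1(content).hexdigest()
            if self.by_hash.get(h) == frame:
                del self.by_hash[h]
        else:
            self.frames[frame][1] = refs - 1
        return self.map_page(new_content)
```

Two guests that both map a zero-filled page end up on the same machine frame, and the first write by either one breaks the sharing for that guest only.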

Managing Memory with Taxes

The problem with proportional share allocators is that they let rich clients hoard resources. VMWare adds a tax to idle memory.

The basic idea is to charge a client more for an idle page than for one it is actively using
Tax rate specifies the maximum fraction of idle pages that may be reclaimed from a client
Inflate cost of idle memory by tax rate
Allow 25% idle memory to provide a buffer for a fast-growing working set increase
Only need percentage of idle memory: measure by random sampling
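
A small numeric sketch of the tax idea. The adjusted shares-per-page ratio below is the formula as I recall it from the paper (shares S, allocated pages P, active fraction f, idle page cost k = 1/(1 - tax rate)); please check it against the paper before relying on it. The client dictionaries and function names are hypothetical.

```python
def shares_per_page(shares, pages, active_fraction, tax_rate):
    """Adjusted shares-per-page ratio; idle pages cost k = 1 / (1 - tax_rate)."""
    if not 0 <= tax_rate < 1:
        raise ValueError("tax_rate must be in [0, 1)")
    k = 1.0 / (1.0 - tax_rate)
    charged = active_fraction + k * (1.0 - active_fraction)
    return shares / (pages * charged)


def pick_victim(clients, tax_rate):
    """Reclaim from the client with the lowest adjusted ratio."""
    victim = min(
        clients,
        key=lambda c: shares_per_page(c["shares"], c["pages"], c["active"], tax_rate),
    )
    return victim["name"]
```

With equal shares and equal allocations, a client that actively uses only 20% of its memory has a much lower ratio at a tax rate of 0.75 than a fully active client, so its idle pages are reclaimed first.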

Others

Actions to reclaim memory when percentage of free memory hits certain level:
- high (6%): no reclamation
- soft (4%): balloon, page when necessary (share before swap)
- hard (2%): page
- low (1%): suspend VM
I/O Page remapping:
- overcommitment may lead to problem that guest OS may address machine page that doesn't exist (e.g., DMA on 32-bit access pages above 4GB boundary ("high" memory))
- fix: keep track of "hotness" of pages in high memory and when the I/O reference count of such pages exceeds a certain threshold, they are remapped to low memory. May remap low memory pages to high memory to get more space for "hotness" remapping

Remarks

Another well-written paper. Each technique is explained very clearly. Its writing style is quite unique as well: the evaluation is divided up and attached to the relevant technique as it is introduced. I personally like this style better as it gives me a break between major points.

"Xen and the Art of Virtualization"

2019-03-11T23:13:00+08:00

Problem
Background
System Design
Further reading

Problem

Provide a high performance resource-managed virtual machine monitor (VMM) that provides performance guarantees to concurrent execution of multiple operating systems: "hosting up to 100 virtual machine instances simultaneously on a modern server"

Background

Big picture

Two types of VMMs

Virtualization techniques ¹
- Fidelity: A program running under the VMM should exhibit a behavior essentially identical to that demonstrated when running on an equivalent machine directly.
- Interposition: All guest actions go through the monitor; the monitor can inspect, modify, deny operations (e.g., compression, encryption, profiling, translation)

Full virtualization is slow:
- VMWare's ESX Server dynamically rewrites portions of the hosted machine code to insert traps whenever VMM intervention might be required. This applies to entire guest OS as all non-trapping privileged instructions must be caught and handled.
- ESX Server maintains shadow page table and to maintain consistency with virtual tables, it traps every update.

System Design

Paravirtualization: idealized machine, efficient to virtualize
- More efficient than "full" virtualization
- Cost: need to modify OS
For safety: Xen exists in a 64MB section at the top of every address space, thus avoiding a TLB flush when entering and leaving hypervisor
CPU:
- X86 supports 4 privilege levels: Without Xen, 0 for OS, and 3 for applications; Xen downgrades OS to level 1, and it runs level 0
- Syscall and page-fault handlers: registered with Xen; "fast handlers" for most exceptions don't involve Xen
Paravirtualization techniques:
- Run VMM at ring 0, OS at ring 1 (app stays at ring 3)
- System calls vector directly to guest OS without VMM involvement. Validate handler at install time.
- Page fault handler doesn't read cr2 to get faulting address, put it in stack frame. VMM must execute to read cr2
- Mappings validated when page tables written (same as exokernel)
- Updates to page table are batched and validated in bulk. Avoiding interrupt-like updates is an important technique.
- Type and reference count for each physical frame (PD, PT, LDT, GDT, RW)
- Hardware physical to machine memory mapping readable by all VMs.
  - Needed by guest OS for writing page table, and useful for superpages or cache coloring.
- VMs have access to both real and virtual time.
- All devices use shared-memory asynchronous buffer-descriptor rings (a batch interface)
- Interrupts replaced with event delivery bitmap.
  - Events can be held off like disabling interrupts.
  - Some control over notification granularity, allowing latency/bandwidth tradeoffs (e.g., notify for every packet, or every 16 packets)
- I/O requests have a unique ID and can be reordered
  - E.g., Guest OS and Xen can schedule the disk arm
  - But guest can pass a reorder barrier to prevent some reordering (e.g., for file system consistency)
- OS makes hypercalls to VMM (e.g., install page table entries)
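
The buffer-descriptor ring mentioned above is worth a sketch, since the batching is where much of the efficiency comes from. This is a simplified, single-direction model (requests only, no response ring) with hypothetical names; it is not Xen's actual ring layout.

```python
class IORing:
    """Fixed-size descriptor ring: the guest produces requests, the backend
    (hypervisor or driver domain) consumes them."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size
        self.req_prod = 0  # advanced only by the guest
        self.req_cons = 0  # advanced only by the backend

    def put_request(self, req_id, payload):
        if self.req_prod - self.req_cons == self.size:
            return False  # ring full; the guest must wait
        self.slots[self.req_prod % self.size] = (req_id, payload)
        self.req_prod += 1
        return True

    def take_batch(self):
        """Drain all outstanding requests in one pass (one notification)."""
        batch = []
        while self.req_cons < self.req_prod:
            batch.append(self.slots[self.req_cons % self.size])
            self.req_cons += 1
        return batch
```

Because each index has a single writer, the guest can queue several requests and notify the backend once, which is the latency/bandwidth tradeoff noted in the event-delivery bullet.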
Other important ideas:
- Domains are virtual machines:
  - Domain 0 provides the administrative functions of the VMM (keeps complexity out of the VMM proper)
  - Domain 0 contains the real device drivers (domain 0 is the target of malware attacks)
- Virtual network devices (VIFs) may filter packets to prevent spoofing
- Memory is partitioned across domains. A domain provides memory for I/O operations. 1 page per packet (wow).

"Practical, transparent operating system support for superpages"

2019-03-10T15:00:00+08:00

Problem
Superpage definition and benefits
Superpage issues
System Designs
Remarks

Problem

How to design a general and transparent superpage management system to achieve high and sustained performance for real workloads and negligible degradation in pathological situations?

Superpage definition and benefits

Superpage definition:
- Memory pages of larger sizes (e.g., 8KB, 64KB, 512KB, 4MB)
- The rest is the same as the normal pages:
  - power of 2 size
  - use only one TLB entry
  - contiguous
  - aligned (physically and virtually)
  - uniform protection attributes
  - one reference bit, one dirty bit
Superpage improves TLB coverage (i.e., the amount of memory accessible through cached mappings, in other words, without incurring misses in the TLB)

Superpage issues

Allocation: how/when/what size to allocate?

Promotion (create a superpage out of a set of smaller pages)
- Wait for app to touch pages (the gray ones) may lose opportunities to increase TLB coverage
- Create small superpage (the first four blue ones) may waste overhead
- Forcibly populate pages (i.e., group all eight base pages into one superpage) may lead to internal fragmentation

Demotion (convert a superpage into smaller pages). Happens when:
- page attributes of base pages of a superpage become non-uniform (e.g., some of the base pages protection attributes changed and different from the rest within the superpage)
- during partial pageouts
Fragmentation
- Memory becomes fragmented due to:
  - use of multiple page sizes
  - persistence of file cache pages
  - scattered wired (non-pageable) pages
- Contiguity: contended resource
  - OS must trade off impact of contiguity restoration against superpage benefits

System Designs

Allocation

Determines a preferred superpage size for the region encompassing the base page whose access caused the page fault
The entire set of frames (gray ones above) is tentatively reserved for potential future use as a superpage, and added to a reservation list.
reservation size:
- memory objects fixed in size (e.g., code, file): largest, aligned superpage (contains the faulting page) + doesn't overlap with existing reservations or allocated pages + no larger than object
- memory objects with dynamic size (e.g., heap): everything same as above except not restricting no larger than object
- If size is not available, preempt existing reservation or resign to a smaller size
preemption policy: among all candidate reservations, preempt the one whose most recent page allocation occurred least recently.
- Use reservation list:
  - keep track of reserved page frame extents that are not fully-populated
  - One reservation list for each page size supported by the hardware except for the largest superpage size
  - Reservations in each list are kept sorted by the time of their most recent page frame allocations. When the system decides to preempt a reservation of a given size, it chooses the reservation at the head of the list for that size.
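
The per-size reservation list behaves like an LRU list keyed on "last page-frame allocation". A minimal sketch, assuming hypothetical reservation ids and using `OrderedDict` to keep the order (the paper's implementation details differ):

```python
from collections import OrderedDict


class ReservationList:
    """Reservations of one size, ordered by most recent page-frame allocation."""

    def __init__(self):
        self._order = OrderedDict()  # reservation id -> None; head = stalest

    def add(self, res_id):
        self._order[res_id] = None

    def touch(self, res_id):
        # A page frame was just allocated from this reservation.
        self._order.move_to_end(res_id)

    def remove(self, res_id):
        # The reservation became fully populated and leaves the list.
        del self._order[res_id]

    def preempt(self):
        """Evict the reservation whose last allocation is the oldest."""
        res_id, _ = self._order.popitem(last=False)
        return res_id
```

The intuition is that a reservation that has not grown recently is the least likely to be fully populated soon, so it is the cheapest one to break.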

Incremental Promotion

Promote only regions that are fully populated by the application
Promotion occurs to the smallest superpage size as soon as the population count corresponds to that size. Then, when the population count reaches the next larger superpage size, another promotion occurs to the next size, and so on.
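
Incremental promotion can be sketched as "find the largest aligned extent around the faulting page that is completely populated". The sketch below assumes 8 KB base pages so that the 64KB, 512KB, and 4MB superpages are 8, 64, and 512 base pages, and it uses a plain set of populated page indices instead of the population map, purely for illustration.

```python
SUPERPAGE_SIZES = [8, 64, 512]  # in base pages; assumes 8 KB base pages


def promoted_size(populated, page):
    """Largest aligned extent containing `page` that is fully populated.

    `populated` is the set of base-page indices already touched.
    Returns the extent size in base pages (1 means no promotion).
    """
    best = 1
    for size in SUPERPAGE_SIZES:
        start = (page // size) * size  # aligned start of the candidate extent
        if all(p in populated for p in range(start, start + size)):
            best = size
        else:
            break  # a larger extent contains this one, so it cannot be full
    return best
```

Scanning a set like this is linear in the extent size; the population map described below exists to answer the same question with a counter lookup.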

Demotions

Speculative demotions:
- One reference bit per superpage: how do we detect portions of a superpage not referenced anymore?
- Solution: On memory pressure, demote superpages when resetting reference bit; Re-promote (incrementally) as pages are referenced

Fragmentation Control

Low contiguity: modified page daemon
- restore contiguity: move clean, inactive pages to the free list
- minimize impact: prefer pages that contribute the most to contiguity; keep contents for as long as possible (even when part of a reservation: if reactivated, break reservation)
Cluster wired pages

Auxiliary data structure: population map

Population maps keep track of allocated base pages within each memory object.
Purpose:
- Reserved frame lookup
- Overlap avoidance
- Promotion decisions
- Preemption assistance
Implemented as a radix tree. See the paper for details

Remarks

Another well-written paper. I think it is a good candidate for a self-learning project (i.e., implementing superpage management system on JOS)

"Threads and Input/Output in the Synthesis Kernel"

2019-03-10T12:20:00+08:00

Problem
System Designs
Remarks

Problem

Design an OS for a parallel and distributed computational environment and achieve the following three goals:

High performance
Self-tuning capability to dynamic load and configuration changes
A simple, uniform and intuitive model of computation with a high-level interface

System Designs

Dataflow: Synthesis Model of Computation

The threads of execution form a directed graph, in which the nodes are threads and the arcs are data flow channels
Observation: Data flows through a pipeline that connects many OS-managed devices/resources (e.g., capture | xform | filter | detect &)

Above shows a dataflow example from Synthesis: file system server. It differs from a traditional design:
- Boxes are threads/servers
- Boxes are connected directly with jmp instructions to implement scheduling
- Arrows (and boxes) are specialized dynamically to the application

Fast context switch: procedure chaining

Picture above shows that the ready-to-run threads are chained in an executable circular queue. A jmp instruction in the context-switch-out procedure of the preceding thread points to the context-switch-in procedure of the following thread.
"Executable data structures": embed code in data structures to avoid data structure traversals and to specialize code for each object (e.g., put context switch code inside of thread control block)
Context switch steps: Timer interrupt vectored directly to current thread's sw_out; sw_out calls (directly) next thread's sw_in or sw_in_mmu:
- interrupt vectored to sw_out
- sw_out saves registers
- sw_out jumps to next sw_in_mmu
- sw_in_mmu updates MMU
- sw_in_mmu updates CPU interrupt vector base
- sw_in_mmu restores CPU registers (including putting user-PC into the user-PC register)
- sw_in_mmu does return from exception (replacing PC with user-PC and changing mode back to user mode)
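
To get a feel for "executable data structures", here is an analogy in Python using closures. Synthesis generates specialized machine code per thread; the closures below only mimic the shape of the idea (each thread control block carries its own switch code, and `sw_out` transfers straight to the next thread's `sw_in`). All names besides `sw_in` and `sw_out` are hypothetical.

```python
def make_thread(name, log):
    """Build a thread control block that holds its own switch code."""
    tcb = {"name": name, "regs": 0, "next": None}

    def sw_in():
        # Restore this thread's saved state and resume it.
        log.append(f"in:{name}")
        return tcb

    def sw_out(cpu_regs):
        # Save state, then go straight to the next thread's sw_in:
        # no scheduler loop and no run-queue traversal.
        tcb["regs"] = cpu_regs
        log.append(f"out:{name}")
        return tcb["next"]["sw_in"]()

    tcb["sw_in"], tcb["sw_out"] = sw_in, sw_out
    return tcb


def chain(tcbs):
    """Link the ready threads into a circular run queue."""
    for cur, nxt in zip(tcbs, tcbs[1:] + tcbs[:1]):
        cur["next"] = nxt
```

A "timer interrupt" is then just a call to the current thread's `sw_out`, which returns the thread that is now running.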

Mechanism to reduce synchronization overhead

Lots of techniques used to reduce synchronization overhead

Code Isolation: reduce false sharing (i.e., eliminate false sharing within a single C struct) (e.g., thread table entries (TTEs) are not shared. Similar to privatization)
Procedure Chaining: use continuations (implemented by changing the return address on the stack) to allow certain services to complete atomically (e.g., defer signal to end of interrupt handling)
Optimistic synchronization: it is easier to break the rule and ask forgiveness than get permission. Try the operation, but before commit, check to see if no one else interfered.
Lock-free queues that use the compare-and-swap instruction ¹. This is not wait-free (some operations do not have bounded waiting time), it is obstruction-free (a thread, executed in isolation for a bounded number of operations will complete).
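
The optimistic pattern (read, prepare privately, commit only if nobody interfered, otherwise retry) can be sketched as below. Two loud assumptions: Python has no user-level compare-and-swap, so `AtomicRef` emulates the instruction with a lock, and the example is a stack rather than the paper's queues because the retry loop is easier to see.

```python
import threading


class AtomicRef:
    """Emulates a hardware compare-and-swap cell. Assumption: a lock stands
    in for the atomic instruction, which Python does not expose."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Stack:
    def __init__(self):
        self.head = AtomicRef(None)

    def push(self, item):
        while True:
            old = self.head.load()   # read without taking a lock
            node = (item, old)       # prepare the update privately
            if self.head.compare_and_swap(old, node):
                return               # nobody interfered: committed
            # Someone else changed head in the meantime: retry.

    def pop(self):
        while True:
            old = self.head.load()
            if old is None:
                return None
            if self.head.compare_and_swap(old, old[1]):
                return old[0]
```

A thread that loses the race simply retries, which is why an individual operation has no bounded completion time under contention.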

Remarks

Even though I cannot fully understand every detail of the paper, I think it is the best-written paper I have read so far this semester (the flow is great).
Many details are omitted from the paper: scheduling, interrupt handling, details of the lock-free queues (SP-SC, MP-MC, etc.)
Lots of cool techniques are worth investigating in their own right, and they are "field openers":
- Code synthesis (JIT compiler, super-optimizers)
- Code isolation (Privatization)
- Procedure chaining (Continuations (lambda, events))
- Optimistic synchronization (lock-free data structures)
- Synthesis I/O (Dataflow: Scout, Click Router, SEDA, StageServer, IXP, PTask, etc)

See OSTEP::locks for details on test-and-set and compare-and-swap instruction and their usage in lock implementations. ↩

"END-TO-END ARGUMENTS IN SYSTEM DESIGN"

2019-03-09T21:20:00+08:00

Main Point

The following statements provide different angles of stating the same "end-to-end argument":

Functions placed at low levels of a system may be redundant or of little value when compared with the cost of providing them at that low level; low-level mechanisms to support these functions are justified only as performance enhancements.
Since the application knows its needs best, it is natural to move functions upward in a layered system, closer to the application that uses them.
The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement: "Worse is better".)
The end-to-end argument says that many functions in a communication system can only be completely and correctly implemented with the help of the application(s) at the endpoints ¹

Example: File transfer

In the paper, a scenario about transferring a file between host A and host B is used to demonstrate the "end-to-end argument".

The file transfer application still needs to compare the checksum computed on the receiving host with the one from the sending host to ensure file consistency, and to retransfer the file if the checksums differ, even if the underlying communication system already implements a similar mechanism to ensure that no bits change within a packet and no packet gets dropped. In other words, the file transfer application must provide its own retries based on an end-to-end checksum of the file. And if it does so, the extra effort expended in the communication system to provide a guarantee of reliable data transmission only reduces the frequency of retries by the file transfer application; it has no effect on the inevitability or correctness of the outcome, since correct file transmission is assured by the end-to-end checksum and retry whether or not the data transmission system is reliable. Thus, having the data communication system go out of its way to be extraordinarily reliable does not reduce the burden on the application program to ensure reliability.

Thus, low levels (e.g., the communication system in this example) need not provide "perfect" reliability, as the upper-level application still needs to implement the same functionality again to ensure correctness. The communication system still needs to implement some mechanism to improve application performance (e.g., reducing application retries in this case), but such a mechanism need not be "perfect" (e.g., striving for an error rate so negligible that retries never happen). In addition, lower levels may not have enough information to "perfect" functions used by applications, and the overhead introduced by such functions taxes applications on the same system that never need such functionality.
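
A minimal Python sketch of the file transfer example, under stated assumptions: the `send` callable is a hypothetical unreliable channel (bytes in, possibly corrupted bytes out), `make_flaky_channel` is an invented test helper, and SHA-256 merely stands in for whatever checksum the application uses. The point it illustrates is that the end-to-end check decides correctness, while the reliability of the channel only changes how many attempts are needed.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def transfer_with_end_to_end_check(data, send, max_attempts=5):
    """Retry until the receiver's checksum matches the sender's.

    `send` is a hypothetical channel: it takes bytes and returns the bytes
    that arrived at the other end, which may be corrupted.
    """
    expected = checksum(data)
    for attempt in range(1, max_attempts + 1):
        received = send(data)
        if checksum(received) == expected:
            return received, attempt
    raise IOError("transfer failed after %d attempts" % max_attempts)

def make_flaky_channel(failures):
    """Test helper: corrupt the first `failures` transfers, then deliver intact."""
    state = {"left": failures}

    def send(data):
        if state["left"] > 0:
            state["left"] -= 1
            return bytes([data[0] ^ 0xFF]) + data[1:]  # flip the first byte
        return data

    return send
```

With `make_flaky_channel(2)` the transfer succeeds on the third attempt; with `make_flaky_channel(0)` it succeeds on the first. The result is identical in both cases, only the retry count differs.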

More examples are discussed in the paper: (application-level) encryption, duplicate message detection, message sequencing, guaranteed message delivery, detection of crashes, and delivery receipts.

From morning paper ↩

"Exokernel: An Operating System Architecture for Application-Level Resource Management"

2019-03-09T13:20:00+08:00

Problem

How can we design an OS such that it improves performance of standard applications while providing flexibility to enable applications to customize resource management to improve performance?

Approach

extensibility: the application knows best what resource management it needs. Therefore, it should make decisions whenever possible (end-to-end argument)
minimalist: kernel's job is to protect resources, not to manage them ("separate protection from resource management")
challenge: identify core of the abstractions for different resources
Thin kernel, fat OS libraries

Background: extensibility

There are five approaches to extensibility (including Exokernel):

OS per application

Example: Fluke
Idea:
- Hypervisor provides resource management and isolation
- Additional guest-OS layers redundant and unnecessary
- Collapse guest OS and application into same domain (typically compiles OS and app into the same binary)
Pros:
- Fast (the same advantage provided by unikernels)
Cons:
- Co-existence of applications
- Kernels are fragile and hard to modify

Microkernels

Examples: Hydra, Mach
Idea:
- Minimal OS core to manage hardware
- Higher level abstractions in user space
- IPC fundamental cross-domain primitive
Pros:
- Fault isolation
- Better extensibility
Cons:
- Slow (kernel crossings)
- Limited extensibility (may make it easier for the OS developer to extend, but not the user)

Virtual machines

Examples: VM370, Disco, VMware, Xen
Ideas:
- Different apps need different OSes
- Figure out how to run more than one OS at a time
Pros:
- low-level interface ("ideal" according to Exokernel standard)
Cons:
- "emulate" machine vs. "export" resources (e.g., need to emulate "privileged" instructions)
- Poor IPC (traditionally): machines isolated
- Hide resource management

Download untrusted code into kernel

Examples: Spin, Vino
Ideas:
- OS provides extensibility interfaces
- Apps provide extensions that execute in kernel mode
Pros:
- extension
Cons:
- Still working with same OS structure
- Only extensible within the limits of the extensibility API
- New thicket of isolation and trust issues

System designs

High-level Architecture

Like the previous four types of system, exokernel is another system architecture style that focuses on system extensibility.

Top level structure:
- A small monolithic kernel
  - low-level, fixed interface
  - Ideally hardware interface
  - few and simple abstractions
  - extension types: resource state data (page table entries), specialized resource management modules
- Libraries of untrusted resource management routines
  - VM replacement
  - File System
  - IPC
  - ...

Note

Libraries are part of the OS. Historically, an OS was a set of libraries for math, etc. However, that is no longer true today.

Key difference - trust: application can write over library, jump to bad address in library, etc. Thus, kernel cannot trust library.
Exokernel borrows liberally from other approaches:
- Like Fluke: make it easy for each app to have custom OS
- Like virtual machine: exokernel exports virtual machine but different in
  - Transparency: traditional VM wants to run unmodified OS's; exokernel VM wants to support custom OS's
  - Export rather than emulate resource: libOS is aware of multiplexing
- Like Vino, Spin: one mechanism for extensibility is to download untrusted code into kernel
Philosophy
- Traditional OS = protection + abstraction
- Exokernel:
  - Protection = kernel (minimal mechanism) + library (resource sharing policy)
  - Abstraction = library

Exokernel principles

Separate protection and management
- export resources at lowest level possible with protection (e.g., disk blocks, TLB entries, etc)
- resource management only at level needed for protection (e.g., allocation, revocation, sharing, tracking of ownership)
- "abstraction (mechanism) is policy": the implementation of abstractions in library operating systems can be simpler and more specialized than in-kernel implementations, because library operating systems need not multiplex a resource among competing applications with widely different demands
Expose allocation: applications allocate resources explicitly
Expose names: use physical names (physical memory (cache coloring), disk arm position)
Expose revocation: let apps choose which instances of a resource to give up
Expose information: let applications map in (read-only) internal kernel data structures (e.g., software TLB, CPU schedule, etc.)
Exterminate all operating system abstractions (end-to-end)

Key Mechanisms

Secure bindings

Bind at large granularity; access at small granularity (allow kernel to protect resources without understanding them)
Do access check at bind time, not access time (e.g., when loading TLB entry for a page, not when accessing page)
Examples:
- Hardware: TLB
- Software: Software TLB cache
- Download code (e.g., packet filter): type safe language, sandboxing, interpreters, etc
- Traditional file system: open file/read and write file
Challenge: secure bindings vs. Saltzer "complete mediation"
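
The "check at bind time, not access time" idea can be sketched as a toy model in Python. All names here (`SecureBindingTable`, `page_owner`, the process labels) are hypothetical and not from the paper. The authorization check runs once when a binding is installed, and every later access is a plain lookup, much like a TLB hit.

```python
class SecureBindingTable:
    """Toy model of secure bindings: authorize once at bind time."""

    def __init__(self, page_owner):
        self.page_owner = page_owner  # physical page -> owning process (assumed given)
        self.bindings = {}            # (process, virtual page) -> physical page
        self.checks = 0               # how many authorization checks were performed

    def bind(self, process, vpage, ppage):
        # The ownership check happens here, once per binding.
        self.checks += 1
        if self.page_owner.get(ppage) != process:
            raise PermissionError("%s does not own physical page %r" % (process, ppage))
        self.bindings[(process, vpage)] = ppage

    def access(self, process, vpage):
        # No ownership check: the existence of the binding is the authorization.
        return self.bindings[(process, vpage)]
```

This also shows the tension with "complete mediation": if ownership changes after `bind`, the stale binding must be explicitly removed, because `access` never rechecks.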

Visible revocation

Transparent revocation (Traditional OS)
- OS decides how many resources to give to apps
- OS chooses what to revoke and takes it
- Needed for performant frequent revocation (e.g., address space identifier (ASID))
Notify on revocation (Exokernel)
- abort protocol; repossession vector; scheduler activations
- OS decides how many resources to give to apps
- OS chooses what to revoke, takes it, and tells application (or libOS)
  - Call application handler when taking away page, CPU, etc
  - Application can react: update data structures (e.g., reduce # of threads when CPU goes away; scheduler activations) and decide what page to give up
- Repossess dirty disk block (store to "swap server")
- ASIDs (processor addressing-context identifiers) are identified as a resource best revoked transparently because of frequent revocation
Cooperative revocation (Exokernel)
- callbacks
- OS decides how many resources to give to apps
- OS asks application or libOS to give up a resource; libOS/app decides which instance to give up

Abort protocol

When voluntary revocation fails, the kernel takes the resource by force and tells the application what it took away. Doing so helps the library OS maintain a valid view of the resources it holds.
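
The interplay between cooperative revocation and the abort protocol can be sketched as follows. This is a simplified model with hypothetical names (`LibOS`, `kernel_revoke`, the "give up the last pages" policy). The kernel first asks, the library OS chooses which instances to give up, and whatever is still missing is taken by force and reported back.

```python
class LibOS:
    """Toy library OS holding a list of page identifiers."""

    def __init__(self, pages, cooperative=True):
        self.pages = list(pages)
        self.cooperative = cooperative
        self.repossessed = []  # pages the kernel reported taking by force

    def on_revoke_request(self, count):
        """Cooperative revocation: the libOS decides which pages to give up."""
        if not self.cooperative or count <= 0:
            return []
        victims = self.pages[-count:]     # hypothetical policy: give up the last pages
        self.pages = self.pages[:-count]
        return victims

    def on_repossess(self, page):
        """Abort protocol: the kernel tells us which page it took."""
        self.pages.remove(page)
        self.repossessed.append(page)

def kernel_revoke(libos, count):
    """Ask first; forcibly take (and report) whatever is still missing."""
    returned = libos.on_revoke_request(count)
    missing = count - len(returned)
    forced = []
    for page in libos.pages[:missing]:    # kernel picks its own victims
        libos.on_repossess(page)
        forced.append(page)
    return returned, forced
```

Because `on_repossess` is always called, the library OS's bookkeeping (`pages`) stays consistent even when it did not cooperate.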

Capabilities

Encryption-based tokens to prove right to access
Idea is to help kernel make access-rights decision
Pros:
- Simple
- Generic across resources
- Hierarchical: using capabilities to protect resources enables applications to grant access rights to other applications without kernel intervention. Applications can also use "well-known" capabilities to share resources easily

Others

Wakeup predicates: wake up process when arbitrary condition becomes true (checked when scheduler looking for something to run)
Buffer cache registry: bind disk blocks to memory pages (applications can share cached pages)
Block stat to order writes
UDF

Specific Abstractions

Many abstractions need to be implemented in exokernel: exception handlers, page protection/sharing, processor scheduling, fork/exec, VM replacement, network protocols, and the file system. Here, I only list the paper's discussion related to the network.

Network

Multiplexing the network: packet filter

Idea: load a small piece of code that examines packet and decides if it is for me
Implement by downloading code into kernel: written in simple, safe language - no loops, check all mem references, etc
Problem: what if I lie and say "yes it is for me" when it isn't?
- Solution: "assume they don't lie"
- Claim: could use a trusted server to load these things or could check to make sure that a new filter never overlaps with an old one [not likely to solve the problem]
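
A rough Python sketch of packet-filter demultiplexing. Real packet filters are written in a restricted language and checked by the kernel; here a filter is just a Python predicate over a dictionary-shaped packet, and `PacketDemux`, `make_port_filter`, and the `dst_port` field are assumptions made for illustration.

```python
def make_port_filter(port):
    """A filter is a loop-free predicate over packet header fields."""
    return lambda pkt: pkt.get("dst_port") == port

class PacketDemux:
    """Toy kernel-side demultiplexer: the first matching filter claims the packet."""

    def __init__(self):
        self.filters = []  # (owner, predicate) in installation order

    def install(self, owner, predicate):
        self.filters.append((owner, predicate))

    def deliver(self, pkt):
        for owner, predicate in self.filters:
            if predicate(pkt):
                return owner
        return None  # nobody claimed the packet
```

Installing a filter such as `lambda pkt: True` ahead of the others would claim every packet, which is exactly the "what if I lie" problem noted above.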

Application-specific safe handlers (ASH)

Application-level message handlers that are downloaded into kernel (can reply to packet without context switch)
- Example: auspex file server responds to NFS getattr requests in hardware in network interface
Pros:
- direct message vectoring: ASH knows where the message should land in user memory and thus avoids copies
- dynamic integrated layer processing (e.g., do checksum as data is copied into the network interface)
- message initiation (fast replies)
- No danger of deadlock
- control initiation (active messages)
Figure 2 of the paper shows that, without ASH, the exokernel just drops the message into the application buffer, and the application handles it later, when it is scheduled. Since the paper uses a round-robin scheduler, we see a linear increase in ping latency.

"The UNIX Time-Sharing System"

2019-03-03T11:13:00+08:00

Problem
System designs
Remarks

Problem

How to design an interactive-use system that is easy to use (i.e., to write, test, and run programs) given the hardware constraints?

System designs

Architecture

File System

Untyped data (byte oriented)
- Structure of files is controlled by the programs that use them, not by the system
Hierarchical name space
- Strict hierarchy across directories
  - The directory structure is constrained to have the form of a rooted tree (i.e., each directory must appear as an entry in exactly one other, which is its parent)
- Disallowing multiple links to directories
  - Advantages: easier search; easier garbage collection (no cycles)
Directories are files
- A directory behaves exactly like an ordinary file except that it cannot be written on by unprivileged programs, so that the system controls the contents of directories; anyone with proper permission can read a directory just like an ordinary file.
- Linking: same nondirectory file may appear in several directories under possibly different names
Treating I/O devices as files
- Special files exist for each communication line, each disk, each tape drive, and for physical core memory
- Advantages:
  - File and device I/O are as similar as possible
  - File and device names have the same syntax and meaning
  - Special files are subject to the same protection mechanism as regular files
Mount
- Removable storage; expand storage
  - mount replaces a leaf of the hierarchy tree (the ordinary file) by a whole new subtree (the hierarchy stored on the removable volume)
- No cross-volume links allowed
Set-User-ID (right amplification)
- Setuid (set user ID on execution) permits users to run certain programs with escalated privileges (i.e., execute a program as someone else)
- Example: when a user wants to change their password, they run the passwd command. The passwd program is owned by the root account and marked as setuid, so the user is temporarily granted root access for that very limited purpose. Since the user ID of the invoker is known, it is the passwd program's responsibility to make sure the invoker behaves properly.
File descriptor
- A small integer used to identify the file in subsequent calls to read, write
- Used as an index into a system table that contains the file's device, i-number, and read/write pointer
- Filter programs (e.g., ls | pr -2 | opr) do not know the name of input or output files
- "Handle with access rights” (that is a capability, which is an abstraction that makes protected sharing easier); it is a capability because the right can be transferred to other processes that want to access the file
Implementation
- A directory contains file name and a pointer to file; the pointer is the i-number of the file. When the file is accessed, its i-number is used as an index into a system table (i-list) stored in a known part of the device on which the directory resides.
- The i-number is used as an index to find the corresponding inode entry in the i-list; the inode contains the description of the file
- A small (nonspecial) file fits into eight or fewer blocks; in this case the addresses of the blocks themselves are stored. For large (nonspecial) files, each of the eight device addresses may point to an indirect block of 256 addresses of blocks constituting the file itself. These files may be as large as $8 \cdot 256 \cdot 512 = 1{,}048{,}576$ bytes
- ... more on paper
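
The block-addressing arithmetic above can be checked with a short Python sketch. The constants come from the notes (8 addresses per inode, 256 addresses per indirect block, 512-byte blocks); the function names and the tuple-shaped return values are hypothetical.

```python
BLOCK_SIZE = 512          # bytes per block
ADDRS_PER_INODE = 8       # device addresses stored in the inode
ADDRS_PER_INDIRECT = 256  # block addresses held by one indirect block

def max_small_file_bytes():
    return ADDRS_PER_INODE * BLOCK_SIZE

def max_large_file_bytes():
    return ADDRS_PER_INODE * ADDRS_PER_INDIRECT * BLOCK_SIZE

def locate_block(offset, large):
    """Map a byte offset to the inode slot (and indirect slot) that holds it."""
    block_no = offset // BLOCK_SIZE
    if not large:
        if block_no >= ADDRS_PER_INODE:
            raise ValueError("offset beyond small-file limit")
        return ("direct", block_no)
    indirect, slot = divmod(block_no, ADDRS_PER_INDIRECT)
    if indirect >= ADDRS_PER_INODE:
        raise ValueError("offset beyond large-file limit")
    return ("indirect", indirect, slot)
```

For example, the last byte of the largest possible file, offset 1,048,575, lives in slot 255 of the indirect block referenced by inode address 7.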

Process Management

An image is a computer execution environment. It includes a core image (contains three segments: code, heap, stack), general register values, status of open files, current directory, and the like. An image is the current state of a pseudo computer.

Note

I like to think of an image as a snapshot of the process, which cannot be modified. This is useful during multitasking, as the OS can save the current process state as an image before running other processes. At some point, the OS can load the image back again to continue running the process. See here and here for more info.

A process is the execution of an image.
Building blocks
- Fork, exec, wait
- File I/O structure
  - A fork'd child shares its parent's open files but has an independent copy of the original core image
  - Pipes: not a completely general mechanism since the pipe must be set up by a common ancestor of the processes involved (see the paper for the working mechanism)
  - Standard I/O
    - Programs executed by the shell start off with two open files which have file descriptors 0 (stdin) and 1 (stdout)
    - Enables redirection and pipelines easily
  - Shell
    - bg execution
    - initialization: last step in the initialization of UNIX is creation of a single process and the invocation of init; init creates child processes which will later become shells.

Remarks

On the advantages of treating I/O devices as files: (1) seems useful, but high-performance implementations tend to treat, e.g., network I/O differently from disk I/O; (2) seems useless (the "ioctl" interface for device-specific functionality is terrible); (3) is compelling
"There have been three versions of UNIX." Fred Brooks is right
I learned some new UNIX usage (e.g., pr paginates its input with dated headings; ed as an editor; (date; ls) > x uses parentheses to group commands)

Brush up OS

2019-01-23T22:24:00+08:00

This post aims to prepare myself for the upcoming CS380L Advanced Operating Systems offered by Christopher J. Rossbach. The questions are actually from his HW1 aka. "Swapping in the state from undergraduate OS".

Brush up

Brush up

Definitions

Define the following terms:

Internal and external fragmentation. Which of them can occur in a paging system? A system with pure segmentation?

Fragmentation happens when we talk about memory allocation (e.g., user requests memory through malloc(); OS manages physical memory when using segmentation to implement virtual memory). There are two types of fragmentation:

internal fragmentation: unused memory within a unit of allocation: if an allocator hands out chunks of memory bigger than requested, any unasked-for (and thus unused) space in such a chunk is considered internal fragmentation (because the waste occurs inside the allocated unit). Metaphorically speaking, internal fragmentation happens when a party of three is seated at a table for four.
external fragmentation: unused memory between units of allocation: the free space gets chopped into little pieces of different sizes and is thus fragmented; subsequent requests may fail because there is no single contiguous space that can satisfy the request, even though the total amount of free space exceeds the size of the request. Metaphorically speaking, external fragmentation happens when two fixed tables for two are free but a party of four arrives.

In a paging system, since the free space is divided into fixed-size units (i.e., pages), no external fragmentation can happen: a page request will always be satisfied because all pages are the same size, so we can simply return any free one. However, internal fragmentation can still happen. For example, if the page size is 4KB and we request 3KB of memory, then there will be 1KB of unused memory within the returned page.

In a system with pure segmentation, only external fragmentation can happen. Since segments are variable-sized units, external fragmentation cannot be avoided. However, because the system can hand out exactly the amount of memory requested (there are no fixed-size pages), there is no internal fragmentation.
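
Both kinds of fragmentation can be illustrated numerically with a small Python sketch. The helper names are hypothetical; `first_fit` stands in for any allocator that needs one contiguous hole, and the hole sizes mirror the "two tables for two" metaphor.

```python
def internal_fragmentation(request, page_size):
    """Bytes wasted inside the last page when memory is handed out in whole pages."""
    pages = -(-request // page_size)  # ceiling division
    return pages * page_size - request

def first_fit(holes, request):
    """Index of the first free hole large enough for `request`, or None."""
    for i, size in enumerate(holes):
        if size >= request:
            return i
    return None
```

`internal_fragmentation(3 * 1024, 4 * 1024)` gives 1024, the 1KB waste from the paging example. `first_fit([2, 2], 4)` returns `None` even though 4 units are free in total, which is external fragmentation.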

Note

OSTEP Chapter 16 and Chapter 17 offer more on segmentation, which causes fragmentation, and free-space management that handles external fragmentation issue.

Translation look-aside buffer (TLB).

TLB is part of the chip's memory-management unit (MMU). It serves as a hardware cache of popular virtual-to-physical address translations (i.e., virtual page number to physical frame number translations)¹.

Interrupt.

Interrupts are essentially requests for attention. In the same way, peripherals in a computer system can request the attention of the processor. The event that makes a microprocessor stop executing one routine to perform some other routine to service a request is called an INTERRUPT². In other words, an interrupt is an electrical signal that causes the processor to stop the current execution and begin executing interrupt handler code. Interrupts are asynchronous exceptions. Asynchronous here means "not related to the instruction that just executed"³.

Distributed shared memory.

Distributed shared memory is a memory architecture in which a cluster of nodes shares the same logical address space, while the physical memory is contributed by each node in the cluster and connected via the network (i.e., a form of memory architecture where physically separated memories can be addressed as one logically shared address space) ⁴

Stateful and stateless servers. Also, list two file systems, one that is stateful and one that is stateless, and explain how having or not having state affects file service.

A stateless server does not keep track of anything that is happening at each client (e.g., the server does not track which client has opened which file). A stateful server is the opposite. For example, the server keeps track of which client accesses which files so that when the server's copy of a file is modified (e.g., by some other client that also accesses the same file), the server can inform clients to invalidate their cached copies of that file.

A classic stateless file system is NFS with the NFSv2 protocol (the NFSv4 protocol makes the NFS server stateful as well). AFS is a stateful file system.

Not having state makes crash recovery on the server side fast: the server doesn't need to recover any client-side information, and on server failure, clients can simply retry the request. However, without state, clients need to constantly contact the server to validate their caches, which imposes extra load on the server. Furthermore, clients need to pass extra information (e.g., a file handle) when performing system calls, which imposes extra overhead on clients. Having state complicates crash recovery: the server needs to initiate cache validation with the clients; clients also need to send heartbeat messages to the server. However, with state, a client doesn't need to constantly contact the server to validate its cache (it waits for the server's callback).

Swapping.

Since not all pages can reside in physical memory at once, to support a large address space, we use part of the disk (i.e., swap space) and swap pages in and out between physical memory and swap space ⁵.

Inverted page table.

Inverted page table (along with multi-level page tables) aims to save memory space taken by the page tables. Instead of having one page table per process, we keep a single page table that has an entry for each physical page of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page ⁶. PowerPC uses this technique.
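
A toy Python model of an inverted page table, assuming a hypothetical `InvertedPageTable` class with one entry per physical frame. The linear search is only for illustration. Since a linear scan would be expensive, a hash table over (process, virtual page) is often built on top of the base structure to speed up lookups.

```python
class InvertedPageTable:
    """One entry per physical frame: which (pid, vpn) currently occupies it."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames  # frame number -> (pid, vpn) or None

    def map(self, pid, vpn):
        """Place (pid, vpn) in the first free frame and return the frame number."""
        for frame, entry in enumerate(self.frames):
            if entry is None:
                self.frames[frame] = (pid, vpn)
                return frame
        raise MemoryError("no free physical frame")

    def translate(self, pid, vpn):
        """Return the frame holding (pid, vpn), or None to signal a page fault."""
        for frame, entry in enumerate(self.frames):
            if entry == (pid, vpn):
                return frame
        return None
```

The table size depends only on the amount of physical memory (`num_frames`), not on the number of processes or the size of their address spaces.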

Disk sector, track, and cylinder.

A disk sector refers to a 512-byte block on the disk (as an aside, drive manufacturers guarantee that a single 512-byte write is atomic). Sectors of a disk are organized as a set of concentric circles; each concentric circle of sectors is called a track ⁷. A cylinder is a set of tracks on different surfaces of a hard drive that are the same distance from the center of the drive; it is called a cylinder because of its clear resemblance to the geometrical shape of that name ⁸.

Thrashing.

The situation in which the memory demands of the set of running processes exceed the available physical memory. System performance degrades because CPU time is dominated by swapping pages in and out between physical memory and swap space on disk ⁹.

Short answer

Provide a short answer to each of the following questions:

Give a reason why virtual memory address translation is useful even if the total size of virtual memory (summed over all programs) is guaranteed to be smaller than physical memory.

Virtual memory address translation is useful in providing isolation (safety) across different processes: the OS can make sure that one process cannot access the part of physical memory that is owned by another process. In addition, address translation helps the OS better manage the underlying physical memory: every process thinks its address space starts at 0, which may not be true in physical memory; for better memory management, the OS may relocate the address space to some other physical address. Address translation enables the OS to perform such memory management while keeping the relocation transparent to the process ¹⁰.

Compare and contrast access control lists and capabilities.

In an access control list model of security, the authorities are bound to the objects being secured (e.g., file A can be r/w by user Alice). By contrast, in the capabilities model, the authorities are bound to objects seeking access ¹¹ (e.g., UNIX file descriptors are capabilities: authorities are bound to file descriptors, which reflect rights on objects like files or sockets ¹²).

As one can see, with the ACL model, each object has just one list, while with the capability model, each subject holds a whole set of different, separable capabilities. For example, in the capability model, a process holds different capabilities depending on the objects it tries to access.

This page has a nice example: file "aaa" has ACL as "Alice:R/W, Bob:R, Carol:R". File "aaa" is the object being secured. "Alice" has capabilities "aaa:R/W, bbb:R, ccc:R". "Alice" as a user is the object seeking access.

The length of the time slice is a parameter to round robin CPU scheduling. What is the main problem that occurs if this length is too long? Too short?

There are two important metrics in scheduling: turnaround time and response time. The turnaround time of a job is defined as $T_{completion} - T_{arrival}$ (the time at which the job completes minus the time at which the job arrived in the system). The response time is defined as $T_{firstrun} - T_{arrival}$ (the time from when the job arrives in the system to the first time it is scheduled). Round robin (RR) CPU scheduling is optimized for response time. If the time slice is too long, response time gets worse. By contrast, if the time slice is too short, the cost of context switching dominates overall performance, which means less useful work is done during each time slice ¹³.
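
The trade-off can be made concrete with a small round-robin simulation in Python. It is a simplified sketch: all jobs are assumed to arrive at time 0, `switch_cost` is a hypothetical fixed cost charged between consecutive slices (even when the same job continues), and the function name and return shape are made up for illustration.

```python
from collections import deque

def round_robin(bursts, time_slice, switch_cost=0):
    """Simulate RR for jobs that all arrive at t=0.

    Returns (average response time, average turnaround time, finish time).
    """
    queue = deque(enumerate(bursts))
    first_run, completion = {}, {}
    clock = 0
    while queue:
        job, remaining = queue.popleft()
        first_run.setdefault(job, clock)   # record the first time the job runs
        run = min(time_slice, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((job, remaining))
        else:
            completion[job] = clock
        if queue:
            clock += switch_cost           # overhead before the next slice starts
    n = len(bursts)
    return sum(first_run.values()) / n, sum(completion.values()) / n, clock
```

For three 5-unit jobs, a slice of 1 gives an average response time of 1 versus 5 for a slice of 5. With `switch_cost=1`, however, the short slice finishes everything at time 29 instead of 17, showing how switching overhead starts to dominate.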

List an advantage and a disadvantage to increasing the virtual memory page size.

The advantages of increasing the virtual memory page size are a smaller page table ¹⁴ and a better TLB hit rate ¹⁵. The disadvantage is internal fragmentation (i.e., waste within each page; the waste is internal to the unit of allocation) ⁶.

Virtual memory addressing

Suppose we have a machine that uses a three-level page table system. A 32-bit virtual address is divided into four fields of widths a, b, c, and d bits from left to right. The first three are indices into the three levels of page table; the fourth, d, is the offset. Express the number of virtual pages available as a function of a, b, c, and d. Give one advantage and one disadvantage to multi-level page tables.

Since we have a three-level page table system, $a$ corresponds to Page Directory Index (PDI) 0, $b$ corresponds to PDI 1, and $c$ corresponds to the Page Table Index. Each page-table page contains $2^c$ Page Table Entries (PTEs). Each Page Directory 1 needs one entry per page-table page and contains $2^b$ entries. Similarly, Page Directory 0 needs one entry per Page Directory 1 and contains $2^a$ entries. Thus, there are $2^a$ Page Directory 1 tables, each of which can point to $2^b$ page-table pages, each of which contains $2^c$ PTEs, and each PTE corresponds to one virtual page. Therefore, there are $2^{a+b+c}$ virtual pages. Another way to calculate this: the physical frame size is $2^d$, which is equal to the page size. Since the address space is 32 bits, there are $\frac{2^{32}}{2^d} = 2^{32-d}$ virtual pages, which agrees because $a + b + c + d = 32$ ¹⁶.
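
As a quick check of the counting argument, the following Python sketch splits a 32-bit virtual address into the four fields and counts the virtual pages. The function names and the example widths (4, 8, 8, 12) are arbitrary choices for illustration, not part of the question.

```python
def split_virtual_address(vaddr, a, b, c, d):
    """Split a 32-bit address into (PDI 0, PDI 1, page table index, offset)."""
    if a + b + c + d != 32:
        raise ValueError("field widths must sum to 32")
    offset = vaddr & ((1 << d) - 1)
    pt_index = (vaddr >> d) & ((1 << c) - 1)
    pdi1 = (vaddr >> (d + c)) & ((1 << b) - 1)
    pdi0 = (vaddr >> (d + c + b)) & ((1 << a) - 1)
    return pdi0, pdi1, pt_index, offset

def num_virtual_pages(a, b, c, d):
    """Every distinct (a, b, c) index triple names one virtual page."""
    return 2 ** (a + b + c)
```

With widths (4, 8, 8, 12), the address 0xABCDE123 splits into (0xA, 0xBC, 0xDE, 0x123), and there are $2^{20}$ virtual pages, equal to $2^{32} / 2^{12}$.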

The advantages of multi-level page tables are that:

The multi-level table only allocates page-table space in proportion to the amount of address space we are using (more compact to support sparse page table)
Each portion of the page table fits into a page, which makes memory management easier: we don't have to find a contiguous physical memory chunk that can contain the linear page table; we can place page-table pages wherever we want in physical memory.

The disadvantage is that additional memory accesses are needed to look up a valid translation (e.g., for a 3-level page table system, we have three additional memory accesses: access Page Directory Entry 0 (PDE0), access PDE1, access PTE)¹⁷.

Page replacement

Suppose a machine with 4 physical pages starts running a program (in other words, the physical pages are initially empty). The program references the sequence of virtual pages as follows: A B C D E D C B A E D C B A C E For each of the following paging algorithms, replicate the reference pattern and underline each reference that causes a page fault (or make references that cause a page fault uppercase, and those that don’t lowercase):

LRU.

LRU stands for Least-Recently-Used, which is one of the page replacement policies (in practice, we use the clock algorithm to approximate LRU ¹⁸ to avoid heavy overhead). The trace of the memory references with the LRU policy is shown below:

Access	Hit/Miss	Evict	Resulting Cache State (LRU at front, MRU at tail)
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,E,D
C	Hit		B,E,D,C
B	Hit		E,D,C,B
A	Miss	E	D,C,B,A
E	Miss	D	C,B,A,E
D	Miss	C	B,A,E,D
C	Miss	B	A,E,D,C
B	Miss	A	E,D,C,B
A	Miss	E	D,C,B,A
C	Hit		D,B,A,C
E	Miss	D	B,A,C,E

From the table, we have A,B,C,D,E,d,c,b,A,E,D,C,B,A,c,E.

FIFO.

FIFO means first-in, first-out: pages are placed in a queue in the order they are brought into physical memory; when replacement occurs, the first-in page is evicted. The corresponding trace of memory references with the FIFO policy is shown below:

Access	Hit/Miss	Evict	Resulting Cache State
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,D,E
C	Hit		B,C,D,E
B	Hit		B,C,D,E
A	Miss	B	C,D,E,A
E	Hit		C,D,E,A
D	Hit		C,D,E,A
C	Hit		C,D,E,A
B	Miss	C	D,E,A,B
A	Hit		D,E,A,B
C	Miss	D	E,A,B,C
E	Hit		E,A,B,C

Thus, we have A,B,C,D,E,d,c,b,A,e,d,c,B,a,C,e.

Optimum.

Optimal policy means we want to replace the page that will be accessed furthest in the future and doing so will result in the optimal policy with fewest possible cache misses. The resulting trace with optimal policy follows:

Access	Hit/Miss	Evict	Resulting Cache State
A	Miss		A
B	Miss		A,B
C	Miss		A,B,C
D	Miss		A,B,C,D
E	Miss	A	B,C,D,E
D	Hit		B,C,D,E
C	Hit		B,C,D,E
B	Hit		B,C,D,E
A	Miss	B	C,D,E,A
E	Hit		C,D,E,A
D	Hit		C,D,E,A
C	Hit		C,D,E,A
B	Miss	D	C,E,A,B
A	Hit		C,E,A,B
C	Hit		C,E,A,B
E	Hit		C,E,A,B

We have A,B,C,D,E,d,c,b,A,e,d,c,B,a,c,e.
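
The three hand-built traces can be reproduced with a short Python simulator. This is a sketch written for this exercise, with a hypothetical `simulate` function and policy names; it prints page faults in uppercase and hits in lowercase, following the convention from the question.

```python
def simulate(refs, frames, policy):
    """Replay `refs` with `frames` physical pages under "lru", "fifo", or "opt".

    Returns the reference string with faults uppercase and hits lowercase.
    """
    cache, out = [], []  # lru: least recent at front; fifo/opt: oldest arrival at front
    for i, page in enumerate(refs):
        if page in cache:
            out.append(page.lower())
            if policy == "lru":
                cache.remove(page)   # move to the most-recently-used end
                cache.append(page)
            continue
        out.append(page.upper())
        if len(cache) == frames:
            if policy == "opt":
                future = refs[i + 1:]
                # Evict the page whose next use is furthest away (never used = furthest).
                victim = max(cache, key=lambda p: future.index(p) if p in future else len(future) + 1)
            else:
                victim = cache[0]    # LRU page or first-in page
            cache.remove(victim)
        cache.append(page)
    return ",".join(out)

REFS = "A B C D E D C B A E D C B A C E".split()
```

Running `simulate(REFS, 4, policy)` for each policy yields the same strings as the tables: 12 faults for LRU, 8 for FIFO, and 7 for the optimal policy.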

Multiprocessing

Suppose you have a large source program containing m files that you want to compile. You have a cluster of n “shared-nothing” workstations, where n > m, on which you may compile your files. At best you will get an m-fold speedup compared to a single processor. List at least three reasons as to why the actual speedup might be less than this.

Interconnection overhead. Distributed compilation incurs network communication cost and coordination cost: files must be assigned to workstations, and machines must communicate during compilation (e.g., a file on one workstation may depend on files assigned to other machines, so copies of those files must be fetched before the file can be compiled into an object file).
The computation graph may contain steps that cannot be parallelized (e.g., linking).
Resource limitations. We have a cluster of n workstations, but we may not have full use of the underlying resources (i.e., our jobs may be subject to the cluster coordinator's scheduling).
File sizes are unequal. Suppose we have 5 files to compile: the first 4 each need 1 s of compilation time while the last one needs 10 minutes. Then distributed compilation makes almost no difference: the last file becomes the bottleneck and we cannot get a 5-fold speedup.
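
As a rough worked version of the unequal-file example above (assuming 10 minutes = 600 s, one file per workstation, and no communication or linking overhead at all):

$$
T_{\text{serial}} = 4 \cdot 1 + 600 = 604\ \text{s}, \qquad
T_{\text{parallel}} = \max(1, 1, 1, 1, 600) = 600\ \text{s}, \qquad
\text{speedup} = \frac{604}{600} \approx 1.007
$$

Even in this idealized setting the speedup is about 1.007 rather than 5, because the parallel time is bounded below by the longest single compilation.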

Achieving fast file reads

(a) Give at least three strategies that a file system can employ to reduce the time a program spends waiting for data reads to complete. (b) For each strategy you listed, describe a read pattern for which the strategy would do well, and one for which the strategy would do poorly

Use a cache (e.g., the buffer cache or the page cache of the virtual memory system) and, on open(), pre-read the requested data block together with several subsequent data blocks. Later reads can then be served from memory instead of disk, which avoids expensive disk I/O.
- Sequential read performs well
- Random read performs poorly
We can compress the data when we store it in the first place. A smaller file needs less disk I/O to read, at the cost of extra CPU time to decompress the data, which is usually cheaper than the disk I/O it replaces.
- Big read performs well
- Small read performs poorly
We can use RAID Level 0 (i.e., striping) to spread the data blocks across the disks in a round-robin fashion. When we read the data, we can use multiple disks in parallel.
- Sequential read performs well
- Random read performs poorly
We can place a file's inode close to the file's data blocks so that reads incur less seek and rotational delay on an HDD.
- Sequential read performs well
- Random read performs poorly
Prefetching: we can try to predict what data/files the user will read next and buffer the file content into memory.
- Sequential read performs well
- Random read performs poorly
Build an index of the most visited files (i.e., direct pointers to the files) so that we don't have to traverse the internal data structures of the file system (similar in spirit to a TLB).
- Read hot data performs well
- Read cold data performs poorly

Synchronization

Your OS has a set of queues, each of which is protected by a lock. To enqueue or dequeue an item, a thread must hold the lock associated with the queue. You need to implement an atomic transfer routine that dequeues an item from one queue and enqueues it on another. The transfer must appear to occur atomically. This is your first attempt:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
    queue1->lock.Release();
}

You may assume that queue1 and queue2 never refer to the same queue. Also, assume that you have a function Queue::Address() which takes a queue and returns, as an unsigned integer, its address.

(a) Explain how using this implementation of transfer() can lead to deadlock

One possible scenario of deadlock using transfer() is illustrated below. Note that queue1 and queue2 in transfer() are replaced with queues that they are pointing to (time increases with the row's number):

Thread1 (`transfer(q1,q2)`)	Thread2 (`transfer(q2,q1)`)
`q1->lock.Acquire();`
`thing = q1->Dequeue();`
`if (thing != NULL)`
interrupt: switch to Thread2
	`q2->lock.Acquire();`
	`thing = q2->Dequeue();`
	`if (thing != NULL)`
	`q1->lock.Acquire(); // block & wait on q1's lock`
	interrupt: switch to Thread1
`q2->lock.Acquire(); // block & wait on q2's lock`

(b) Write a modified version of transfer() that avoids deadlock and does the transfer atomically.

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    pthread_mutex_lock(&m);
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
    queue1->lock.Release();
    pthread_mutex_unlock(&m);
}

Alternatively, we can do:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    /* m only protects lock acquisition: at most one transfer() at a time
       can be in the middle of collecting its two queue locks. */
    pthread_mutex_lock(&m);
    queue2->lock.Acquire();
    queue1->lock.Acquire();
    pthread_mutex_unlock(&m);
    thing = queue1->Dequeue();
    if(thing != NULL) {
        queue2->Enqueue(thing);
    }
    /* release both locks whether or not an item was moved */
    queue2->lock.Release();
    queue1->lock.Release();
}
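
The question's hint about Queue::Address() suggests a third option: always acquire the two queue locks in a fixed global order, so that no cycle of waiting threads can form. Below is a minimal sketch of that idea in plain C with pthreads. The `Queue` struct, `QUEUE_CAP`, `QUEUE_INIT`, `queue_enqueue`, and `queue_dequeue` are hypothetical stand-ins for the exam's queue class, and comparing the queue addresses plays the role of Queue::Address().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16   /* assumed fixed capacity of the toy queue */

typedef struct {
    pthread_mutex_t lock;
    void *items[QUEUE_CAP];
    size_t head, count;
} Queue;

#define QUEUE_INIT { PTHREAD_MUTEX_INITIALIZER, { 0 }, 0, 0 }

/* Caller must hold q->lock. Returns 0 on success, -1 if the queue is full. */
static int queue_enqueue(Queue *q, void *item)
{
    if (q->count == QUEUE_CAP) return -1;
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    return 0;
}

/* Caller must hold q->lock. Returns NULL if the queue is empty. */
static void *queue_dequeue(Queue *q)
{
    if (q->count == 0) return NULL;
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return item;
}

/* Atomically move one item from `from` to `to`.
 * Returns 1 if an item was moved, 0 otherwise. */
int transfer(Queue *from, Queue *to)
{
    assert(from != to);
    /* Lock the queue with the smaller address first. Every thread uses
       the same order, so two transfers in opposite directions cannot
       each hold one lock while waiting for the other. */
    Queue *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
    Queue *second = (first == from) ? to : from;
    int moved = 0;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    if (to->count < QUEUE_CAP) {          /* never drop an item on a full target */
        void *thing = queue_dequeue(from);
        if (thing != NULL) {
            queue_enqueue(to, thing);
            moved = 1;
        }
    }
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
    return moved;
}
```

Compared with the single global mutex, transfers between disjoint pairs of queues can proceed in parallel here, since no lock is shared among them.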

(c) If the transfer does not need to be atomic, how might you change your solution to achieve a higher degree of concurrency? Justify why your modification increases concurrency.

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock1.Acquire();  /* lock1 for Dequeue() */
    thing = queue1->Dequeue();
    queue1->lock1.Release(); 
    if(thing != NULL) {
        queue2->lock2.Acquire(); /* lock2 for Enqueue() */
        queue2->Enqueue(thing);
        queue2->lock2.Release();
    }
}

This is more fine-grained than the previous approach: previously only one thread could touch a queue at a time, whereas now two threads can work on the same queue at once, one enqueueing and one dequeueing.

Alternatively, we can do:

void transfer(Queue *queue1, Queue *queue2)
{
    Item thing; /* the thing being transferred */
    queue1->lock.Acquire();
    thing = queue1->Dequeue();
    queue1->lock.Release();    
    if(thing != NULL) {
        queue2->lock.Acquire();
        queue2->Enqueue(thing);
        queue2->lock.Release();
    }
}

Each lock is held for a shorter time, and the function never holds more than one lock at a time, so a transfer blocks at most one queue at any moment.

Networking

You are developing a network protocol for the reliable delivery of fixed-sized messages over unreliable networks. You are using a sequence number in each message to allow the receiver to eliminate duplicates, but you still have three design alternatives to consider. The design alternatives are: (1) the sender must receive an acknowledgement for the previously sent message before it can send the next message in the sequence, (2) the sender can transmit up to n unacknowledged messages, but the receiver will discard any messages that are received out of sequence (in other words, it will only acknowledge a message if it is received in sequence), and (3) the sender can transmit up to n unacknowledged messages, and the receiver will acknowledge each on receipt, even if they arrive out of order. For each alternative, answer the following questions

(a) Explain what state the receiver must keep around to implement each of the three alternatives (remember, the receiver must be able to detect and discard duplicates).

For 1), we keep track of expected sequence # (or last received sequence #)
For 2), we keep track of expected sequence # (or last received sequence #)
For 3), we keep a buffer of size $n$, ordered by sequence #, so that duplicates can be detected. If a duplicate arrives, it is discarded and an ack carrying the missing sequence # is sent back to the sender. The receiver advances the lower-bound sequence # of the buffer only when the buffered messages below the new lower bound have all arrived (invariant: every sequence # smaller than the lower bound has been received in order, with none missing).
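
The receiver state for alternative (3) can be sketched as a sliding window of "already received" flags. This is one possible shape rather than a full protocol: `Receiver`, `receiver_accept`, and the window size `WINDOW` are hypothetical, sequence-number wraparound is ignored, and sending the acks themselves is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW 8   /* n: maximum number of unacknowledged messages (assumed) */

typedef struct {
    uint32_t base;           /* lowest sequence number not yet received */
    bool received[WINDOW];   /* received[i] describes sequence number base + i */
} Receiver;

/* Record the arrival of message `seq`.
 * Returns true if the message is new and should be kept,
 * false if it is a duplicate or lies beyond the window. */
bool receiver_accept(Receiver *r, uint32_t seq)
{
    assert(r != NULL);
    if (seq < r->base)
        return false;                 /* below the lower bound: duplicate */
    if (seq >= r->base + WINDOW)
        return false;                 /* sender exceeded the window */

    uint32_t off = seq - r->base;
    if (r->received[off])
        return false;                 /* duplicate of a buffered message */
    r->received[off] = true;

    /* Slide the lower bound past every message that is now in sequence. */
    while (r->received[0]) {
        memmove(r->received, r->received + 1, (WINDOW - 1) * sizeof(bool));
        r->received[WINDOW - 1] = false;
        r->base++;
    }
    return true;
}
```

For alternatives (1) and (2) the same struct degenerates to the single `base` field: a message is accepted only when `seq == base`.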

(b) Suppose two machines are communicating across a bi-directional network link with fixed-size messages. Acknowledgement packets are also fixed-size. One machine, the sender, is transmitting a very large file to the other machine, the receiver. That is, the sender is transmitting a large number of messages that make up this file. Discuss briefly how each alternative’s data bandwidth is expected to vary given different underlying network characteristics.

If no packets are lost in transit, (3) has the highest data bandwidth because the maximum number of packets is in flight and the receiver acknowledges packets even when they arrive out of order. By contrast, (1) has the lowest data bandwidth because the sender can only send a packet after the previous ack has been received. If loss is rare, (3) still has the highest data bandwidth. If there is loss, the more severe the loss, the better (1) fares relative to the others. (3) is the worst in this case because it performs a lot of useless work (it sends out n packets that get dropped anyway, which leads to congestion).

When latency is small, (3) is the most effective and (1) is the worst. When latency is large, (1) is the most effective and (2) is the worst: congestion is possible, packets are very likely to arrive out of order and be discarded by the receiver, and the sender then has to resend them.

When everything is good, (2) and (3) are similar.

Serving multiple clients

There are two main approaches to organizing a server daemon, such as a web server: a) Create a new kernel thread for each client (for each web browser connection); b) Use a single process responding to all clients, usually based on the select() system call. Compare and contrast these two approaches.

Multi-threaded architecture ¹⁹:

A pool of worker threads handle incoming requests
One worker thread usually corresponds to one connection
A dispatcher thread blocks on the socket waiting for connections; once a connection is established, it is placed on a queue of connections, from which worker threads take connections and handle the requests
The queue of connections is bounded: the number of connections waiting to be handled is limited, and extra connections are rejected
Predictable latency; prevents total overload
Can utilize multiple CPU cores (*)
A synchronous, blocking I/O architecture

Event-based architecture:

Asynchronous, non-blocking
one process (thread) handles multiple connections
Events are queued and will be dequeued by the event loop (single thread)
Each event has corresponding event handler
Usually the event is handled in a cascade of callbacks
Depending on the implementation, the event can be handled in the same thread as the event loop (i.e., the event loop is blocked while the event is handled), or a pool of threads can be used for the event handlers
Compared to the multi-threaded architecture, fewer threads are used (consequently, there is no excessive context-switching overhead and no per-connection thread stack); the event-based architecture scales with increasing throughput; under overload, request latency increases linearly

The key difference between these two architectures is who does the scheduling: in the multi-threaded architecture, the OS performs the scheduling by context-switching between worker threads. In the event-based architecture, the event loop thread essentially works as a scheduler, multiplexing multiple connections onto a single flow of execution. In addition, for the event-based architecture, we need to implement preemption for the event-handler thread pool ourselves.

The kernel-thread approach can also lead to a system crash more easily because all the work is done in kernel space. Moreover, kernel threads have higher context-switch overhead than threads within the single process of the second approach ¹². However, kernel threads can scale with the number of CPU cores, while the single-process approach cannot benefit from multiple cores.
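
To make the select()-based approach a bit more concrete, here is a minimal sketch of the multiplexing step: one thread hands a set of descriptors to select() and learns which one is ready, instead of parking one blocked thread per client. The helper name `wait_readable` is made up for this illustration; in a real server the descriptors would be client sockets, but any readable descriptor (e.g., the read end of a pipe) behaves the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until at least one of fds[0..n-1] is readable or the timeout expires.
 * Every descriptor must be smaller than FD_SETSIZE.
 * Returns the index of the first readable descriptor,
 * or -1 on timeout or error. */
int wait_readable(const int *fds, int n, int timeout_ms)
{
    assert(fds != NULL && n > 0);
    fd_set readset;
    int maxfd = -1;

    FD_ZERO(&readset);
    for (int i = 0; i < n; i++) {
        FD_SET(fds[i], &readset);
        if (fds[i] > maxfd) maxfd = fds[i];
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() blocks once on behalf of all connections. */
    int ready = select(maxfd + 1, &readset, NULL, NULL, &tv);
    if (ready <= 0)
        return -1;

    for (int i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &readset))
            return i;
    return -1;
}
```

An event loop would call such a helper repeatedly and dispatch to the handler of whichever connection became readable; the handler must not block, or every other connection stalls with it.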

taken from OSTEP Chapter 19 ↩
taken from Phil Storrs PC Hardware book ↩
Alison's CS439 Lec 02 and CSAPP (2nd edition) 8.1.2 ↩
See [@@Protic:1996] and Wikipedia ↩
See OSTEP Chapter 21 ↩
taken from OSTEP Chapter 20. More details found in Alison's CS439 Lec 12 "Virtual Memory: Mechanisms" ↩↩
See OSTEP Chapter 37 ↩
See OSTEP Figure in Chapter 41.3 ↩
The term is originally defined in [@@Denning:1968:TCP:1476589.1476705]. But I find this article's explanation is more intuitive. ↩
See OSTEP Chapter 15 for more details. ↩
See Capabilities and ACLs ↩
See [@@shapiro1999eros] and [@@watson2010capsicum] ↩↩
See OSTEP Chapter 7. ↩
One calculation example can be found in OSTEP Chapter 20.1. ↩
See OSTEP Chapter 19.2 example. ↩
OSTEP Chapter 20, Chapter 18, and this page may help with the calculation. ↩
See OSTEP Chapter Figure 20.6. ↩
See OSTEP Chapter 22 ↩
See Concurrent Programming for Scalable Web Architectures ↩
See CSE451 ↩

Modify char in another function

2018-09-26T10:00:00+08:00

Almost two years ago, I wrote a post on how to modify an array in one function through another function in C. I did a pretty detailed study through GDB there, but I find that the illustration is lengthy to read. In this post, I try to show the same concept using char *. Hopefully, I do a better job this time.

Problem
Explanation

Problem

We are given the following program:

#include <stdio.h>

// modify this function
void function()
{

}

int main()
{
    char* s;
    function(); // modify here
    puts(s);
    return 0;
}

We want to implement function() such that we can print out Hello World! to the screen. The result of the modification looks like below:

#include <stdio.h>

// modify this function
void function(char** c)
{
    *c = "Hello World!";
}

int main()
{
    char* s;
    function(&s); // modify here
    puts(s);
    return 0;
}

The question we want to answer is: why does doing so work?

Explanation

We acquire the key data from GDB as follows:

GDB command	result
`p s`	`0x7fff5fbff360 ""`
`p &s`	`(char **) 0x7fff5fbff340`
`p c`	`(char **) 0x7fff5fbff340`
`p *c`	`0x7fff5fbff360 ""`
`p c`	`(char **) 0x7fff5fbff340`
`p *c`	`0x100000f8e "Hello World!"`

Note that the last two commands are executed after *c = "Hello World!";. The state of the variables on the stack is shown below:

Note that one can think of a variable in C as an alias for some virtual memory address. In other words, the variable s and the address 0x7fff5fbff340 are the same thing, and we use the variable as a shortcut to reference that address. For a given variable name, we can get its address by using & (i.e., when & is used, the address of that variable is returned instead of the variable itself). In our case, &s is 0x7fff5fbff340. s itself is a pointer, which by definition contains a memory address instead of a value. In our case, the memory address in s is 0x7fff5fbff360, which contains "" (note that this value is indeterminate because s is uninitialized: it could be anything).

We pass &s into the function because, if we modify the content at the address 0x7fff5fbff340 (i.e., the address represented by &s) inside the function, we can still reference 0x7fff5fbff340 once the function exits. That is because we can still access s, and s and 0x7fff5fbff340 are the same thing. Whatever change is made to the content at 0x7fff5fbff340 is visible through s as well. Since s has type char*, &s naturally has type char**. Another way of understanding char** is that we want to change the value of the passed-in argument, and C passes arguments by copying their values. Thus, we need to pass in a pointer to that value, not just the value itself.

Inside the function, we modify the content at the address 0x7fff5fbff340 by dereferencing c (i.e., *c), which holds a copy of 0x7fff5fbff340. After *c = "Hello World!";, the content at the address 0x7fff5fbff340 has changed to 0x100000f8e, which contains "Hello World!". Once we are done with the function and back in main, since s is the alias for 0x7fff5fbff340 and 0x7fff5fbff340 contains the address of "Hello World!", our task is accomplished.
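
The pass-by-copy point can be checked with a small side-by-side comparison. This is only a sketch: `set_copy` and `set_through_pointer` are names invented for the illustration, and the pointers are declared const char * because they end up pointing at string literals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Receives a copy of the caller's pointer: the assignment below changes
 * only the local copy, so the caller's pointer is left untouched. */
void set_copy(const char *c)
{
    c = "Hello World!";
    (void)c;   /* silence the "value never used" warning */
}

/* Receives the address of the caller's pointer: writing through *c
 * overwrites the caller's pointer itself. */
void set_through_pointer(const char **c)
{
    assert(c != NULL);
    *c = "Hello World!";
}
```

Starting from `const char *s = "before";`, calling `set_copy(s)` leaves s pointing at "before", while `set_through_pointer(&s)` makes it point at "Hello World!".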

Note

One thought I had when I finished this post was: why can't I pass s instead of &s? If s contains some address (say 0xab), we could modify the content at that address (0xab) to be "Hello World!", so it seems that there is another option we can use. However, as pointed out by others, the problem is that s is uninitialized: whatever we do with the address contained in s is undefined behavior. Undefined behavior means the program has no predictability: anything can happen. Thus, even if we manage to print out the string, we still consider doing so wrong.

Hope this short writeup helps!

--- 10/15/19 UPDATE ---

Additional perspective to understand why &s works: a pointer is just a regular variable that holds the memory address of another variable. Now, instead of holding the memory address of some random content (e.g., 0x7fff5fbff360), we want it to hold the memory address of the string "Hello World!" (e.g., 0x100000f8e). A natural choice is to pass in the memory address of the variable whose value we want to modify, which in this case is &s.

Graph basics + Topological Sort

2018-07-14T22:30:00+08:00

Introduction
Graph Basics
- Definitions
- Representation
Topological Sort
Reference

Introduction

In this post, we briefly summarize the basic concepts related to graph algorithms. Then, we study topological sort to put the graph concepts into practice.

Graph Basics

Definitions

A graph $G = (V,E)$ consists of a set of vertices $V$, and a set of edges $E$.
Each edge (i.e., arcs) is a pair $(v,w)$, where $v,w \in V$.
Given an edge $e = (u,v)$, the vertex $u$ is its source, and $v$ is its sink.
If the pair is ordered, then the graph is directed (i.e., digraphs).
Vertex $w$ is adjacent to $v$ iff $(v,w) \in E$.

Note

In a digraph, $(v,w)$ is not the same as $(w,v)$. Thus, if $(v,w) \in E$ and $(w,v) \not\in E$, $w$ is adjacent to $v$ but $v$ is NOT adjacent to $w$. However, for an undirected graph, if $(v,w) \in E$, then $(w,v) \in E$. Thus, if $w$ is adjacent to $v$, then $v$ is adjacent to $w$ in an undirected graph.

A path in a graph is a sequence of vertices $w_1, w_2, w_3, \dots, w_N$ such that $(w_i, w_{i+1}) \in E$ for $1 \le i < N$. The length of such path is the number of edges on the path, which is equal to $N - 1$.

Note

We allow a path from a vertex to itself
- If this path contains no edge, then the path length is 0
- If the graph contains an edge $(v,v)$ from a vertex to itself, then the path $v,v$ is referred to as a loop

A simple path is a path such that all vertices are distinct, except that the first and last could be the same.
If there is a path from $u$ to $v$, $v$ is said to be reachable from $u$.
A cycle in a graph:
- For directed graph, a cycle is a path of length at least 1 such that vertices $w_1 = w_N$.
- For undirected graph, we require edges to be distinct
  - reasoning: the path $u,v,u$ in an undirected graph should not be considered a cycle because $(u,v)$ and $(v,u)$ are the same edge.
A directed acyclic graph (DAG) is a directed graph in which there are no cycles (i.e., paths which contain one or more edges and which begin and end at the same vertex)
- Vertices in a DAG which have no incoming edges are referred to as sources
- Vertices which have no outgoing edges are referred to as sinks
connected:
- An undirected graph is connected if there is a path from every vertex to every other vertex.
- A directed graph is connected if it contains a directed path from $u$ to $v$ or a directed path from $v$ to $u$ for every pair of vertices $u$ and $v$.
A directed graph is strongly connected if it contains a directed path from $u$ to $v$ and a directed path from $v$ to $u$ for every pair of vertices $u$ and $v$.
If a directed graph is not strongly connected, but the underlying graph (without direction to the arcs) is connected, then the graph is said to be weakly connected.
For a graph $G$, a connected component is a maximal set of vertices $C$ such that each pair of vertices in $C$ is connected in $G$. Every vertex belongs to exactly one connected component.
A complete graph is a graph in which there is an edge between every pair of vertices.

Note

A tree is a special sort of graph - it is an undirected graph that is connected but has no cycles. Given a graph $G = (V, E)$, if the graph $G' = (V, E')$ where $E' \subset E$, is a tree, then $G'$ is referred to as a spanning tree of $G$.

Indegree of a vertex $v$ is the number of edges $(u,v)$
Outdegree of a vertex $v$ is the number of edges $(v,u)$

Representation

Adjacency matrix: use a $|V| \times |V|$ matrix indexed by vertices, with a 1 indicating the presence of an edge (i.e. For each edge $(u, v)$, we set A[u][v] to true; otherwise the entry in the array is false). If the edge has a weight associated with it, then we can set A[u][v] equal to the weight and use either a very large or a very small weight as a sentinel to indicate nonexistent edges.
- Disadvantage: it always takes $\Theta(|V|^2)$ space, which only pays off when the graph is dense ($|E| = \Theta(|V|^2)$), and that is very unlikely.
Adjacency list: For each vertex, we keep a list of all adjacent vertices. For undirected graph, each edge $(u,v)$ appears in two lists
- Advantage: only requires $O(|E|+|V|)$ space.
Edge lists: we represent the graph as an array of $|E|$ edges. For example, an undirected edge that connects $0$ and $1$ can be represented as [0,1].

Note

Check out Khan Academy::Computer Science::Algorithms::Representing graphs for a nice example.

Topological Sort

Definition and Properties

We have following two equivalent definitions:
- Def 1: A topological sort is an ordering of vertices in a DAG such that if there is a path from $v_i$ to $v_j$, then $v_j$ appears after $v_i$ in the ordering.
- Def 2: A topological ordering of a DAG $G$ is a labeling $f$ of $G$'s nodes such that:
  - The $f(v)$'s are the set $\{1,2, \dots, n\}$
  - $(u,v) \in G \implies f(u) < f(v)$

Application: sequence tasks while respecting all precedence constraints. (e.g. course prerequisite structure can be represented as a graph. A topological ordering of these courses is any course sequence that does not violate the prerequisite requirement.)
If G has a cycle, there is no topological sort: for two vertices $v$ and $w$ on the cycle, $v$ would have to precede $w$ and $w$ would have to precede $v$. On the other hand, if there is no directed cycle in the graph, we can compute a topological sort in linear time ($O(|V|+|E|)$).
A topological sort is not necessarily unique, as shown in the picture above.

DFS Approach

The basic idea of computing the topological ordering is as follows:

Let $v$ be a sink vertex of $G$
set $f(v) = n$
recurse on $G - \{v\}$

There are some proofs we need to show for the correctness of the procedure:

Every directed acyclic graph has a sink vertex

Suppose the DAG doesn't have a sink vertex. Then every single vertex has at least one outgoing arc. We can start at an arbitrary vertex and follow an outgoing arc to the next vertex. Since there is no sink vertex in the graph, we can keep following outgoing arcs indefinitely. Suppose there are $N$ nodes in the graph. After following $N$ edges, we have visited $N+1$ vertices, but only $N$ of them can be distinct. By the pigeonhole principle, we must have visited some vertex twice. The part of the walk between the two visits is a directed cycle, which is a contradiction.
During each recursion step, we can find a sink vertex

For a DAG, if we delete one or some of the vertices, we still have DAG (i.e., we cannot create a directed cycle). Thus, in each recursion step, we always have DAG. Then, by the previous observation, during each recursion step, we can find a sink vertex.
The above steps do produce topological ordering

In a topological ordering, all the edges have to go forward. Intuitively, we always want to assign a sink vertex of the graph to the final position: any other node has an outgoing arc, and the node that the arc points to would have to be assigned a lower position, which violates the topological ordering (i.e., the edge goes backward). In our procedure, when a node $v$ is assigned to position $i$, only $i$ nodes remain and $v$ is a sink among them. This implies that all the nodes that $v$'s outgoing arcs point to have already been deleted and assigned higher positions. So for every vertex, by the time it actually gets assigned a position, it is a sink and it only has incoming arcs from the as-yet-unassigned vertices. Its outgoing arcs all go forward to vertices that were already assigned higher positions and deleted from the graph.

To implement the procedure above, we use the DFS:

There are several points we need to note here:

We set $f(s) = \text{current\_label}$ right before we pop the call stack. At that point, every edge $(s,v)$ leads to a vertex $v$ that has already been explored and labelled. That means $s$ has no outgoing edges into the unlabelled part of the graph, so $s$ is a sink of the remaining graph and we can assign it a label.
Running time: $O(|E|+|V|)$ (we only visit each vertex in the graph once and we look at each edge once as well)
Correctness: we want to show that this DFS algorithm produces a topological ordering, i.e., that for every edge $(s,v)$, $f(s) < f(v)$. There are two possible cases for DFS: 1) $s$ gets visited first; 2) $v$ gets visited first. In the first case, since there is an edge from $s$ to $v$, the DFS call on $v$ finishes before the DFS call on $s$ does. Thus, $v$ gets a label larger than $s$ and the topological ordering is satisfied. In the second case, since a DAG has no cycle, there is no path from $v$ to $s$, so the DFS call on $v$ finishes without discovering $s$. Thus, $v$ again receives its label before $s$, and because labels are handed out in decreasing order, $f(s) < f(v)$.

BFS Approach

Not surprisingly, we can find topological ordering of a graph using BFS as well. Instead of finding the sink vertex each time (i.e. the vertex with outdegree = 0), we find the source vertex (i.e. the vertex with indegree = 0) each time in BFS. The basic steps to compute the topological ordering follows:

Let $s$ be a source vertex of $G$
set $f(s) = 1$
recurse on $G - \{s\}$

We omit the proofs of the properties for BFS as they mirror the ones for DFS. We can use BFS to implement the procedure above:

There are several points we need to note here:

In the basic version, we pick a source vertex of $G$ each time and assign the label. Done naively, this recomputes the indegree of every node in each step to find the source vertices, even though removing a source only changes the indegrees of its successors. To avoid this duplicated calculation, we compute the indegrees once and use a queue to hold the vertices whose indegree has dropped to 0.
Running time: $O(|E|+|V|)$ (We visit each edge once and each node twice: once to compute the initial indegree and once to assign the label)
Correctness: this BFS produces a topological ordering because for an edge $(s,v)$, we always visit $s$ before visiting $v$. Until $s$ is removed, $v$ still has an incoming edge, so $v$ cannot be a source vertex. Since we assign the labels in increasing order, $f(s) < f(v)$. Thus, we produce a topological ordering.

As one can see, the difference between DFS and BFS is this: for DFS, we start with a sink vertex and assign the labels of the vertices in decreasing order (i.e., from $n$ down to $1$). For BFS, we start with a source vertex and assign the labels of the vertices in increasing order (i.e., from $1$ up to $n$).

Reference

"Data Structures and Algorithm Analysis in C++, 4th Edition" by Mark A. Weiss (we use MAW (cpp) for short in the future)
"Elements of Programming Interviews: The Insiders' Guide" by Adnan Aziz, Tsung-Hsien Lee, and Amit Prakash, p.342 - 346 (we use ATA for short in the future)
Topological Sort on Coursera

Flash-based SSD Basics

2018-07-01T22:00:00+08:00

Introduction
Raw Flash
From Raw Flash to Flash-Based SSDs
Summary

Introduction

In this post, we highlight the key points from Chapter 44: Flash-based SSDs from OSTEP.

A solid-state storage device (SSD) is built out of transistors (like memory and processors), but it can retain information without power. First, we introduce the physical properties of raw flash. Then, we focus on building a persistent storage device (i.e., a NAND-based flash SSD) based on those physical properties.

Please note that I organize the post based on my own understanding of the chapter: the organization may not reflect the actual organization of the chapter. I also add a few illustrations to reflect the concepts in the chapter. Black & white pictures are taken from the book while the color ones are the drawings on my own.

Raw Flash

In this section, we introduce the raw flash using the bottom-up approach by first introducing the basic building block: transistor (i.e. cell). Then, we organize those cells into flash planes, which consist of physical blocks and pages. Finally, we introduce the basic operations supported by the raw flash and possible performance and reliability considerations when we build a persistent storage device.

Storing a Single Bit

Single-level cell (SLC) flash: a transistor (cell) stores a single bit (1 or 0)
Multi-level cell (MLC) flash: a transistor (cell) stores two bits (00, 01, 10,11)
Triple-level cell (TLC) flash: a transistor (cell) encodes 3 bits.

From Bits to Banks/Planes

Flash chips are organized into banks or planes, which consist of a large number of cells.

A bank (plane) is accessed in two different sized units:
- blocks (erase blocks): 128KB or 256KB
- pages: 4KB

Basic Flash Operations

Three low-level flash chip operations:
- Read (a page):
  - read any page by specifying page number
  - random access device: being able to access any location uniformly quickly (regardless of location on the device and the location of previous request)
- Erase (a block):
  - Before writing to a page within a flash, the device needs to first erase the entire block the page lies within
  - Need to make sure we save the contents of the to-be-erased block before executing the erase
  - The entire block is reset and each page within it is ready to be programmed
- Program (a page):
  - Modify the page and write the modification to flash

Note

We use $\texttt{INVALID}$, $\texttt{ERASED}$, and $\texttt{VALID}$ to represent the three states of a page (abbreviated $\texttt{I}$, $\texttt{E}$, and $\texttt{V}$). One should note that writing to a page in state $\texttt{E}$ doesn't cause the entire block to be erased. However, to write to a page in state $\texttt{V}$, the device requires the whole block to be erased first.

Flash Performance and Reliability

wear out: when a flash block is erased and programmed, it slowly accrues a little bit of extra charge. Over time, as that extra charge builds up, it becomes increasingly difficult to differentiate between a 0 and a 1. At the point where it becomes impossible, the block becomes unusable.
disturbance: when accessing a particular page within a flash, it is possible that some bits get flipped in neighboring pages; such bit flips are known as read disturbs or program disturbs, depending on whether the page is being read or programmed, respectively.

From Raw Flash to Flash-Based SSDs

Goal: the standard storage interface is a block-based one, where blocks (sectors) of size 512 bytes can be read or written, given a block address. Thus, the job of a flash-based SSD is to provide that standard block interface atop the raw flash chips inside it.

The organization of Flash-based SSD

SSD consists of:
- Flash chips: for persistent storage
- Volatile memory (SRAM): caching, buffering data, mapping tables
- Control logic (FTL)

Build Flash Translation Layer (FTL)

Flash Translation Layer (FTL):
- Translate client reads & writes on logical blocks -> flash read, erase, program on physical blocks & pages
- performance:
  - Use multiple flash chips in parallel
  - Reduce write amplification: the total write traffic (in bytes) issued to the flash chips by the FTL divided by the total write traffic (in bytes) issued by the client to the SSD.
- reliability:
  - wear leveling (prevent wear out): spread writes across the blocks of the flash as evenly as possible, ensuring that all of the blocks of the device wear out at roughly the same time;
  - Prevent disturbance: program pages within an erased block in order, from low page to high page
Direct-mapped FTL:
- 1-1 mapping between logic page and physical page
  - Read of a logical page $N$ mapped to read of a physical page $N$ directly
  - Overwrite of a logical page $N$ leads to write amplification:
    - Read in the entire block that contains physical page $N$
    - Erase the block
    - Program the page $N$ along with the old pages within the block
- Can lead to wear out if the user repeatedly updates the same logical page (e.g., updating the same file system metadata over and over)
Log-structured FTL:
- Upon a write to a logical block $N$, the device appends the write to the next free spot in the currently-being-written-to block.
- Write: the SSD finds a location for the write, usually just picking the next free page; it then programs that page with the block's contents, and records the logical-to-physical mapping in its mapping table.
- Read: subsequent reads use the table to translate the logical block address presented by the client into the physical page number required to read the data.
- Advantages:
  - We avoid the overwrite of the physical page (by always writing to the next free page), which can cause the expensive erase operation and write amplification.
  - The FTL spreads writes across all pages and thereby performs wear leveling, which increases the lifetime of the device.
- Disadvantages:
  - Need to periodically perform garbage collection, which can increase write amplification and reduce performance
  - High cost of in-memory mapping tables (the larger the device, the more memory such tables need)
- crash recovery:
  - Since the mapping table is stored in memory, we may lose it when the device loses power. To handle this, we can store some mapping information in the out-of-band (OOB) area of each page and use it to reconstruct the mapping table in memory.
  - Use logging and checkpointing to speed up recovery.
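The write and read paths of a log-structured FTL described above can be sketched as follows. This is only a minimal model under stated assumptions: the class name `LogStructuredFtl`, the flat array of pages, and the in-memory `unordered_map` are hypothetical choices for illustration, and garbage collection, erase blocks, and crash recovery are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified, hypothetical model of a page-mapped, log-structured FTL.
// Garbage collection and erase blocks are deliberately not modeled.
class LogStructuredFtl {
public:
    explicit LogStructuredFtl(std::size_t numPages) : pages_(numPages) {}

    // Program the next free physical page with the data and record the
    // logical-to-physical mapping. Any older copy becomes garbage.
    void write(int logicalBlock, const std::string& data) {
        assert(nextFree_ < pages_.size());  // log full: GC would be needed
        pages_[nextFree_] = data;
        mapping_[logicalBlock] = nextFree_;
        ++nextFree_;
    }

    // Translate the logical block via the mapping table and read the page.
    // Returns false if the logical block was never written.
    bool read(int logicalBlock, std::string& out) const {
        auto it = mapping_.find(logicalBlock);
        if (it == mapping_.end()) return false;
        out = pages_[it->second];
        return true;
    }

    // Physical page that currently holds the logical block, or -1 if unmapped.
    long physicalPage(int logicalBlock) const {
        auto it = mapping_.find(logicalBlock);
        return it == mapping_.end() ? -1L : static_cast<long>(it->second);
    }

private:
    std::vector<std::string> pages_;                // physical pages, in log order
    std::unordered_map<int, std::size_t> mapping_;  // logical block -> physical page
    std::size_t nextFree_ = 0;                      // next free page in the log
};
```

Writing logical block 100 twice leaves the first physical page behind as garbage and moves the mapping to the newer page, which is exactly the situation garbage collection has to clean up.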

Garbage Collection (dead-block reclamation):
- Garbage: old versions of data scattered around the drive that take up space
  - Ex: immediately follow the picture above, we write(100) with content "c1". The original "a1" is no longer needed, which is considered as garbage. We need to reclaim the physical page that "a1" takes.
- Garbage collection: the process of finding garbage blocks and reclaiming them for future use. We can find a block that contains one or more garbage pages, read in the live (non-garbage) pages from that block, write out those live pages to the log, and (finally) reclaim the entire block for use in writing.
- Determine the dead pages: the physical block contains the logical block addresses it is holding. We can then determine the dead pages by comparing the logical block addresses in the mapping table with the logical block addresses in the physical block (e.g., physical block holds logical block address 2000 but 2000 inside the mapping table pointing to the physical page that is outside of the current physical block. Thus, we know the physical page that holds 2000 inside the physical block is the dead page).
- The ideal candidate for reclamation is a block that consists of only dead pages; in this case, the block can immediately be erased and used for new data, without expensive data migration.
- Reduce GC costs: overprovision the device (adding extra flash capacity)
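The dead-page check described above can be sketched as below. The representation is an assumption made for illustration: `oobLba` stands for the logical block addresses recorded in the out-of-band areas of one physical block, and `mappingTable` is a hypothetical logical-to-physical page map.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical model: oobLba[i] is the logical block address recorded in the
// out-of-band area of the i-th page of one physical block. Every page in the
// block is assumed to have been programmed.
// Returns the physical page numbers of the dead pages in that block.
std::vector<std::size_t> deadPages(
    const std::vector<int>& oobLba,
    std::size_t firstPhysicalPage,
    const std::unordered_map<int, std::size_t>& mappingTable) {
    assert(!oobLba.empty());  // a block holds at least one page
    std::vector<std::size_t> dead;
    for (std::size_t i = 0; i < oobLba.size(); ++i) {
        auto it = mappingTable.find(oobLba[i]);
        // A page is live only if the table still points at this exact page.
        bool live = it != mappingTable.end() &&
                    it->second == firstPhysicalPage + i;
        if (!live) dead.push_back(firstPhysicalPage + i);
    }
    return dead;
}
```

A block for which this function returns every page is the ideal reclamation candidate, since it can be erased without migrating any data.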
Block-Based Mapping:
- Instead of keeping one record per page in the mapping table, we keep one record per block. Doing so will reduce the size of mapping table by a factor of $\frac{Size_{block}}{Size_{page}}$.
- Read: The read of a logical block address is shown in picture below. The whole read process greatly mimics the virtual address translation. The mapping table plays a role as the page table in the virtual memory system (map virtual pages to physical frames).
- Write: if the client writes to logical block 2002 with content $c'$, since by the current mapping, we try to overwrite the physical page with new content, FTL has to perform erase. FTL will read in 2000, 2001, and 2003 and then write out all four logical blocks in a new location (e.g. physical pages 08,09,10,11 with values a, b, c', d), updating the mapping table accordingly and erase the original block. We can transfer the a,b,c',d back to the original block but that will involve another set of writes, which are expensive compared with updating the mapping table record.
- Disadvantage: performance decreases for writes smaller than the physical block size of the device (if a write is exactly the size of a physical block, we can erase the whole block and write directly instead of saving the old data and rewriting it to a new location).
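The address translation under block-based mapping can be sketched like a page-table lookup. The names below (`BlockMappedFtl`, `firstPageOfChunk`) are hypothetical, and the numbers used afterwards (four pages per block, chunk 500 initially stored at physical pages 4 to 7) are assumptions chosen to match the 2000 to 2003 example.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical block-mapped translation: the logical block address is split
// into a chunk number (the high part) and an offset (the low part), much
// like a virtual page number and a page offset.
struct BlockMappedFtl {
    int pagesPerBlock;
    // chunk number -> first physical page of the block holding that chunk
    std::unordered_map<int, int> firstPageOfChunk;

    // Returns the physical page for the logical block address, or -1 if the
    // chunk is not mapped.
    int translate(int lba) const {
        assert(pagesPerBlock > 0 && lba >= 0);
        int chunk = lba / pagesPerBlock;
        int offset = lba % pagesPerBlock;
        auto it = firstPageOfChunk.find(chunk);
        if (it == firstPageOfChunk.end()) return -1;
        return it->second + offset;
    }
};
```

With four pages per block, logical block 2002 falls in chunk 500 at offset 2. Rewriting the chunk to physical pages 8 to 11 only changes the single table entry for chunk 500, after which the same address translates to page 10.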

Hybrid Mapping
- Combine the page-level mapping (enable flexible writing) + block-level mapping (reduce mapping costs)
- FTL structure is shown in the picture below.
- One big challenge in the hybrid mapping FTL is "compaction": we have to move the contents of the log blocks referred to by the log table into the physical blocks referred to by the data table. The motivation is that we want to keep the log table small (i.e., reduce mapping costs). Depending on the contents of the blocks, there are three ways to do this: switch merge, partial merge, and full merge, shown in the picture below.

Note

"compaction" isn't a term used in the chapter. I use the term as a way to indicate the whole moving log block process greatly mimics how the compaction works in the Log-structured merge (LSM) tree. Log blocks can be thought of as "memtable" in the LSM-based key-value store. In addition, when we look for a particular logical block, the FTL will first consult the log table; if the logical block’s location is not found there, the FTL will then consult the data table to find its location and then access the requested data. Also, we need to periodically scan the log table and corresponding log blocks to form blocks pointed by only one block pointer. All these behaviors are similar to how read and compaction work in the LSM-based key-value store.

wear leveling
- Log-structured approach + garbage collection helps with the wear leveling
- Problem: sometimes a block will be filled with long-lived data that does not get overwritten; in this case, garbage collection will never reclaim the block, and thus it does not receive its fair share of the write load.
- Solution: the FTL must periodically read all the live data out of such blocks and re-write it elsewhere, thus making the block available for writing again.

Flash-based SSD performance

The biggest difference in performance, as compared to disk drives, is realized when performing random reads and writes.

Findings in the above table:
- SSD random I/O outperforms HDD random I/O
- SSD sequential I/O is slightly above HDD sequential I/O (i.e. HDD still in the game for the sequential I/O task)
- SSD random writes beat SSD random reads
  - Reason: the log-structured design of many SSDs transforms random writes into sequential ones, which improves performance.
- HDD is still cheaper than SSD

Summary

Overwriting a page requires us to erase the whole block that the page resides in before we can write the page. This naturally introduces the write amplification as we must first move any data we care about to another location.
When designing a persistent storage device based on flash, we need to think about performance (e.g., write amplification) and reliability (e.g., wear out, disturbance).
A log-structured hybrid mapping FTL provides an interface that maps I/O on the logical address space to the physical blocks & pages on the flash chips while maintaining good performance and reliability.

Generalized binary search

2018-06-10T01:24:00+08:00

Introduction
Example
Conclusion
Reference

Introduction

In the earlier post, we introduced the invariant concept, which enables us to solve the binary search problem on the very first try. In this post, we further elaborate on the binary search idea and show how we can use a predicate plus the main theorem to solve more general binary search problems.

Example

Let's consider an example, which utilizes the generalized idea of binary search mentioned in TopCoder's article. The problem we look at is Leetcode 658. Find K Closest Elements: Let $A$ be a sorted array of $N$ values. We want to find the index $j$ such that the elements $A_j,A_{j+1},\dots,A_{j+k-1}$ have the $k$ closest values to the given target value $T$.

The generalization of binary search is done by formalizing how we reduce the search space by half: binary search can be used if and only if for all $x$ in the search space $S$ (i.e., the ordered set), $p(x)$ implies $p(y)$ for all $y > x$ ($p$ stands for some predicate over $S$). The TopCoder article calls this formalization the main theorem. We use this theorem to discard half of the search space. For example, in the most basic binary search problem on an array in ascending order, our predicate $p$ is defined as: is the value at $i$ smaller than X? If the answer is yes, we discard all the values with index smaller than $i$ because, given the ascending order and that A[i] is smaller than X, any value that comes before A[i] is also smaller than X. By the same reasoning, if the answer is no, we discard the second half. We make some observations here:

As you may have noticed, predicate is exactly what we check in the if-statement (e.g. X > A[i]).
If X < A[i], then for any $y > i$, we have X < A[y], which exactly matches the main theorem and that's how we can discard the second half of the array (i.e., search space).
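The main theorem translates into a small reusable template. The sketch below is an illustration, not code from the TopCoder article: `firstTrue` and `lowerBound` are hypothetical names, and the template assumes the predicate is monotone (false, ..., false, true, ..., true) and true at the upper end of the range.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Returns the smallest x in [lo, hi] for which p(x) is true.
// Assumes p(x) implies p(y) for all y > x, and that p(hi) is true.
int firstTrue(int lo, int hi, const std::function<bool(int)>& p) {
    assert(lo <= hi);
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (p(mid)) {
            hi = mid;       // the answer is mid or something to its left
        } else {
            lo = mid + 1;   // everything up to mid is false: discard it
        }
    }
    return lo;
}

// Example use: index of the first element >= x in an ascending array
// (a.size() if there is none). The predicate is "i is past the end or
// a[i] >= x", which is monotone because the array is sorted.
int lowerBound(const std::vector<int>& a, int x) {
    int n = static_cast<int>(a.size());
    return firstTrue(0, n, [&](int i) { return i == n || a[i] >= x; });
}
```

The invariant here is that the answer always lies in $[\text{lo}, \text{hi}]$, which is why `hi` is set to `mid` rather than `mid - 1`.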

Now, let's consider our example. What does it mean for a selected subrange to be optimal (i.e., to hold the $k$ closest values to $T$)? It means that we can neither move the subrange to the right ($|A_{j+k} - T| > |A_j - T|$) nor move the subrange to the left ($|A_{j-1} - T| > |A_{j+k-1} - T|$). In detail, the subrange includes the $k$ closest values to $T$; by moving it to the right, we exclude $A_{j}$ and include $A_{j+k}$. Since the selected subrange is optimal, we must have $|A_{j+k} - T| > |A_j - T|$. Symmetrically, moving it to the left excludes $A_{j+k-1}$ and includes $A_{j-1}$, so optimality requires $|A_{j-1} - T| > |A_{j+k-1} - T|$. Thus, we can formalize our predicate as: for a given index $j$, does $|A_{j+k} - T| > |A_j - T|$ hold? Another piece of information we need for binary search is the invariant. From the problem description we can see that the key to this question is finding $j$. Thus, our invariant is: the index of the first number that is among the $k$ closest values for the given target $T$ (i.e., $j$) is in $[\text{left}, \text{right}]$.

Now, we have to show that the main theorem holds given our predicate formalization. Let's discuss this point in detail:

When $|A_{j+k} - T| > |A_j - T|$ is true:
- If $A$ is sorted in ascending order, then we have three possible cases:
  - $T < A_j < A_{j+k}$. In this case, we have $A_{j+k} - T > A_j - T$. Let $d$ be an integer with range $0 < d < k$, then we have $A_{j+k+d} - T > A_{j+k} - T > A_{j+d} - T > A_{j} - T$. In other words, for any index $i > j$, our predicate holds ($|A_{i+k} - T| > |A_i - T|$). Thus, we can directly discard the second half of the array. Note that we still want to keep $j$ because it might be the optimal $j$ we are looking for.
  - $A_j < T < A_{j+k}$. In this case, we have $A_{j+k} - T > T - A_j$. Then, we have $A_{j+k+d} - T > A_{j+k} - T > T - A_j > T - A_{j+d}$. Then the predicate still holds for $i > j$.
  - $A_j < A_{j+k} < T$. This is impossible given our predicate condition.
- If $A$ is sorted in descending order, we also have three possible cases:
  - $T > A_j > A_{j+k}$. In this case, we have $T - A_{j+k} > T - A_j$, which leads to $T - A_{j+k+d} > T - A_{j+k} > T - A_{j+d} > T - A_j$. Again, our predicate holds for any index $i > j$ and we can discard the second half of the array.
  - $A_j > T > A_{j+k}$. In this case, we have $T - A_{j+k} > A_j - T$. We have $T - A_{j+k+d} > T - A_{j+k} > A_j - T > A_{j+d} - T$. Predicate holds.
  - $A_j > A_{j+k} > T$. Impossible.

Note

The reason we bound $d$ by $k$ (we use $0 < d < k$ above) is that in the extreme case, we have $j = N - k$ (otherwise, we won't have $k$ elements), so it is unnecessary to have $d$ go beyond $k$.

When $|A_{j-1} - T| > |A_{j+k-1} - T|$ is true:
- If $A$ is sorted in ascending order, then we have three possible cases:
  - $T < A_{j-1} < A_{j+k-1}$. Impossible.
  - $A_{j-1} < T < A_{j+k-1}$. In this case, we have $T - A_{j-1} > A_{j+k-1} - T$. Then, let $d$ be an integer between 0 and $j-1$. We have $T - A_{j-1-d} > T - A_{j-1} > A_{j+k-1} - T > A_{j+k-1-d} - T$. Thus, for any index $i \le j$, we have $|A_{i-1} - T| > |A_{i+k-1} - T|$. This suggests that we can discard the first half of the array.
  - $A_{j-1} < A_{j+k-1} < T$. We have $T - A_{j-1} > T - A_{j+k-1}$, which implies $T - A_{j-1-d} > T - A_{j-1} > T - A_{j+k-1-d} > T - A_{j+k-1}$, so again the predicate holds.
- If $A$ is sorted in descending order, then we have three possible cases:
  - $T > A_{j-1} > A_{j+k-1}$. Impossible.
  - $A_{j-1} > T > A_{j+k-1}$. In this case, we have $A_{j-1} - T > T - A_{j+k-1}$, which implies that $A_{j-1-d} - T > A_{j-1} - T > T - A_{j+k-1} > T - A_{j+k-1-d}$. The predicate holds: for every index $i \le j$, we have $|A_{i-1} - T| > |A_{i+k-1} - T|$, which means we can discard the first half of the array and move the subrange to the right.
  - $A_{j-1} > A_{j+k-1} > T$. In this case, we have $A_{j-1} - T > A_{j+k-1} - T$, which implies that $A_{j-1-d} - T > A_{j-1} - T > A_{j+k-1-d} - T > A_{j+k-1} - T$. The predicate holds.

Once we have verified that the predicate satisfies the main theorem, the only thing left is to build the connection between the invariant and the predicate, and to make sure the invariant holds during the loop execution. Let's first list the code:

class Solution {
public:
    vector<int> findClosestElements(vector<int>& arr, int k, int x) {
        int left = 0;
        int right = arr.size() - k;
        while (left < right)
        {
            int mid = left + (right - left) / 2;
            if (fabs(x - arr[mid]) <= fabs(arr[mid+k] - x))
            {
                right = mid;
            }
            else if (fabs(x - arr[mid-1]) > fabs(x - arr[mid+k-1]))
            {
                left = mid + 1;
            }
        }
        return vector<int>(arr.begin() + left, arr.begin() + left + k);
    }
};

Note

Notice that in the code we actually use $|A_{j+k} - T| \ge |A_j - T|$ instead of $|A_{j+k} - T| > |A_j - T|$. The reason is that whenever there is a tie, the smaller elements are always preferred. Consider [1,2,3,4,5] with $k = 4$ and $T = 3$. Then both [1,2,3,4] and [2,3,4,5] are the closest $k$ elements to $T$, and their total distances to $T$ are the same, which is a tie. In this case, we prefer [1,2,3,4]. If we strictly follow the predicate, we end up with [2,3,4,5]. Switching $|A_{j+k} - T| > |A_j - T|$ to $|A_{j+k} - T| \ge |A_j - T|$ still maintains the invariant in the loop because when $|A_{j+k} - T| = |A_j - T|$, shifting the subrange to the right doesn't give any improvement, and by setting right to mid, we still ensure that the optimal $j$ falls inside $[\text{left}, \text{right}]$.

$|A_{j+k} - T| > |A_j - T|$ means we cannot move the subrange to the right to obtain the optimal subrange. We also showed that under this condition, we can discard the second half of the array. mid represents $j$ in our condition, and by not moving the subrange to the right, we are saying that the optimal $j$ has to be at or to the left of mid. This implies that we can safely set $\text{right}$ to mid and still maintain the invariant during the loop. On the other hand, $|A_{j-1} - T| > |A_{j+k-1} - T|$ means that we cannot move the subrange to the left to obtain the optimal subrange. We also showed that this inequality allows us to discard the first half of the array. Since, for the given $j$ (mid), we have $|A_{j-1} - T| > |A_{j+k-1} - T|$, we cannot move the subrange (indicated by $j$ or mid) to the left; we have to move it to the right. Thus, we set $\text{left}$ to mid+1 to narrow down the search space while keeping the invariant intact.

Note

The above code is correct in theory but has a fundamental implementation issue: mid can be 0, which makes arr[mid-1] in else if (fabs(x - arr[mid-1]) > fabs(x - arr[mid+k-1])) an out-of-bounds access. C++ doesn't check array bounds (an out-of-bounds access is undefined behavior), so the above code may still run successfully with certain compilers on certain platforms (LeetCode apparently accepts it). However, problems will surface if you directly translate the above logic to another language. A safe fix is to replace the else if statement with else if ((mid > 0 && fabs(x - arr[mid-1]) > fabs(x - arr[mid+k-1])) || fabs(x - arr[mid]) > fabs(arr[mid+k] - x)), which, as you can see, is redundant and can be optimized. This is exactly what we are going to do next.

One thing to note is that while (left < right) means we haven't found the optimal $j$ yet, which implies that we have to move the subrange either to the left or to the right. This gives us a further opportunity to optimize the above code:

class Solution {
public:
    vector<int> findClosestElements(vector<int>& arr, int k, int x) {
        int left = 0;
        int right = arr.size() - k;
        while (left < right)
        {
            int mid = left + (right - left) / 2;
            if (fabs(x - arr[mid]) <= fabs(arr[mid+k] - x))
            {
                right = mid;
            }
            else
            {
                left = mid + 1;
            }
        }
        return vector<int>(arr.begin() + left, arr.begin() + left + k);
    }
};

In the first version, we check two conditions explicitly and do nothing if neither condition is true. However, as we stated in the previous paragraph, as long as we are still in the while loop, one of those two conditions must be true. In other words, there is no case in which both conditions are false and we are still in the loop. Thus, we can get rid of one of the conditions and use else instead. Another way of thinking about it is that the third, do-nothing case can be folded into the second else if (fabs(x - arr[mid-1]) > fabs(x - arr[mid+k-1])) condition to form an else statement.

There is another optimization proposal I found online, which I don't think is correct:

class Solution {
public:
    vector<int> findClosestElements(vector<int>& arr, int k, int x) {
        int left = 0;
        int right = arr.size() - k;
        while (left < right)
        {
            int mid = left + (right - left) / 2;
            if (x - arr[mid] <= arr[mid+k] - x)
            {
                right = mid;
            }
            else
            {
                left = mid + 1;
            }
        }
        return vector<int>(arr.begin() + left, arr.begin() + left + k);
    }
};

This code returns the wrong answer for the following case: arr = [5,4,3,2,1], x = 2, k = 4. The above solution gives [5,4,3,2], which is wrong because [4,3,2,1] holds the closest elements to 2. To see this, we can invoke the predicate: 5 is 3 units away from 2 but 1 is only 1 unit away from 2 ($|A_j - T| > |A_{j+k} - T|$), which implies we can shift the subrange to the right. A more straightforward way is to simply add up the distances of the elements: [5,4,3,2] has sum 3+2+1+0 = 6 while [4,3,2,1] has sum 2+1+0+1 = 4.
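The counterexample can be reproduced with a small sketch that runs the same binary search with and without the absolute values. The function names `findClosest` and `totalDistance` and the `useAbs` flag are hypothetical helpers introduced only for this comparison, and `std::abs` on integers stands in for `fabs`.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Binary search over the window start index. When useAbs is false, the
// comparison drops the absolute values, as in the proposal quoted above.
std::vector<int> findClosest(const std::vector<int>& arr, int k, int x,
                             bool useAbs) {
    int n = static_cast<int>(arr.size());
    assert(k > 0 && k <= n);
    int left = 0;
    int right = n - k;
    while (left < right) {
        int mid = left + (right - left) / 2;
        int lhs = x - arr[mid];
        int rhs = arr[mid + k] - x;
        if (useAbs) {
            lhs = std::abs(lhs);
            rhs = std::abs(rhs);
        }
        if (lhs <= rhs) {
            right = mid;
        } else {
            left = mid + 1;
        }
    }
    return std::vector<int>(arr.begin() + left, arr.begin() + left + k);
}

// Total distance from every element of a window to x (smaller is closer).
int totalDistance(const std::vector<int>& window, int x) {
    int sum = 0;
    for (int v : window) sum += std::abs(v - x);
    return sum;
}
```

On the descending input, the variant without absolute values compares -3 against -1 and keeps the window at index 0, while the variant with absolute values compares 3 against 1 and moves right.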

Conclusion

We give one example showing the essence of binary search: the main theorem, which is a formalization of how we discard values. The predicate helps us find what to write in the if statement, and the invariant helps us make sure we find the correct value. In this post, we go through a relatively formal proof of the correctness of our predicate. One thing to note is that the proof is in fact an induction: we use $d$ to show that the inequalities hold for any index $i > j$. A nicer but equivalent way is to simply use induction and show that $p(j+1)$ holds given that $p(j)$ holds (what we actually show is that $p(j+d)$ holds given that $p(j)$ holds). Another point worth making is that we can derive the invariant from the predicate: we try to find the index of the first number that is among the $k$ closest values for the given target $T$. This is exactly the first index that gives a "yes" response to our predicate.

Reference

Trie

2018-06-02T22:30:00+08:00

Trie is a classic data structure (Idreos et al., 2018) that is widely used in key-value store (Zhang et al., 2018; Wu et al., 2015). In this post, we describe the basics about the data structure.

Motivation
Data Structure
Implementation
Applications
Reference

Motivation

The main motivation for using trie is that we want to efficiently search for a word in a dataset of strings. We can use hash table or balanced trees for this task. However, trie has its unique advantages:

VS. Hash Table:

Hash table has $O(1)$ time complexity for looking up a key but it is not efficient for the operations:
- Finding all keys with a common prefix. We have to traverse all keys in hash table, which can be $O(n)$ ($n$ is the number of keys inserted). However, trie takes $O(k)$ ($k$ is the length of the prefix).
- Enumerating a dataset of strings in lexicographical order. There is a sorting on all strings (i.e. keys) and thus $O(n\log n)$. However, trie takes $O(n)$ time only.
Search in hash table can be $O(n)$ if there are plenty of hash collisions. However, trie only takes $O(m)$ ($m$ is the key length)
Compared to hash table, trie saves space when storing many keys with the same prefix.

VS. balanced trees:

Search in balanced tree can take $O(m \log n)$ time. However, trie only takes $O(m)$.

Data Structure

To avoid unnecessary complexity, we assume we are working with a collection of strings that consist of only lowercase letters.

A trie node contains two fields:
- An array of $R$ links (links), with each link representing one letter. A link connects two trie nodes together. In our example, we have $R = 26$.
- A boolean variable isEnd, which indicates whether we reach the end of a string. This is needed because if we are searching for a prefix, we should have isEnd = false. On the other hand, if we reach the end of a string, we have isEnd = true.

Note

Please note that isEnd = true doesn't indicate that we are at a leaf node of the trie. The boolean only indicates whether we have reached the end of some string. In Figure 1, the end nodes of "wa" and "wax" are connected with each other. If we require that no string in the dataset is a prefix of another string, we then don't need the isEnd boolean variable.

Insert a key into trie:
- We insert a key by searching into the trie. We start from the root and search for the link that corresponds to the first key character. There are two cases:
  - A link exists. Then we move down the tree following the link to the next child level. The algorithm continues with searching for the next key character.
  - A link does not exist. Then we create a new node and link it with the parent's link matching the current key character. We repeat this step until we encounter the last character of the key, then we mark the current node as an end node and the algorithm finishes.
- Time complexity: $O(m)$ (In each iteration of the algorithm, we either examine or create a node in the trie till we reach the end of the key. This takes only $m$ operations.)
- Space complexity: $O(m)$ (In the worst case, the newly inserted key doesn't share a prefix with the keys already inserted in the trie. We have to add $m$ new nodes, which takes $O(m)$ space.)
Search for a key in a trie:
- Each key is represented in the trie as a path from the root to an internal node or a leaf. We start from the root with the first key character. We examine the current node for a link corresponding to the key character. There are two cases:
  - A link exists. We move to the next node in the path following this link, and proceed to search for the next key character.
  - A link does not exist. If there are no key characters left and the current node is marked as isEnd = true, we return true. Otherwise, there are two possible cases, and in each of them we return false:
    - There are key characters left, but it is impossible to follow the key path in the trie, and the key is missing.
    - No key characters left, but current node is not marked as isEnd. Therefore the search key is only a prefix of another key in the trie.
- Time complexity: $O(m)$ (In each step of the algorithm we search for the next key character. In the worst case the algorithm performs $m$ operations.)
- Space complexity: $O(1)$
Search for a key prefix in a trie:

The approach is very similar to the one we used for searching for a key in a trie. We traverse the trie from the root until there are no characters left in the key prefix or it is impossible to continue the path in the trie with the current key character. The only difference from the key search algorithm above is that when we come to the end of the key prefix, we always return true. We don't need to consider the isEnd mark of the current trie node, because we are searching for a prefix of a key, not for a whole key.
- Time complexity: $O(m)$
- Space complexity: $O(1)$

Implementation

We implement the trie data structure in C++ here.
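As a minimal sketch of the operations described above (not necessarily identical to the linked implementation), the trie can be written as follows. It assumes keys contain only the letters 'a' to 'z'; the method names `insert`, `search`, and `startsWith` and the helper `walk` are illustrative choices.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

class Trie {
    static constexpr int R = 26;  // alphabet size: 'a'..'z'

    struct Node {
        std::array<std::unique_ptr<Node>, R> links;  // one link per letter
        bool isEnd = false;                          // a key ends at this node
    };

    static int index(char c) {
        assert(c >= 'a' && c <= 'z');  // only lowercase letters are supported
        return c - 'a';
    }

    // Follow the links for every character of s; nullptr if the path breaks.
    const Node* walk(const std::string& s) const {
        const Node* node = &root_;
        for (char c : s) {
            node = node->links[index(c)].get();
            if (node == nullptr) return nullptr;
        }
        return node;
    }

    Node root_;

public:
    // Walk down the trie, creating missing nodes, then mark the end node.
    void insert(const std::string& key) {
        Node* node = &root_;
        for (char c : key) {
            std::unique_ptr<Node>& link = node->links[index(c)];
            if (!link) link.reset(new Node());
            node = link.get();
        }
        node->isEnd = true;
    }

    // A whole key is present only if its path exists and ends at an end node.
    bool search(const std::string& key) const {
        const Node* node = walk(key);
        return node != nullptr && node->isEnd;
    }

    // A prefix is present as soon as its path exists; isEnd is ignored.
    bool startsWith(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }
};
```

After inserting "wa" and "wax", searching for "w" fails because its node is not an end node, while the prefix query for "w" succeeds.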

Applications

Trie is useful when we want to search for strings character by character, for example by prefix:

Autocomplete
Spell checker
IP routing (Longest prefix matching)
T9 predictive text
Solving Boggle
Huffman Codes ($\S$10.1.2 in MAW(cpp))

Note

We use a sequence of 1 and 0 to represent character in Huffman Codes. Thus, $R = 2$ (e.g. 1 and 0). However, to figure out the 0-1 encoding of each character, we cannot simply insert the character into trie. We should use Huffman algorithm instead.

Reference

Leetcode article on Trie

Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S Kester, and Demi Guo. The data calculator: data structure design and cost synthesis from first principles and learned cost models. In Proceedings of the 2018 International Conference on Management of Data, 535–550. ACM, 2018. ↩

Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. Lsm-trie: an lsm-tree-based ultra-large key-value store for small data. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, 71–82. USENIX Association, 2015. ↩

Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. Surf: practical range query filtering with fast succinct tries. In Proceedings of the 2018 International Conference on Management of Data, 323–336. ACM, 2018. ↩

"Weighted Voting for Replicated Data"

2018-05-15T12:20:00+08:00

Problem
Design Assumptions
Algorithm
Transactions & Consistency
Remarks
Reference

Problem

Design an algorithm to maintain the replicated data:

Efficiently update the replicas
Efficiently communicate with the replicas (i.e., read) to get the latest update

Design Assumptions

Each replicated object requires a version number
No version number change during a transaction

Note

The paper is written for transactions. However, the quorum idea applies universally. Thus, I omit some design assumptions related to transactions, which can be checked here.

Algorithm

Weighted Voting

Scene: a file is replicated across a set of replicas. We now need to read/write from this set of replicas to get/write the latest copy of the file.

Each replica (i.e., server, "Representative" in paper) gets M votes
Extra read-only copies get 0 votes (i.e., Weak Representatives)
Each file is assigned K votes
We require R + W > K when handling each file
- R: the number of replicas we need to read before replying to the clients
- W: the number of replicas we need to write before replying to the clients
- The requirement guarantees at least one overlapping replica between any read set and any write set
  - Because of this overlap, every read always sees the latest write
To read a file from a set of replicas, we gather R votes from the set of replicas
- Among R reads, we use the version number to detect which is the latest copy of the data and return to the client
To write a file, we write to a set of replicas whose total votes equal W (e.g., W=3, Replica 1 has 2 votes, Replica 2 has 0 votes, and Replica 3 has 1 vote; then we need to write to Replica 1 and 3 to meet the W requirement)
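The read and write rules above can be sketched as follows. Everything here is a simplified illustration rather than the paper's algorithm: the `Replica` struct, the function names, and the `order` parameter (the hypothetical order in which replicas respond) are assumptions. In particular, the write derives its new version number from the write quorum alone, which is only safe when any two write quorums also overlap, an extra assumption not stated in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Replica {
    int votes;         // 0 for a weak representative
    int version;       // version number of the stored copy
    std::string data;
};

// Walk the replicas in the given contact order and collect voting replicas
// until at least `needed` votes are held. Returns an empty vector if the
// quorum cannot be gathered.
std::vector<std::size_t> gatherQuorum(const std::vector<Replica>& replicas,
                                      const std::vector<std::size_t>& order,
                                      int needed) {
    assert(needed > 0);  // the R = 0 shortcut is not modeled here
    std::vector<std::size_t> quorum;
    int votes = 0;
    for (std::size_t idx : order) {
        if (votes >= needed) break;
        if (replicas[idx].votes == 0) continue;  // weak copies carry no votes
        quorum.push_back(idx);
        votes += replicas[idx].votes;
    }
    if (votes < needed) quorum.clear();
    return quorum;
}

// Read: gather R votes and return the copy with the highest version number.
bool readFile(const std::vector<Replica>& replicas,
              const std::vector<std::size_t>& order, int R, std::string& out) {
    std::vector<std::size_t> quorum = gatherQuorum(replicas, order, R);
    if (quorum.empty()) return false;
    std::size_t latest = quorum[0];
    for (std::size_t idx : quorum)
        if (replicas[idx].version > replicas[latest].version) latest = idx;
    out = replicas[latest].data;
    return true;
}

// Write: gather W votes, then install the data under a new version number.
bool writeFile(std::vector<Replica>& replicas,
               const std::vector<std::size_t>& order, int W,
               const std::string& data) {
    std::vector<std::size_t> quorum = gatherQuorum(replicas, order, W);
    if (quorum.empty()) return false;
    int newest = 0;
    for (std::size_t idx : quorum)
        if (replicas[idx].version > newest) newest = replicas[idx].version;
    for (std::size_t idx : quorum) {
        replicas[idx].version = newest + 1;
        replicas[idx].data = data;
    }
    return true;
}
```

With three one-vote replicas and R = W = 2, a write that reaches replicas 0 and 1 is still visible to a read that contacts replicas 2 and 1, because replica 1 is in both sets and carries the higher version number.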

Quorum-based Reads and Writes

A quorum-based system (e.g., Dynamo) is one special case of Weighted Voting: M = 1; K = N (i.e., the number of replicas in the system):

All reads go to R replicas
All writes go to W replicas
We require R + W > N

Tuning & Examples

R = 1 $\rightarrow$ reads are efficient, writes are slow (every replica has to be updated)
W = 1 $\rightarrow$ writes are efficient, reads are slow (every replica has to be read)

Example:

Let's consider Example 1 in the figure above. Representative 1 gets 1 vote and the other two get 0 votes. Replicas with 0 votes are weak representatives, which are read-only (i.e., local cached copies). We have R = 1, W = 1, K = 1 in this example. To read, we have to read Representative 1 because we need 1 vote to satisfy R = 1. Likewise, to write, we need to write to Representative 1 for the same reason. In this example, Representative 1 can be the server in a client-server architecture (e.g., NFS), and all reads and writes have to go to the server. However, we can also set R = 0, in which case we can read from the local cached copy directly (Representatives 2 & 3 have 0 votes, which satisfies the R requirement). But, in this case, the weighted-voting algorithm doesn't guarantee that we get the latest copy of the file.

Giving each replica (server) one vote: decentralized quorum system with high availability, low performance
Giving one replica (server) all the votes: centralized system with high performance, low availability

Transactions & Consistency

Each read or write is an atomic, isolated operation at each copy
While the read is going on, there is no other writer at that copy (similarly for writes)
Transactional isolation:
- lock all files the tx wants to read/write; Perform reads/writes; Unlock
- guarantees serializable transactions
- Obtaining the locks has to be done with a total order, otherwise deadlock is possible
- A tx can hold locks for a max time period
  - to prevent certain transactions from holding locks too long while others are waiting to obtain them

Note

On total ordering, we have seen partial ordering in Lamport's logical clock. However, partial ordering allows the existence of concurrent events. To turn a partial ordering into a total ordering, we need to add "comparability", which means that for any two events, we can tell their order (no concurrent events allowed). Lamport uses the PID to solve this. In weighted voting, we enforce a total ordering on locks to prevent deadlock. However, we don't enforce a total ordering on transactions because operations in transactions can be interleaved while still guaranteeing serializability (i.e., serial consistency). We could enforce a total ordering on transactions, but we could not achieve the best performance in that case.

Three locks used: read lock, intention-to-write lock, commit lock
- Unlike a write lock, an intention-to-write lock is compatible with read locks because, under serial consistency, all of a transaction's writes appear to occur at transaction commit time. Thus, a write lock is less ideal because we don't need a lock that prevents reads at the very beginning of the transaction.
- Writes appear to occur when they are issued, but in fact are buffered until commit time by the stable file system.
- At commit time I-Write locks are converted to Commit locks, and the writing actually takes place.

Note

Very interesting part of the paper on fine-grained locking management to improve concurrency of the system.

Remarks

The paper doesn't have a formal proof that R + W > K guarantees at least one overlapping replica between R and W. I think it would be fun to fill this gap:

Let's suppose there are $N$ servers and each server $i$ holds $k_i$ votes with $k_i > 0$. Then

$$ \begin{eqnarray*} R & = & \sum_{i=j}^m k_i \\ W & = & \sum_{i=h}^t k_i \\ K & = & \sum_{i=1}^N k_i \end{eqnarray*} $$

In words, we read server $j$ to $m$ to satisfy R votes requirement and we write server $h$ to $t$ to satisfy W votes requirement. Note that $j$ to $m$, for example, doesn't enforce that we have to pick servers in sequential order. We can always group the servers we pick for R together and give them the sequential numbering.

Now, let's assume there is no overlapping replica between R and W. That means, $\{j \dots m\} \cap \{h \dots t\} = \emptyset$. Then we have

$$ \begin{eqnarray} \sum_{i=j}^m k_i + \sum_{i=h}^t k_i & > & \sum_{i=1}^N k_i \\ \sum_{i=1}^N k_i - \sum_{i=h}^t k_i - \sum_{i=j}^m k_i & < & 0 \\ \sum_{i \in \{1 \dots h-1\} \cup \{t+1 \dots N\}} k_i - \sum_{i=j}^m k_i & < & 0 \label{1} \\ \sum_{i \in (\{1 \dots h-1\} \cup \{t+1 \dots N\}) \setminus \{j \dots m\}} k_i & < & 0 \label{2} \end{eqnarray} $$

From equation $\eqref{1}$ to equation $\eqref{2}$: since the two sets of servers do not intersect, every server in $\{j \dots m\}$ lies in $\{1 \dots h-1\} \cup \{t+1 \dots N\}$, so subtracting its votes simply removes it from the first sum. In other words, no server in $\{j \dots m\}$ can be in $\{h \dots t\}$, because we assumed there is no overlap. The last inequality says that the total votes of the servers that belong to neither set is smaller than 0, which is a contradiction: we always assign positive votes to each server, so that sum is at least 0.
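
The claim can also be checked by brute force on a small example. The sketch below is only an illustration: the vote assignment and the helper names `quorums` and `always_overlap` are made up here, and a quorum is taken to be any set of servers whose votes reach the threshold.

```python
from itertools import combinations

def quorums(votes, threshold):
    """Yield every set of servers whose votes sum to at least `threshold`."""
    servers = range(len(votes))
    for size in range(1, len(votes) + 1):
        for group in combinations(servers, size):
            if sum(votes[i] for i in group) >= threshold:
                yield set(group)

def always_overlap(votes, r, w):
    """True if every read quorum shares a server with every write quorum."""
    write_quorums = list(quorums(votes, w))
    return all(rq & wq for rq in quorums(votes, r) for wq in write_quorums)

votes = [2, 1, 1, 1]                      # hypothetical vote assignment, K = 5
print(always_overlap(votes, r=2, w=4))    # R + W = 6 > K
print(always_overlap(votes, r=2, w=3))    # R + W = 5 = K
```

With R + W = K the check fails because server 0 alone (2 votes) is a read quorum and servers 1 to 3 (3 votes) form a disjoint write quorum.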

Reference

"The Andrew File System (AFS)"

2018-05-02T23:30:00+08:00

Problem
Design assumptions
Design
NFS vs. AFS
Remarks
Reference

Problem

Design a distributed file system that can scale, i.e., each server can support as many clients as possible.

Design assumptions

Most files were not frequently shared, and accessed sequentially in their entirety.
System is used for casual usage (e.g., when a user logs into a different client, they expect some reasonable version of their files to show up there.)
- Not for concurrent access & updates scenario

Design

Cache:
- whole-file caching:
  - AFS caches whole files on the local disk of the client machine that is accessing the file

    When a client calls open() on a file, the entire file (if it exists) is fetched from the server and stored in a file on the local disk. Subsequent application read() and write() operations are redirected to the local file system where the file is stored; thus, these operations require no network communication and are fast.
- Use client memory to cache blocks of the file when it is accessed locally
- On later accesses to the file, contact the server (using the TestAuth protocol message) to check whether the client can use its cached copy (i.e., the file has not been modified since it was cached)
  - Advantage: no network transfer of the file
  - Disadvantage: too many server contacts just to verify that cached files are unmodified
    - sol: use callback
- cache both directories and file contents
  - motivation: server spends much CPU time traversing directories
  - client caches and requests callback to directories along the way to the target file
  - The sequential access assumption makes this technique work (e.g., accessing files within the same cached directory)
  - Much effort spent on the client side (path traversing & cache) alleviates the load for server
Callback:
- a way to reduce number of client/server interactions (mainly for TestAuth message verification)
- A callback is simply a promise from the server to the client that the server will inform the client when a file that the client is caching has been modified
- client assumes cached files are valid until the server tells it otherwise

Note

The idea of callback vs. TestAuth message is analogous to interrupt vs. polling
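
A minimal sketch of the callback idea, not the actual AFS protocol: the `Server` and `Client` classes and their method names are hypothetical. The server remembers which clients cache a file and "breaks" their callbacks when another client stores a new version; a client trusts its cache until that happens.

```python
class Server:
    def __init__(self):
        self.files = {}        # path -> contents
        self.callbacks = {}    # path -> set of clients holding a callback

    def fetch(self, client, path):
        """Return the whole file and promise to notify `client` on change."""
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, writer, path, data):
        """Install new contents, then break callbacks held by other clients."""
        self.files[path] = data
        for client in self.callbacks.get(path, set()) - {writer}:
            client.break_callback(path)
        self.callbacks[path] = {writer}


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # path -> contents, valid while a callback is held
        self.fetches = 0

    def open(self, path):
        if path not in self.cache:           # no valid callback: contact the server
            self.cache[path] = self.server.fetch(self, path)
            self.fetches += 1
        return self.cache[path]              # otherwise no network traffic

    def close(self, path, data):
        """Flush-on-close: ship the whole file back to the server."""
        self.cache[path] = data
        self.server.store(self, path, data)

    def break_callback(self, path):
        self.cache.pop(path, None)


server = Server()
server.files["/notes"] = "v1"
a, b = Client(server), Client(server)
a.open("/notes")
a.open("/notes")                 # second open is served from the cache
b.open("/notes")
b.close("/notes", "v2")          # breaks a's callback
print(a.open("/notes"), a.fetches)
```

Client `a` contacts the server only twice: once for the initial fetch and once after its callback was broken.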

Cache consistency:
- consistency between processes on different machines:
  - update visibility sol: flush-on-close
  - cache staleness sol: server-initiated cache invalidation ("break" callback)
- consistency between processes on the same machine:
  - update visibility sol: writes to a file are immediately visible to other local processes (i.e., a process does not have to wait until a file is closed to see its latest updates) (same as UNIX semantics: tail a log file and can see the writes in real time)
Last writer wins (i.e., last closer wins) for concurrent modification of the same file
- The result is a file that was generated in its entirety by one client or the other (unlike NFS, where a file can end up containing blocks from different clients)
Load balancing:
- use volumes, which an administrator could move across servers to balance load
Build the server with a thread per client instead of a process per client to reduce overhead (e.g., context switching)
Crash Recovery:
- Clients:
  - Client sends a TestAuth message to validate its cache after recovery
- Servers:
  - callbacks are kept in memory and are lost in a crash -> clients need to revalidate their cached files
  - sol:
    - having the server send a message to each client after recovery to let clients start to validate their cache
    - clients send heartbeat message periodically to server
Even though the cache is on disk, AFS can use the client-side OS memory caching infrastructure to improve performance
AFS provides a true global namespace to clients, thus ensuring that all files are named the same way on all client machines.
- clients in NFS can mount NFS servers any way they like -> hard to administer
AFS has security and access-control lists

NFS vs. AFS

For large-file (greater than memory) sequential re-read, AFS > NFS:
- AFS uses the local disk to cache the entire file
- NFS can only cache blocks in memory and has to refetch the file on re-read
For access to a small subset of data within a large file, NFS > AFS:
- AFS has to fetch the entire file and send it back after modification
- NFS only reads the blocks that need to be modified
- AFS is not good for periodically appending to a log (small writes that add small amounts of data to an existing large file)

| Aspect | NFS | AFS |
| --- | --- | --- |
| Cache unit | block of a file | whole file |
| Cache location | memory | local disk |
| Cache strategy | cache blocks only | cache directories and files |
| Cache invalidation | polling (issue GETATTR) | callback |
| Concurrent update of the same file | blocks flushed to the server during update | last writer wins |
| Crash recovery | server crash is unnoticeable | complex crash recovery |
| Namespace | namespace is arbitrary across clients | single namespace for all clients |

Remarks

Several commonly-seen design techniques:
- Force the clients to spend much more effort (cache directories and request callbacks) to reduce load on the server
  - similar to techniques that mitigate DDoS attacks in security
  - Mining in Bitcoin
Cache consistency in a file system is not enough to handle access to a file from multiple clients (i.e., concurrent access)
- Need to implement explicit file-level locking on top of the file system
- Need an extra mechanism to handle conflicts (e.g., concurrent updates):
  - Google Docs uses operational transformation to resolve conflicts
Dropbox is inspired by AFS
The scalability of AFS is measured in terms of the number of clients that a server can support. However, if we think about scalability in terms of the number of servers, NFS wins out due to its stateless protocol and simple crash recovery

Reference

"Petal: Distributed Virtual Disks"

2018-05-02T02:20:00+08:00

Problem
System design
Reference

Problem

Design a distributed storage system that is easy-to-use and easy-to-administer.

Easy-to-use:
- Simple Interface
- availability / fault tolerance
- transparency
- consistency
Easy-to-administer:
- crash recovery (no manual intervention needed)
- scalability (scale up with the workload without performance degradation)
- add / remove nodes
- load balancing
- monitoring

System design

Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks

Note

Petal is different from a distributed file system (NFS) in the sense that clients in NFS can directly access the server's physical disks. However, Petal hides the physical resources behind a layer of abstraction. The benefits of this abstraction:
- scales to a large size and provides reliable data storage over the long run
- supports heterogeneous clients and client applications (e.g., different file systems) (Figure 2)

Clients (FS or DB) view Petal as a collection of virtual disks (Figure 1)
Disk-like interface: data are read from and written to Petal virtual disks in blocks (i.e., the basic transfer unit) through RPC

Software modules:
- liveness module:
  - ensures that all servers in the system will agree on the operational status (running or crashed) of each other
  - majority consensus (Paxos) + heartbeat
- global state module:
  - include information: current members of system + currently supported virtual disks
  - consistently maintain information -> Paxos
Virtual disk address -> physical disk address:
- virtual disk address form: (virtual disk identifier, offset)
- virtual disk identifier -> global map identifier (via virtual disk directory)
- global map identifier decides the server responsible for translating the offset
- global map identifier & offset -> disk identifier, disk offset (via physical map on each server)

Note

Separation of the translation data structures into global and local physical maps:
- keeps the bulk of the mapping information local (minimizes the information kept globally, which is replicated and thus hard to update)
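
The three translation steps can be sketched with plain dictionaries. Everything concrete here is an assumption made for illustration: the identifiers, the `BLOCK` size, and especially the round-robin rule used to pick the server, which the notes above do not specify.

```python
BLOCK = 64 * 1024   # hypothetical granularity at which offsets map to servers

# Virtual disk directory: virtual disk id -> global map id (replicated everywhere).
vdir = {"vdisk-A": "gmap-1"}

# Global map: the tuple of servers a virtual disk spans (replicated, immutable).
gmaps = {"gmap-1": ("s0", "s1", "s2")}

# Physical maps: one per server, kept local.
# (global map id, block-aligned offset) -> (disk id, disk offset)
pmaps = {
    "s0": {("gmap-1", 0): ("disk0", 4096)},
    "s1": {("gmap-1", BLOCK): ("disk3", 0)},
    "s2": {},
}

def translate(vdisk_id, offset):
    """Follow the three steps: directory -> global map -> physical map."""
    gmap_id = vdir[vdisk_id]
    servers = gmaps[gmap_id]
    # Assumed placement rule: stripe blocks round-robin over the servers.
    server = servers[(offset // BLOCK) % len(servers)]
    block_start = offset - offset % BLOCK
    disk, disk_offset = pmaps[server][(gmap_id, block_start)]
    return server, disk, disk_offset + offset % BLOCK

print(translate("vdisk-A", BLOCK + 100))
```

Only `vdir` and `gmaps` would need to be replicated consistently; each entry of `pmaps` lives on a single server, which is the point of the split.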

Global map:
- One global map per virtual disk that specifies the tuple of servers spanned by the virtual disk
- immutable -> create a new global map to change a virtual disk's tuple of servers / redundancy scheme
Backup:
- copy-on-write to create exact copy of a virtual disk at a specified point in time
- Use epoch-number as version number
- Creating a snapshot that is consistent at the client application level requires pausing the application
- Can also take a "crash-consistent" snapshot, which is later recovered by the application-specific recovery protocol
Add server:
- add to the membership of the Petal
- adjust liveness module to incorporate new server
- virtual disk reconfiguration (reconfigure existing virtual disks to use new resources -> data redistribution)
Virtual disk reconfiguration (data redistribution):
- data redistribution can take hours to finish (it should not compete for network & disk bandwidth with serving reads/writes)
- Basic steps:
  - create a new global map with redundancy scheme + server mapping
  - change all virtual disk directory entries that refer to the old global map to refer to the new one
  - redistribute data to servers according to new global map
    - start with the most recent epoch that has not yet been moved (so that a READ does not return old data)
    - need to read/write during redistribution:
      - READ: try the new global map first, then the old global map
      - WRITE: use new global map
- Refinement:
  - avoid changing the server mapping for the entire virtual disk before any data is moved (which makes READs miss under the new global map)
  - sol:
    - break virtual disk's address into: old, new, fenced
    - Requests to old/new use old/new global maps
    - Use "Basic steps" for fenced only
    - Once we have relocated everything in the fenced region, it becomes new region and we fence another part of the old region
  - tricks:
    - keep the relative size of the fenced region small
    - construct fenced region using small non-contiguous ranges distributed throughout the virtual disk (not single contiguous region b/c fenced region may be heavily used)

Data access and recovery:
- Use chained-declustered data access and recovery modules (chained-declustering):
  - Two copies of each block of data are always stored on neighboring servers
    - If server 1 fails, servers 0 and 2 will share server 1's read load; server 3 will not see a load increase
    - Can offload a server's load to its neighboring servers
    - similar to consistent hashing
- Dynamic load balancing scheme:
  - each client keeps track of the number of requests it has pending at each server and always sends read requests to the server with the shortest queue
  - Works when most of the requests come from a few clients (not for many clients with occasional requests)
- Tolerate site failures:
  - place all the even-numbered servers at one site and all the odd-numbered servers at another site (less reliable since data on a given server is replicated on neighboring servers only)
- one of two copies of each data block is denoted the primary. Rest are secondary.
- READ:
  - Read from either primary or secondary
  - Clients retry on failure
- WRITE:
  - Use primary always
  - Mark the data as busy. The primary simultaneously sends write requests to its local copy and the secondary copy. When both requests complete, the busy bit is cleared and the client that issued the request is sent a status code indicating the success or failure of the operation.
  - Optimization:
    - write-ahead-logging with group commits (batch the busy bits update)
    - cache the busy bits (avoid disk I/O to set busy bits)
  - Primary fail:
    - the secondary marks the data element as stale on stable storage before writing it to its local disk. The server containing the primary copy will eventually have to bring all data elements marked stale up-to-date during its recovery process.
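
A small sketch of the chained-declustering placement described above, under simplifying assumptions that are mine: block `b` has its primary on server `b mod n` and its secondary on the next server, and reads of a block are split evenly between its live copies. The function names are hypothetical.

```python
def replicas(block, n_servers):
    """Primary on server block mod n, secondary on the next server in the chain."""
    primary = block % n_servers
    return primary, (primary + 1) % n_servers

def read_load(n_servers, n_blocks, failed=frozenset()):
    """Expected reads per live server when each block is read once."""
    load = {s: 0.0 for s in range(n_servers) if s not in failed}
    for block in range(n_blocks):
        live = [s for s in replicas(block, n_servers) if s not in failed]
        for s in live:
            load[s] += 1.0 / len(live)   # split reads between the live copies
    return load

print(read_load(4, 8))                # balanced: every server carries 2.0
print(read_load(4, 8, failed={1}))    # servers 0 and 2 absorb server 1's share
```

With server 1 down, servers 0 and 2 each go from 2.0 to 3.0 while server 3 stays at 2.0, matching the neighbor-only load shift noted in the list.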
Petal's limitation:
- Demanding network requirements (uses a Digital ATM network)
- Petal's use of the virtual disk abstraction adds an additional level of overhead, and can prevent application-specific disk optimizations that rely on careful placement of data.

Reference

Petal: Distributed Virtual Disks

"Sun's Network File System (NFS)"

2018-05-01T02:20:00+08:00

Problem
System designs
Remarks
Reference

Problem

Design a distributed file system with transparent access to files from clients

System designs

the server stores the data on its disks, and clients request data through well-formed protocol messages
Architecture advantages:
- easy sharing of data across clients
- centralized administration (e.g., backup is done on a few servers instead of many clients)
- security (put server behind firewall)
Transparency:
- Location transparency: file name does not include name of the server where the file is stored
- Implemented using NFS Mount Protocol:
  - Mount remote directories as local directories
  - Maintain a Mount table with (directory, server) mapping

Clients talk to server using RPC:
- Use RPC to forward every file system request; remote server executes each request as a local request; server responds back with result (Example: OSTEP Figure 49.5)
- Advantage: server provides a consistent view of the file system to clients
- Disadvantage: performance (use cache)
Crash Recovery:
- goal: simple and fast server crash recovery
- Use a stateless protocol (NFSv2): the server doesn't keep track of anything about what is happening at each client
- Stateful: the server maintains a mapping from a file descriptor (an integer) to the actual file (the mapping is lost after a crash)
- Stateless: file handle (a unique identifier for each directory and file).
  - Every client RPC call needs to pass a file handle
  - Server returns a file handle whenever one is needed (e.g., mkdir)
Server failure & Message loss:
- Client retries the request (READ, WRITE are idempotent in NFS)
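
A sketch of why retrying is safe, assuming a toy in-memory server (the `StatelessServer` class, the handle `"fh-42"`, and `call_with_retry` are all hypothetical). Each request carries the file handle and the offset, so the server needs no per-client state and executing the same WRITE several times leaves the same result.

```python
class StatelessServer:
    """Keeps no per-client state: every request names its file handle and offset."""

    def __init__(self):
        self.files = {}                       # file handle -> bytearray

    def create(self, handle):
        self.files.setdefault(handle, bytearray())
        return handle                         # the handle is returned to the client

    def write(self, handle, offset, data):
        buf = self.files[handle]
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        return len(data)

    def read(self, handle, offset, count):
        return bytes(self.files[handle][offset:offset + count])


def call_with_retry(request, replies_lost):
    """Resend the same request until a reply arrives.

    `replies_lost` simulates how many replies the network drops; the server
    still executes the request each time.
    """
    while True:
        result = request()
        if replies_lost == 0:
            return result
        replies_lost -= 1


server = StatelessServer()
fh = server.create("fh-42")                   # hypothetical opaque handle
call_with_retry(lambda: server.write(fh, 0, b"hello"), replies_lost=2)
print(server.read(fh, 0, 5))
```

An append-style request without an explicit offset would not have this property: three executions would leave three copies of the data.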
Cache:
- Client side:
  - cache file data and metadata, by block, in local memory as they are read from the server
  - Cache serves as a temporary buffer for writes (allows asynchronous writes)
  - Advantage: reduce network usage, improve performance
  - Disadvantage: writes buffered in memory are lost after a crash (safety vs. performance tradeoff)
- Server side:
  - server can buffer writes in memory and write them to disk asynchronously
  - Problem: writes in memory can be lost
  - Sol:
    - battery-backed memory
    - commit each WRITE to stable storage before ack WRITE success to clients
Cache consistency problem:
- Update visibility: when do updates from one client become visible at other clients?
  - sol: flush-on-close (write-back cache):
    - when a file is written to and subsequently closed by a client application, the client flushes all updates (i.e., dirty pages in the cache) to the server.
- Stale cache: once the server has a new version, how long before clients see the new version instead of an older cached copy?
  - sol: issue GETATTR to get file stats (last modified date), if the time-of-modification is more recent than the time that the file was fetched into the client cache, the client invalidates the cache and subsequent reads will go to the server.
  - Use attribute cache to reduce GETATTR requests (update attribute cache periodically)
  - Still has a problem: clients can still read stale values (polling interval; cache update/invalidation delayed by the network)
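
The polling scheme and its staleness window can be sketched as follows. This is a toy model with one file and a logical clock passed in as `now`; the class names and the `ATTR_TIMEOUT` value are assumptions for illustration, not NFS defaults.

```python
class Server:
    def __init__(self):
        self.data, self.mtime, self.getattr_calls = "v1", 1, 0

    def getattr(self):
        self.getattr_calls += 1
        return self.mtime

    def read(self):
        return self.data, self.mtime

    def write(self, data, now):
        self.data, self.mtime = data, now


class Client:
    ATTR_TIMEOUT = 3          # hypothetical attribute-cache lifetime, in ticks

    def __init__(self, server):
        self.server = server
        self.data = None
        self.fetched_mtime = None      # server mtime of the copy we cached
        self.attr_checked_at = None    # when the attribute cache was last refreshed

    def read(self, now):
        fresh_attrs = (self.attr_checked_at is not None
                       and now - self.attr_checked_at < self.ATTR_TIMEOUT)
        if self.data is not None and fresh_attrs:
            return self.data                       # no network traffic at all
        mtime = self.server.getattr()              # poll the server
        self.attr_checked_at = now
        if self.data is None or mtime > self.fetched_mtime:
            self.data, self.fetched_mtime = self.server.read()
        return self.data


server = Server()
client = Client(server)
print(client.read(now=10))     # first read: fetches v1
server.write("v2", now=11)     # another client updates the file
print(client.read(now=12))     # attribute cache still fresh: stale v1
print(client.read(now=13))     # timeout expired: GETATTR, then refetch v2
```

A longer timeout saves GETATTR traffic but widens the window in which a stale value is returned.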

Note

You may think the solutions to the cache consistency problems look a lot like write-back + invalidation. The general idea is the same. However, the solution here takes the client's perspective, whereas the definitions in my previous post take the server's perspective. More formally, we call the client's perspective a "client-initiated consistency protocol" and the server's perspective a "server-initiated consistency protocol".

Remarks

NFS issues:
- multiple clients updating the same file may get inconsistent views of the file (depends on cache update/invalidation and attribute cache polling frequency)
- A client crash may lose data buffered in its cache
NFS Key features:
- Location-transparent naming
- Client-side and server-side caching for performance
- Stateless architecture
- Client-initiated consistency protocol
Good in NFS:
- Simple
- Highly portable (open protocol)
Bad in NFS:
- Lack of strong consistency

Reference

"MapReduce: Simplified Data Processing on Large Clusters"

2018-04-30T23:20:00+08:00

Problem
Design assumptions
Programming model
Execution
System design
Remarks
Reference

Problem

Design a simple-to-use programming model (i.e., abstraction interface) that hides the messy details of parallelization, fault tolerance, data distribution, and load balancing from the user, and that can process a large amount of data in a reasonable amount of time.

Design assumptions

Large clusters of commodity PCs connected together with switched Ethernet
- Linux machines with 2-4 GB of memory
- 100 Mb/s or 1 Gb/s network bandwidth
- machine failures are common
- unreliable hardware
- distributed file system (use replication to provide availability and reliability)
- scheduler to schedule tasks to a set of available machines within a cluster

Programming model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs
The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.

Example:

Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.
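
A sequential sketch of the URL-count example. `run_mapreduce` is a hypothetical single-process stand-in for the library (map, group by intermediate key, reduce), and the `'METHOD URL'` log format is an assumption made for the example.

```python
from collections import defaultdict

def map_fn(_key, log_line):
    """Emit <URL, 1> for one request-log line (assumed format: 'METHOD URL')."""
    _method, url = log_line.split()
    yield url, 1

def reduce_fn(url, counts):
    """Add together all counts for one URL."""
    yield url, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = {}
    for ikey in sorted(groups):
        for okey, ovalue in reduce_fn(ikey, groups[ikey]):
            output[okey] = ovalue
    return output

log = enumerate(["GET /a", "GET /b", "POST /a", "GET /a"])
print(run_mapreduce(log, map_fn, reduce_fn))   # {'/a': 3, '/b': 1}
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the real library distributes across machines.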

Execution

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., $hash(key) \textbf{ mod } R$)
Execution Process:
1. The MapReduce library in the user program first splits the input files into M pieces. It then starts up many copies of the program on a cluster of machines.
2. One copy of the program is called Master. The rest are workers that are assigned work by the master. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker assigned a map task runs the Map function and buffers the output in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. A reduce worker reads the buffered data from the local disks of the map workers using RPC, then groups it by intermediate key.
6. The reduce worker runs the Reduce function and appends the output to the final output file for this reduce partition.
7. When all map and reduce tasks are done, the master wakes up the user program and returns control to it.
Map & Reduce tasks have three states: idle, in-progress, or completed
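
Steps 4 to 6 can be sketched in a single process. This is an illustration only: `spill` and `reduce_partition` are hypothetical names, the intermediate pairs are hard-coded, and CRC32 stands in for the hash function because Python's built-in `hash()` for strings varies between runs.

```python
import zlib

R = 3   # number of reduce partitions

def partition(key, r=R):
    """hash(key) mod R, using CRC32 so the result is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % r

def spill(intermediate_pairs, r=R):
    """What a map worker does on flush: split buffered pairs into R regions."""
    regions = [[] for _ in range(r)]
    for key, value in intermediate_pairs:
        regions[partition(key, r)].append((key, value))
    return regions

def reduce_partition(regions_from_all_maps):
    """What a reduce worker does: merge its region from every map, sort, sum."""
    pairs = sorted(p for region in regions_from_all_maps for p in region)
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

map_outputs = [[("/a", 1), ("/b", 1)], [("/a", 1), ("/c", 1)]]   # M = 2 map tasks
spilled = [spill(pairs) for pairs in map_outputs]
results = [reduce_partition([s[i] for s in spilled]) for i in range(R)]
print(results)
```

Because the partition depends only on the key, every occurrence of a key lands in the same reduce partition no matter which map task produced it.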

System design

During execution, the output is buffered in memory and written to disk in batches to reduce disk I/O overhead.
Fault tolerance:
- Master pings every worker periodically to detect worker failure
- Completed map tasks are re-executed on a failure (their intermediate output is stored on the failed machine's local disk and is thus inaccessible)
- Completed reduce tasks are not re-executed on failures (results are in GFS, which is already replicated)
- Master Failure:
  - Periodic checkpoints
  - Aborts MapReduce computation if master fails (current implementation)
Sequential consistency: when the Map and Reduce operations are deterministic functions of their input, the distributed execution produces the same output as a non-faulting sequential execution of the entire program:
- Atomic commits of map and reduce task outputs
Trade-off on each task size vs. M or R:
- $O(M + R)$ scheduling decisions
- $O(M \cdot R)$ states in memory
- small task size -> large M or R, given that the input job size is fixed
- large task size -> small M or R, given that the input job size is fixed (makes parallelism meaningless)
Handle "straggler":
- "Straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
- When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
Locality: schedule map tasks on the machines where replicas of the input data are stored (input data can be read locally and consumes no network bandwidth). This is the idea of "pushing the program to the data node".
Refinements:
- Use a custom partitioning function on the intermediate keys so that, for example, all URLs from the same host end up in the same output file. This makes the output easier to use from the application layer.
- within a given partition, the intermediate key/value pairs are processed in increasing key order. This allows efficient lookup in the output file.
- Optional Combiner function that does partial merging of the intermediate keys on the map task worker before the data is sent over the network to the reduce worker (reduce network usage)
- the MapReduce library detects which records cause deterministic crashes and skips these records in order to make forward progress.

Remarks

Re-execution to provide fault tolerance is a commonly seen technique (used in Spark as well) in "Big data" papers
Spawning duplicate tasks to handle the "straggler" problem is also common
Locality is a nice trick to reduce bandwidth usage
Stonebraker's video on database research mentions that MapReduce is no longer used in Google. We as researchers should be critical of the papers produced by "whales" and not treat them as the gold standard (we do because we lose connection to industry). Totally different topic.
I don't think MapReduce is database work. I feel it is just a framework that allows Google to get their job done (consider that Jeff Dean's major focus is on programming languages, not databases). Unlike a database, the framework is hard to generalize.
Spot Remzi's papers in the reference section. Neat!

Reference

MapReduce: Simplified Data Processing on Large Clusters

"Scaling Distributed Machine Learning with the Parameter Server"

2018-04-30T01:30:00+08:00

Problem to solve
Challenges
Design assumptions
Design
Remarks
Reference

Problem to solve

Build a parameter server framework for distributed machine learning problems.

Challenges

A machine learning model can have $10^9$ to $10^{12}$ shared parameters that need to be accessed frequently

High network bandwidth requirement
Synchronization cost and high machine latency can hurt model performance
Fault tolerance is critical

Design assumptions

Machine learning algorithms are quite tolerant to perturbations
Machine learning algorithms can be thought of as consisting of data + parameters

Design

Use a group of parameter servers
- can all serve one algorithm to have high availability
- can also run more than one algorithm simultaneously
A server node in the server group maintains a partition of the globally shared parameters.

Note

Partitioning might be a bad idea from the fault tolerance and availability perspectives (the two main purposes behind doing replication). By partitioning, we reduce the workload on each server, but we still effectively use one server to serve a given set of parameters, which makes the server group idea meaningless. In other words, we want to use several servers to serve the same set of parameters to increase fault tolerance and availability.

Parameter servers in the server group partition keys using consistent hashing with virtual nodes (See Dynamo Paper for details).
- Virtual nodes adv: improve load balancing & recovery
- Consistent hashing adv: failure locality (only three nodes are affected)
Live replication of parameters between servers supports hot failover
The communication of parameter updates and processing are batched to reduce network overhead (e.g. send a segment of a vector or entire row of matrix)
The communication is also compressed to reduce network usage
Parameters are stored using (key, value) vectors to facilitate linear algebra operations
Tasks are executed asynchronously: the caller can perform further computation immediately after issuing a task.
Flexible consistency: training iterations vs. throughput tradeoff
- Eventual consistency: all tasks may be started simultaneously. Highest throughput (i.e., system efficiency) but the algorithm may take more iterations to converge because the update may be on stale parameters. Thus, eventual consistency model is recommended for algorithms that are agnostic to delayed parameter value.
- Sequential consistency: all tasks are executed one by one. Lowest throughput but all parameters are guaranteed to be latest.
- Bounded Delay: consistency model between sequential consistency and eventual consistency.
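
One way to read bounded delay is as a single knob between the two extremes. The sketch below assumes a maximal delay `tau` (my notation and my reading of the model, not spelled out in the notes above): a task may start once every task issued more than `tau` steps earlier has finished. The function name `can_start` is hypothetical.

```python
import math

def can_start(task, finished, tau):
    """A task may start once every task issued more than `tau` steps earlier is done.

    tau = 0        -> sequential: wait for all previous tasks
    tau = math.inf -> eventual: never wait
    """
    return all(t in finished for t in range(task) if task - t > tau)

finished = {0, 1}                                # tasks 2 and 3 are still running
print(can_start(4, finished, tau=0))             # False: tasks 2 and 3 must finish
print(can_start(4, finished, tau=2))             # True: only tasks 0 and 1 are required
print(can_start(4, finished, tau=math.inf))      # True: no waiting at all
```

A larger `tau` raises throughput but lets a task compute on parameters that are up to `tau` updates stale.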
Server nodes replicate updates after aggregation to reduce network usage
Worker crash: we don't recover the worker node because:
- when the training data is large, recovering a worker node is very expensive
- Losing a small amount of training data during optimization affects the model only a little
We can spawn a new task if one of the machines appears to be slow (to handle the straggler problem)

Remarks

Worker nodes may need to access the auxiliary metadata. In design, we always need to think of using metadata.
A server manager node maintains a consistent view of the metadata of the servers, such as node liveness and the assignment of parameter partitions. This is the place to use Paxos.

Note

Paxos can also be used when we have a small set of parameter servers: their membership can be stored using Paxos. Usually, Paxos does not scale well beyond about 5 servers.

The fact that machine learning algorithms tolerate stale data is the major point we exploit when designing a "Big data + ML" system

Reference

Scaling Distributed Machine Learning with the Parameter Server

"Dynamo: Amazon’s Highly Available Key-value Store"

2018-04-14T20:24:00+08:00

Problem, Motivation
System Assumptions and Requirements
Design Considerations
Design Requirement
Architecture
Remarks & Thoughts
Further Readings
Reference

Problem, Motivation

How to build a highly available (i.e., reliable) and scalable distributed key-value storage system?

System Assumptions and Requirements

Query Model: simple KV store operations: read & write
- store small objects (<= 1 MB)
- no operations span multiple data items; no need for relational schema
ACID properties: trade consistency in ACID for high availability
- no isolation guarantees
- permits only single key updates
Efficiency: commodity machines; stringent latency requirements specified by SLAs
- SLA (Service Level Agreement): a contract where a client and a service agree on several system-related characteristics: client’s expected request rate distribution for a particular API, the expected service latency under those conditions, etc.
- Dynamo:
  - SLAs measured at the 99.9th percentile of the distribution (e.g., a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second)
  - configurable: to meet service latency & throughput requirements

Note

Dynamo is an AP system
Dynamo measures tail latency instead of average latency because it wants to optimize the worst-case scenario. If a system is designed for the common usage situations, average latency is a fine measurement.

Design Considerations

Strong consistency and high data availability cannot be achieved simultaneously
- Dynamo sacrifices strong consistency: Dynamo is designed to be an eventually consistent data store; that is, all updates reach all replicas eventually.
Increase availability for system prone to server and network failures: Optimistic replication techniques
- changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated.
- Conflicts must be detected and resolved:
  - When (whether conflicts should be resolved during reads or writes?):
    - Dynamo: resolved when read; writes are never rejected
  - Who (who performs the process of conflict resolution: data store or application?):
    - Dynamo: the application; data store implements a simple policy (i.e., "last write wins")
Incremental scalability: scale out one storage host (node) at a time with minimal impact
Symmetry: every node in Dynamo should have the same set of responsibilities as its peers (i.e., no special nodes)
Decentralization: design favors decentralized peer-to-peer techniques
Heterogeneity: system needs to be able to exploit heterogeneity in the infrastructure it runs on (e.g. the work distribution must be proportional to the capabilities of the individual servers)

Design Requirement

"Always writeable": no updates are rejected due to failures or concurrent writes
All nodes are assumed to be trusted
No support for hierarchical namespaces (a norm in many file systems) or complex relational schema
Built for latency-sensitive applications: at least 99.9% of read and write operations are to be performed within a few hundred milliseconds

Architecture

Note

The architecture of a storage system in a production setting needs to include:

actual data persistence component
load balancing
membership and failure detection
failure recovery
replica synchronization
overload handling
state transfer
concurrency and job scheduling
request marshalling
request routing
system monitoring and alarming
configuration management

Note

The relation between Table 1 entry and the following sections:

Partitioning (Partitioning Algorithm, Replication)
High Availability for writes (Data Versioning)
Handling temporary failures (Execution of get () and put () operations, Handling temporary failures: Hinted Handoff)
Recovering from permanent failures (Handling permanent failures: Replica synchronization)

Partitioning Algorithm

Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts (i.e., nodes).
Consistent hashing:

the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring.
Pro & Cons of consistent hashing:
- Pro:
  - departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected
- Cons:
  - the random position assignment of each node on the ring leads to non-uniform data and load distribution
  - The basic algorithm is oblivious to the heterogeneity in the performance of nodes (i.e., some nodes may have a more powerful setup, but consistent hashing treats them the same as the others)
Dynamo uses a variant of consistent hashing:

Instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (i.e., “tokens”) in the ring.
Advantages of using virtual nodes:
- If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes
- When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes
- The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure
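
To make the ring and the virtual nodes concrete, here is a minimal sketch. The class name `Ring`, the `#vnode` label format, and the use of MD5 as the ring hash are choices made for this illustration, not details taken from Dynamo's implementation.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the ring (here: the 128-bit MD5 space)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring where each physical node owns several tokens."""

    def __init__(self):
        self._tokens = []  # sorted token positions on the ring
        self._owner = {}   # token position -> physical node name

    def add_node(self, node: str, num_tokens: int) -> None:
        # A more capable machine can simply be given more tokens (virtual nodes).
        for i in range(num_tokens):
            token = ring_position(f"{node}#vnode{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def remove_node(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first token larger than the key's position,
        # wrapping around to the smallest token at the end of the ring.
        idx = bisect.bisect_right(self._tokens, ring_position(key))
        return self._owner[self._tokens[idx % len(self._tokens)]]


ring = Ring()
for name in ("A", "B", "C"):
    ring.add_node(name, num_tokens=8)
print(ring.lookup("shopping-cart-42"))
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is the "only neighbors are affected" property listed above.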

Replication

To achieve high availability and durability, Dynamo replicates its data on multiple hosts.

Each data item is replicated at N hosts, where N is a parameter configured “per-instance”. Each key, k, is assigned to a coordinator node (the node that a key is assigned to in consistent hashing; the first among the top N nodes in the preference list). The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor.

Example: node B replicates the key k at nodes C and D in addition to storing it locally. Node D will store the keys that fall in the ranges (A, B], (B, C], and (C, D].

The list of nodes that is responsible for storing a particular key is called the preference list. Every node in the system can determine which nodes should be in this list for any particular key. To account for node failures, preference list contains more than N nodes. The preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes (first N virtual nodes may all be hosted by one physical node).
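
A rough sketch of how such a preference list could be built. The ring here is a hand-written list of `(token_position, physical_node)` pairs and the function name is made up for the example; the only point is the "skip tokens whose physical host was already chosen" step.

```python
import bisect


def preference_list(ring, key_pos, n):
    """Return the first n distinct physical nodes clockwise from key_pos.

    ring is a list of (token_position, physical_node) pairs sorted by position.
    """
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_pos)
    chosen = []
    for step in range(len(ring)):
        _, node = ring[(start + step) % len(ring)]
        if node not in chosen:  # skip virtual nodes of hosts we already have
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


# Node A owns three tokens, so naively taking the next 3 tokens would pick A twice.
toy_ring = [(10, "A"), (20, "A"), (30, "B"), (40, "A"), (50, "C"), (60, "D")]
print(preference_list(toy_ring, key_pos=5, n=3))   # ['A', 'B', 'C']
print(preference_list(toy_ring, key_pos=55, n=3))  # ['D', 'A', 'B']
```

Asking for more than N entries gives the longer list that is used to ride out node failures.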

Data Versioning

Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously.
Dynamo allows multiple versions of the same object to exist at the same time. If the system cannot reconcile the versions on its own, the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation).

Note

Works like Git: if Git can merge the different modifications into one, the merge is done automatically for you. If not, you (the client) have to reconcile the conflicts manually.

Dynamo uses vector clocks in order to capture causality between different versions of the same object.

Note

Each object (e.g., D5) contains a vector clock (e.g., ([Sx,3],[Sy,1],[Sz,1])). Note that it is not a list of vector clocks. You can think about Sx, Sy, Sz as process names to map to the original vector clock example.

Dynamo uses a clock truncation scheme to control the size of vector clocks
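
To see how a vector clock captures causality, here is a small sketch that stores a clock as a dict from node name to counter. The helper names (`descends`, `compare`, `increment`, `merge`) are made up for this example; the version history is chosen so that it ends in the D5 clock mentioned above.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is at least as new as clock b on every entry of b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # neither descends from the other: needs reconciliation


def increment(clock: dict, node: str) -> dict:
    """Return a copy of clock with node's counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


d2 = {"Sx": 2}
d3 = increment(d2, "Sy")             # write handled by Sy
d4 = increment(d2, "Sz")             # concurrent write handled by Sz
print(compare(d3, d4))               # concurrent
d5 = increment(merge(d3, d4), "Sx")  # client reconciles, Sx coordinates the write
print(d5 == {"Sx": 3, "Sy": 1, "Sz": 1})  # True
```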

Execution of get () and put () operations

Any storage node in Dynamo is eligible to receive client get and put operations for any key
Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible (remember preference list contains more than N nodes).
To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems.

This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.

Note

Dynamo client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability:

N determines the durability of each object
The values of W and R impact object availability, durability and consistency
- If W is set to 1, then the system will never reject a write request as long as there is at least one node in the system that can successfully process a write request
- low values of W and R can increase the risk of inconsistency as write requests are deemed successful and returned to the clients even if they are not processed by a majority of the replicas; This also introduces a vulnerability window for durability when a write request is successfully returned to the client even though it has been persisted at only a small number of nodes.
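
The reason R + W > N matters is that any read set must then share at least one replica with any write set. A brute-force check over small configurations (the function name is just for this sketch):

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every size-r read set intersects every size-w write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_always_overlap(3, 2, 2))  # True:  R + W > N
print(quorums_always_overlap(3, 1, 1))  # False: a read can miss the latest write
```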

Execution of put() operation:
- Coordinator generates the vector clock for the new version
- Coordinator writes the new version locally
- Coordinator sends the new version + vector clock to the N highest-ranked reachable nodes
- If at least W-1 nodes respond then the write is considered successful.

Execution of get() operation:
- Coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key
- Coordinator waits for R responses before returning the result to the client
- The reconciled version produced by the application is written back

Handling temporary failures: Hinted Handoff

Cons of traditional quorum approach:

Unavailable during server failures and network partitions and durability reduced under the simplest of failure conditions
Dynamo does not enforce strict quorum membership and instead uses a "sloppy quorum":

all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.
Hinted Handoff:

Consider the example of Dynamo configuration given in Figure 2 with N=3. In this example, if node A is temporarily down or unreachable during a write operation then a replica that would normally have lived on A will now be sent to node D. This is done to maintain the desired availability and durability guarantees. The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A). Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A. Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
Using hinted handoff, Dynamo ensures that read and write operations do not fail due to temporary node or network failures.
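
A small sketch of how the write targets could be picked under a sloppy quorum. The function name and the `(node, hint)` return shape are assumptions for this illustration; the scenario mirrors the A/D example above.

```python
def choose_write_targets(preference_list, n, is_healthy):
    """Pick the first n healthy nodes from the preference list.

    Returns (node, hint) pairs. hint is None when the node is one of the
    normal top-n replicas, otherwise it names the unreachable node that this
    replica was intended for.
    """
    intended = preference_list[:n]
    down = [node for node in intended if not is_healthy(node)]
    targets = [(node, None) for node in intended if is_healthy(node)]
    for node in preference_list[n:]:
        if not down:
            break
        if is_healthy(node):
            targets.append((node, down.pop(0)))  # stand-in replica with a hint
    return targets


# N = 3, node A is temporarily unreachable.
targets = choose_write_targets(["A", "B", "C", "D", "E"], 3, lambda node: node != "A")
print(targets)  # [('B', None), ('C', None), ('D', 'A')]
```

Once A is reachable again, D would hand the hinted replica back to A and could then drop its local copy.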

Handling permanent failures: Replica synchronization

Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.
To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees:

A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas. For instance, if the hash values of the root of two trees are equal, then the values of the leaf nodes in the tree are equal and the nodes require no synchronization. If not, it implies that the values of some replicas are different. In such cases, the nodes may exchange the hash values of children and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”. Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.
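
A minimal sketch of the comparison process. It assumes both replicas hold the same key range laid out in the same order, that the number of leaves is a power of two, and it uses SHA-256 purely as an example hash; none of these are details from the paper.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(values):
    """Return the tree as a list of levels, leaves first and root last."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(tree_a, tree_b):
    """Return the leaf indexes that differ, descending only into mismatched subtrees."""
    out_of_sync = []
    stack = [(len(tree_a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        depth, idx = stack.pop()
        if tree_a[depth][idx] == tree_b[depth][idx]:
            continue  # identical subtree: nothing below needs to be exchanged
        if depth == 0:
            out_of_sync.append(idx)
        else:
            stack.append((depth - 1, 2 * idx))
            stack.append((depth - 1, 2 * idx + 1))
    return sorted(out_of_sync)


replica_a = [f"value{i}".encode() for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = b"stale value"
print(diff(build_tree(replica_a), build_tree(replica_b)))  # [5]
```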

Membership and Failure Detection

Use explicit command to add and remove nodes from a Dynamo Ring:
- Adv: nodes may be temporarily down, and we don't have to immediately redistribute the workload (i.e., treat them as having left the ring) whenever some nodes are unreachable. Redistributing the workload is expensive.
To prevent logical partitions, some Dynamo nodes play the role of seeds:
- Case: node A joins the ring, then node B joins the ring; A and B each consider themselves members of the ring, yet neither is immediately aware of the other
- Seeds are nodes that are discovered via an external mechanism and are known to all nodes.
- Typically seeds are fully functional nodes in the Dynamo ring.
- Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely.
node A may consider node B failed if node B does not respond to node A’s messages (even if B is responsive to node C's messages)

Decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival (or departure) of other nodes.

Remarks & Thoughts

I really like this paper. It connects all the classic distributed techniques (i.e., gossip, quorum, consistent hashing, merkle tree) into one.

Reference

"Fast Crash Recovery in RAMCloud"

2018-04-09T22:00:00+08:00

Problem, Motivation
Challenges
Assumptions
Architecture
Durability and Availability
- Buffered Logging
- Fast Recovery
Other interesting details
Remarks & Thoughts
Reference

Problem, Motivation

RAMCloud is a large-scale general-purpose DRAM storage system for datacenters. The system is motivated by the fact that large-scale applications struggle to utilize DRAM to its full potential:

DRAM is still mostly used as a cache for some other storage system
Developers have to manage consistency between caches in DRAM and its storage system
Cache misses and backing store overheads

It has four design goals:

Scalability: 1000-10000 commodity servers with 32-64 GB DRAM/server
Low latency: uniform low-latency access (5-10 μs round-trip times for small read operations)
High throughput: 1M ops/sec/server
High durability and availability

This paper focuses on "high durability and availability". Replicating all data (x3) in DRAM would fix some availability issues but would triple the cost and energy usage of the system. Thus, RAMCloud stores only a single copy of data in DRAM, which raises an availability problem: what happens when a server crashes? RAMCloud’s solution to the availability problem is fast crash recovery. The problem then becomes: how can the system recover from a crash within 1-2 seconds for 64 GB or more of DRAM data?

Challenges

Durability: RAM lacks durability. Data is unavailable on crashed nodes.
Availability: How to recover as soon as possible?
- Fast writes: Synchronous disk I/O’s during writes?? Too slow
- Fast crash recovery: Data unavailable after crashes?? No!
Large scale: 10,000 nodes, 100TB to 1PB

Assumptions

Use low-latency Infiniband NICs and switches
- Ethernet switches and NICs typically add at least 200-500 μs to round-trip latency in a large datacenter
Backups use an auxiliary power source
- to ensure that buffers can be written to stable storage after a power failure
Every byte of data is in DRAM

Architecture

Each storage server contains a master and a backup. A central coordinator ¹ manages the server pool and tablet configuration. Client applications run on separate machines and access RAMCloud using a client library that makes remote procedure calls.
- master: manages RAMCloud objects in its DRAM and services client requests
- backup: stores redundant copies of objects from other masters using its disk or flash memory
- coordinator: manages configuration information such as the network addresses of the storage servers and the locations of objects
- tablets: consecutive key ranges within a single table
Data model: object consists of [identifier(64b), version(64b), Blob(<=1MB)]

Durability and Availability

Durability: 1 copy in DRAM; Backup copies on disk/flash: durability ~ free!
Availability:
- Fast writes: Buffered Logging
- Fast crash recovery: Large-scale parallelism to reconstruct data (similar to MapReduce)

Buffered Logging

When a master receives a write request, it updates its in-memory log and forwards the new data to several backups, which buffer the data in their memory. The master maintains a hash table to record the locations of data objects. The data is eventually written to disk or flash in large batches. Backups must use an auxiliary power source to ensure that buffers can be written to stable storage after a power failure.
- No disk I/O during write requests
- Master’s memory also log-structured
- Log cleaning ~ generational garbage collection
- master's log is divided into 8MB segments
- A hash table is used to quickly look up objects in the log
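
A toy sketch of the write path described above. The classes `Master` and `Backup` and their methods are invented for this illustration; segments, log cleaning, and RPCs are all left out.

```python
class Backup:
    """Buffers log entries in memory and writes them to 'disk' in batches."""

    def __init__(self):
        self.buffer = []  # in-memory buffer (assumed to survive power loss)
        self.disk = []    # stands in for disk or flash

    def append(self, entry):
        self.buffer.append(entry)

    def flush(self):
        self.disk.extend(self.buffer)  # one large batched write
        self.buffer.clear()


class Master:
    """Keeps all objects in an append-only in-memory log plus a hash table."""

    def __init__(self, backups):
        self.log = []    # append-only in-memory log
        self.index = {}  # hash table: key -> offset of the newest version
        self.backups = backups

    def write(self, key, value):
        entry = (key, value)
        self.index[key] = len(self.log)
        self.log.append(entry)
        for backup in self.backups:
            backup.append(entry)  # buffered only: no disk I/O on the write path

    def read(self, key):
        return self.log[self.index[key]][1]


backups = [Backup(), Backup()]
master = Master(backups)
master.write("x", 1)
master.write("x", 2)  # the old version stays in the log until it is cleaned
print(master.read("x"), len(master.log))  # 2 2
```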

Note

This idea is borrowed from log-structured file systems. Vijay finds the in-memory log structure interesting.

Fast Recovery

Three different recovery schemes:
- One recovery master, small backup servers (disk bandwidth bottleneck)
- One recovery master, large backup servers (network bandwidth bottleneck)
- Several recovery masters, large backup servers (good!)

Divide each master’s data into partitions
- Partition and scatter log data across many backups at random, so that backup data can be read in parallel when the master crashes.
- Recover each partition on a separate recovery master
- Partitions based on tables & key ranges, not log segment
- Each backup divides its log data among recovery masters
Each master computes the strategy for forming partitions and uploads it to the coordinator as its "will". The coordinator follows the crashed master's will to divide the crashed master's data into partitions and assigns the recovery work to recovery masters (see section 3.5.3)

Other interesting details

Each RAMCloud master decides independently where to place each replica, using a combination of randomization and refinement.

When a master needs to select a backup for a segment, it chooses several candidates at random from a list of all backups in the cluster. Then it selects the best candidate, using its knowledge of where it has already allocated segment replicas and information about the speed of each backup’s disk. The best backup is the one that can read its share of the master’s segment replicas most quickly from disk during recovery. A backup is rejected if it is in the same rack as the master or any other replica for the current segment. Once a backup has been selected, the master contacts that backup to reserve space for the segment. At this point the backup can reject the request if it is overloaded, in which case the master selects another candidate.

Note

Advantages of randomization + refinement:

eliminates pathological behavior such as all masters choosing the same backups in a lock-step fashion
provides a solution nearly as optimal as a centralized manager
makes segment distribution nearly uniform
- compensate for each machine difference: more powerful machine, high disk speed, more likely to be selected
- handles the entry of new backups gracefully: new machine, less workload, more likely to be selected
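
The "randomization + refinement" idea can be sketched as follows. The dict fields, the default of three candidates, and the read-time estimate (replica count divided by disk speed) are assumptions for the example rather than RAMCloud's actual bookkeeping.

```python
import random


def choose_backup(backups, master_rack, used_racks, num_candidates=3, rng=random):
    """Pick a backup for a new segment replica.

    backups: list of dicts with keys 'name', 'rack', 'disk_mb_s', and
    'segments' (replicas of this master's segments the backup already stores).
    """
    # Randomization: only look at a few randomly chosen candidates.
    candidates = rng.sample(backups, min(num_candidates, len(backups)))
    # Reject candidates in the master's rack or in a rack already used
    # by another replica of the current segment.
    eligible = [
        b for b in candidates
        if b["rack"] != master_rack and b["rack"] not in used_racks
    ]
    if not eligible:
        return None  # the caller would draw a fresh set of candidates
    # Refinement: prefer the backup that could read its share of this
    # master's replicas from disk fastest during recovery.
    return min(eligible, key=lambda b: (b["segments"] + 1) / b["disk_mb_s"])


cluster = [
    {"name": "b1", "rack": "r1", "disk_mb_s": 100, "segments": 0},
    {"name": "b2", "rack": "r2", "disk_mb_s": 100, "segments": 4},
    {"name": "b3", "rack": "r3", "disk_mb_s": 50, "segments": 0},
    {"name": "b4", "rack": "r4", "disk_mb_s": 200, "segments": 1},
]
pick = choose_backup(cluster, master_rack="r1", used_racks={"r4"}, num_candidates=4)
print(pick["name"])  # b3
```

A new or faster backup scores well on the read-time estimate, which is how the scheme compensates for machine differences without any central manager.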

Remarks & Thoughts

Each segment is randomly scattered across multiple backups and recovery proceeds in parallel, which reminds me of MapReduce. Segments are distributed uniformly across backups, which mirrors chunking the data evenly in the Map phase. Recovery is spread over multiple recovery masters, each of which handles only part of the data to be recovered, which reminds me of the Reduce phase. Even Prof. John Ousterhout says in the video that MapReduce almost solves their problem.

This finding is quite astonishing to me because the MapReduce paper came out in 2004 and RAMCloud came out in 2011. If we look at prior work from a slightly different angle, we may find something new. That's thought-provoking for me in terms of research.

Reference

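The coordinator will use ZooKeeper to store its configuration information, which consists of a list of active storage servers along with the tablets they manage. ZooKeeper itself is a very reliable notepad for recording summary information. Hadoop relies on this notepad to record which nodes are currently in use, which have gone offline, which are on standby, and so on, and uses it to manage the cluster.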

"PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees"

2018-03-30T00:45:00+08:00

Introduction
Background
Fragmented Log-Structured Merge Tree (FLSM)
Building PebblesDB over FLSM
- Improving Read Performance
- Improving Range Query Performance
Remarks
Reference
Further Reading

--- 05/22/18 UPDATE ---

Note

I wrote this post when I had just started to read systems papers, before I had even read through PebblesDB's code. My paper-reading skill was lacking and I could not fully grasp the essence of the paper back then. Please jump to the Remarks section for a concise summary of LSM and FLSM (PebblesDB's data structure). The other sections are filled with unnecessary details and are helpful only when you want to build something based on the paper's implementation.

Introduction

One fundamental problem is the high write amplification of key-value stores for write-intensive workloads.
Write amplification = the ratio of the total write IO performed by the store to the total user data written
High write amplification is bad
- increases the load on storage devices such as SSDs, which have limited write cycles before the bit error rate becomes unacceptable
- results in frequent device wear-out and high storage costs
- reduces write throughput
  
  RocksDB write throughput is 10% of read throughput due to write amplification

Note

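Although write amplification in LSM has been studied a lot recently, write amplification itself is a very old problem. In a computer system, if two adjacent layers use different processing units, or if the application has special requirements such as consistency, write amplification is likely to appear. Examples include CPU cache and memory cells, file system blocks and disk sectors, database blocks and file system blocks, database redo/undo, file system journals, and so on.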

Intuition for reducing write amplification: the log-structured merge tree (LSM) data structure is the root cause of the write amplification

LSM stores maintain data in sorted order on storage, enabling efficient querying of data. However, when new data is inserted into an LSM-store, existing data is rewritten to maintain the sorted order, resulting in large amounts of write IO.
Key idea to reduce write amplification:
- Combine LSM with skip list: fragmenting data into smaller chunks that are organized using guards on storage. Guards allow FLSM to find keys efficiently. ¹
Why the idea can improve write throughput intuitively:
- Write operations on LSM stores are often stalled or blocked while data is compacted (rewritten for better read performance); by drastically reducing write IO, FLSM makes compaction significantly faster, thereby increasing write throughput.

Background

Key-Value Store Operations

The get(key) operation returns the latest value associated with key.
The put(key, value) operation stores the mapping from key to value in the store. If key was already present in the store, its associated value is updated.
Some key-value stores such as LevelDB provide an iterator over the entire key-value store. it.seek(key) positions the iterator it at the smallest key >= key. The it.next() call moves it to the next key in sequence. The it.value() call returns the value associated with the key at the current iterator position.
The range_query(key1, key2) operation returns all key-value pairs falling within the given range. Range queries are often implemented by doing a seek() to key1 and doing next() calls until the iterator passes key2.

LSM

LSM ² is treated as a replacement for B+ Tree
Why not B+ Tree:
- low write throughput: B+ Trees are a poor fit for write-intensive workloads: updating the tree requires multiple random writes (10-100X slower than sequential writes).
- high write amplification (61X write amplification)
The log-structured merge trees (LSM) data structure takes advantage of high sequential bandwidth by only writing sequentially to storage. Writes are batched together in memory and written to storage as a sequential log (termed an sstable). Each sstable contains a sorted sequence of keys.

Note

A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs. Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone key:offset index for fast access. That's all an SSTable is: very simple, but also a very useful way to exchange large, sorted data segments.

Sstables on storage are organized as hierarchy of levels. Each level contains multiple sstables, and has a maximum size for its sstables.

In a 5-level LSM, Level 0 is the lowest level and Level 5 is the highest level. The amount of data (and the number of sstables) in each level increases as the levels get higher. The last level in an LSM may contain hundreds of gigabytes. Application data usually flows into the lower levels and is then compacted into the higher levels. The lower levels are usually cached in memory.
LSM maintains the following invariant at each level: all sstables contain disjoint sets of keys.

For example, a level might contain three sstables: $[1 \dots 6], [8 \dots 12]$, and $[100 \dots 105]$. Each key will be present in exactly one sstable on a given level. As a result, locating a key requires only two binary searches: one binary search on the starting keys of sstables (maintained separately) to locate the correct sstable and another binary search inside the sstable to find the key. If the search fails, the key is not present in that level.
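
The two binary searches can be sketched with Python's `bisect` module. The sstables are modeled as plain sorted lists and the exact keys inside each range are made up to match the example ranges.

```python
import bisect


def find_in_level(sstables, key) -> bool:
    """sstables: sorted key lists with disjoint ranges, ordered by first key."""
    start_keys = [table[0] for table in sstables]
    # First binary search: the only sstable whose range could contain the key.
    i = bisect.bisect_right(start_keys, key) - 1
    if i < 0:
        return False  # key is smaller than every sstable's first key
    table = sstables[i]
    # Second binary search: look for the key inside that sstable.
    j = bisect.bisect_left(table, key)
    return j < len(table) and table[j] == key


level = [[1, 2, 3, 4, 5, 6], [8, 9, 10, 11, 12], [100, 101, 102, 103, 104, 105]]
print(find_in_level(level, 9))   # True
print(find_in_level(level, 50))  # False: falls in the gap after [8 ... 12]
```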

LSM Operations

get() returns the latest value of the key

Since the most recent data will be in lower levels, the key-value store searches for the key level by level, starting from Level 0; if it finds the key, it returns the value. Each key has a sequence number that indicates its version. Finding the key at each level requires reading and searching exactly one sstable.
seek() and next() require positioning an iterator over the entire key-value store.

implemented using multiple iterators (one per level); each iterator is first positioned inside the appropriate sstable in each level, and the iterator results are merged. The seek() requires finding the appropriate sstables on each level, and positioning the sstable iterators. The results of the sstable iterators are merged (by identifying the smallest key) to position the key-value store iterator. The next() operation simply advances the correct sstable iterator, merges the iterators again, and re-positions the key-value store iterator.
put() writes the key-value pair, along with a monotonically increasing sequence number, to an in-memory skip list called the memtable. When the memtable reaches a certain size, it is written to storage as a sstable at Level 0. When each level contains a threshold number of files, it is compacted into the next level.

Assume Level 0 contains [2, 3] and [10, 12] sstables. If Level 1 contains [1,4] and [9, 13] sstables, then during compaction, Level 1 sstables are rewritten as [1, 2, 3, 4] and [9, 10, 12, 13], merging the sstables from Level 0 and Level 1. Compacting sstables reduces the total number of sstables in the key-value store and pushes colder data into higher levels. The lower levels are usually cached in memory, thus leading to faster reads of recent data.
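
The compaction example above can be reproduced with a short sketch. It works on keys only (no values or sequence numbers) and assumes Level 1 is non-empty; the function name is made up for the illustration.

```python
import bisect
import heapq


def compact(level0, level1):
    """Merge Level 0 sstables into Level 1, rewriting the Level 1 sstables."""
    incoming = list(heapq.merge(*level0))  # merge-sort the Level 0 keys
    starts = [table[0] for table in level1]
    buckets = [[] for _ in level1]
    for key in incoming:
        # Route each incoming key to the Level 1 sstable covering its range.
        i = max(bisect.bisect_right(starts, key) - 1, 0)
        buckets[i].append(key)
    # Every Level 1 sstable is rewritten together with its incoming keys.
    return [list(heapq.merge(old, new)) for old, new in zip(level1, buckets)]


print(compact([[2, 3], [10, 12]], [[1, 4], [9, 13]]))
# [[1, 2, 3, 4], [9, 10, 12, 13]]
```

Four new keys arrive, but eight keys are written to Level 1, because the four existing keys are rewritten as well.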

Note

Think about memtable as in-memory SSTable.

Updating or deleting keys in LSM-based stores does not update the key in place, since all write IO is sequential. Instead, the key is inserted once again into the memtable with a higher sequence number; a deleted key is inserted again with a special flag (often called a tombstone flag). Due to the higher sequence number, the latest version of the key will be returned by the store to the user.
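
A tiny sketch of out-of-place updates with sequence numbers and tombstones. `TinyStore` is a made-up class that keeps everything in one append-only list; a real store would spread the entries over the memtable and sstables.

```python
TOMBSTONE = object()  # special marker standing in for the tombstone flag


class TinyStore:
    def __init__(self):
        self.entries = []  # append-only list of (key, sequence number, value)
        self.seq = 0

    def put(self, key, value):
        self.seq += 1  # monotonically increasing sequence number
        self.entries.append((key, self.seq, value))

    def delete(self, key):
        self.put(key, TOMBSTONE)  # a delete is just another insert

    def get(self, key):
        versions = [(seq, value) for k, seq, value in self.entries if k == key]
        if not versions:
            return None
        _, value = max(versions, key=lambda v: v[0])  # highest sequence number wins
        return None if value is TOMBSTONE else value


store = TinyStore()
store.put("a", 1)
store.put("a", 2)
print(store.get("a"))      # 2
store.delete("a")
print(store.get("a"))      # None
print(len(store.entries))  # 3: nothing was updated in place
```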

Write Amplification: Root Cause

The root cause for write amplification: multiple rewrites of sstables during compaction. In other words, sstables can be rewritten multiple times when new data is compacted into them.

For example, when compaction happens from $t_1$ to $t_2$, sstable with [1,100] has to be rewritten to [1,10,100] and sstable with [200,400] has to be rewritten as [200,210,400] (i.e., We have to read [10,210], [1,100], [200,400] out of levels, merge sort them, and write them back.)

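The keys contained in an L0 file are also covered by multiple (or even all) files in L1. So to push L0 down to L1, the keys of all the L0/L1 files involved have to be read out, re-sorted, and written to L1. Typically the amount of data in L0 is 1/10 of that in L1, so rewriting all of the data for such a small amount of new data is clearly not worth it. The same reasoning applies to L1...Ln.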

Note

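The essence of the amplification problem is how strongly a system demands being "globally ordered at all times". "At all times" means that no write may leave the system unordered; "globally" means that order must be maintained between any two units within the system. The B-Tree family is the typical representative of structures that are globally ordered at all times. The fractal tree breaks the "global" constraint, allowing local disorder and improving random write capability. The LSM family further breaks the "at all times" constraint, allowing background compaction to restore order. In a system like LSM that relies on background reorganization to maintain order, the stronger the system's ordering requirement, the more severe the write amplification. PebblesDB's solution to write amplification is to weaken the global ordering constraint: it splits each level into segments, each called a guard. Guards have no overlapping keys and are kept in order within each level, but the data inside a guard may be unordered.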

Fragmented Log-Structured Merge Tree (FLSM)

FLSM counters multiple rewrites of sstables by fragmenting sstables into smaller units. Instead of rewriting the sstable, FLSM’s compaction simply appends a new sstable fragment to the next level. Doing so ensures that data is written exactly once in most levels.

Note

Here, I'm guessing "append" really means a pointer change from one node to another. Thus, the only time IO is performed is when the data is first written to an sstable at a level.

Guards

In the classical LSM, each level contains sstables with disjoint key ranges (i.e., each key will be present in exactly one sstable). Maintaining this invariant is the root cause of write amplification, as it forces data to be rewritten in the same level.
The FLSM data structure discards this invariant: each level can contain multiple sstables with overlapping key ranges, so that a key may be present in multiple sstables. To quickly find keys in each level, FLSM organizes the sstables into guards (similar to level concept in skip list)
Each level contains multiple guards. Guards divide the key space (for that level) into disjoint units. Each guard $G_i$ has an associated key $K_i$, chosen from among keys inserted into the FLSM. Each level in the FLSM contains more guards than the level above it; the guards get progressively more fine-grained as the data gets pushed deeper and deeper into the FLSM. As in a skip-list, if a key is a guard at a given level $i$, it will be a guard for all levels $> i$.
Each guard has a set of associated sstables. Each sstable is sorted. If guard $G_i$ is associated with key $K_i$ and guard $G_{i+1}$ with $K_{i+1}$, an sstable with keys in the range $[K_i,K_{i+1})$ will be attached to $G_i$. Sstables with keys smaller than the first guard are stored in a special sentinel guard in each level. The last guard $G_n$ in the level stores all sstables with keys $\ge K_n$. Guards within a level never have overlapping key ranges. Thus, to find a key in a given level, only one guard will have to be examined.
In FLSM compaction, the sstables of a given guard are (merge) sorted and then fragmented (partitioned), so that each child guard receives a new sstable that fits into the key range of that child guard in the next level.

Note

Some observations about Figure 3:

A put() results in keys being added to the in-memory memtable (not shown). Eventually, the memtable becomes full, and is written as an sstable to Level 0. Level 0 does not have guards, and collects together recently written sstables.
Each level has a sentinel guard that is responsible for sstables with keys smaller than the first guard.
Data inside an FLSM level is partially sorted: guards do not have overlapping key ranges, but the sstables attached to each guard can have overlapping key ranges (in Level 3, under guard 5, [5,35,40] and [7] overlap).

Selecting Guards

In the worst case, if one guard contains all sstables, reading and searching such a large guard (and all its constituent sstables) would cause an unacceptable increase in latency for reads and range queries.
Guards are not selected statically; they are selected probabilistically from inserted keys, preventing skew.
Current selection policy:
- if the guard probability is 1/10, one in every 10 inserted keys will be randomly selected to be a guard.
- The guard probability is designed to be lowest at Level 1 (which has the fewest guards), and it increases with the level number (as higher levels have more guards)
if a key $K$ is selected as a guard in level i, it becomes a guard for all higher levels $i + 1, i + 2$ etc. The guards in level $i + 1$ are a strict superset of the guards in level $i$ (in Figure 3, key 5 is chosen as a guard for Level 1; therefore it is also a guard for levels 2 and 3.)
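
One way to get both properties (a probability that grows with the level number, and guards that carry over to every deeper level) is to decide guard status from a hash of the key. The bit counts and helper names below are hypothetical and only meant to illustrate the idea, not PebblesDB's actual parameters.

```python
import hashlib

# Hypothetical setting: number of low-order hash bits that must be zero for a
# key to be a guard at each level. Fewer required bits at deeper levels means
# a higher guard probability there (1/64 at Level 1 up to 1/4 at Level 5).
REQUIRED_ZERO_BITS = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2}


def key_hash(key) -> int:
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def is_guard(key, level: int) -> bool:
    bits = REQUIRED_ZERO_BITS[level]
    return key_hash(key) & ((1 << bits) - 1) == 0


def guards_for_level(keys, level: int):
    return sorted(k for k in keys if is_guard(k, level))


keys = range(5000)
for level in range(1, 6):
    print(level, len(guards_for_level(keys, level)))
```

A key whose hash ends in 6 zero bits also ends in 5, 4, 3, and 2 zero bits, so a guard at level i is automatically a guard at every deeper level.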

Inserting and Deleting Guards

Guards are inserted asynchronously into FLSM
When guards are selected, they are added to an in-memory set termed the uncommitted guards. Sstables are not partitioned on storage based on (as of yet) uncommitted guards; as a result, FLSM reads are performed as if these guards did not exist. At the next compaction cycle, sstables are partitioned and compacted based on both old guards and uncommitted guards; any sstable that needs to be split due to an uncommitted guard is compacted to the next level. At the end of compaction, the uncommitted guards are persisted on storage and added to the full set of guards. Future reads will be performed based on the full set of guards.
guard deletion was not required
Guard deletion is also performed asynchronously similar to guard insertion. Deleted guards are added to an in-memory set. At the next compaction cycle, sstables are re-arranged to account for the deleted guards. Deleting a guard G at level i is done lazily at compaction time. During compaction, guard G is deleted and sstables belonging to guard G will be partitioned and appended to either the neighboring guards in the same level i or child guards in level i + 1. Compaction from level i to i + 1 proceeds as normal (since G is still a guard in level i + 1). At the end of compaction, FLSM persists metadata indicating G has been deleted at level i. If required, the guard is deleted in other levels in a similar manner. Note that if a guard is deleted at level i, it should be deleted at all levels < i; FLSM can choose whether to delete the guard at higher levels > i.

FLSM Operations

get() operation first checks the in-memory memtable. If the key is not found, the search continues level by level, starting with level 0. During the search, if the key is found, it is returned immediately. To check if a key is present in a given level, binary search is used to find the single guard that could contain the key. Once the guard is located, its sstables are searched for the key. Thus, in the worst case, a get() requires reading one guard from each level, and all the sstables of each guard.
Range queries require collecting all the keys in the given range. FLSM first identifies the guards at each level that intersect with the given range. Inside each guard, there may be multiple sstables that intersect with the given range; a binary search is performed on each sstable to identify the smallest key overall in the range. Identifying the next smallest key in the range is similar to the merge procedure in merge sort. When the end of the range query interval is reached, the operation is complete, and the result is returned to the user.
put() adds data to an in-memory memtable. When the memtable gets full, it is written as a sorted sstable to Level 0. When each level reaches a certain size, it is compacted into the next level.
Similar to LSM, updating or deleting a key involves inserting the key into the store with an updated sequence number or a deletion flag respectively. The deletion of a key does not result in deletion of the related guard; deleting a guard would involve a significant amount of compaction work. Thus, empty guards are possible.
Compaction:
- The sstables in the guard are first (merge) sorted and then partitioned into new sstables based on the guards of the next level; the new sstables are then attached to the correct guards.
  
  Assume a guard at Level 1 contains keys [1, 20, 45, 101, 245]. If the next level has guards 1, 40, and 200, the sstable will be partitioned into three sstables containing [1, 20], [45, 101], and [245] and attached to guards 1, 40, and 200 respectively.
- New sstables are simply added to the correct guard in the next level.
- Two exceptions to no-rewrite rule:
  - at the highest level (e.g., Level 5) of FLSM, the sstables have to be rewritten during compaction; there is no higher level for the sstables to be partitioned and attached to.
  - for the second-highest level (e.g., Level 4), FLSM will rewrite an sstable into the same level if the alternative is to merge into a large sstable in the highest level.
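The partitioning step of compaction can be sketched in a few lines. The function name and the use of `None` for the sentinel guard are illustrative choices, not part of FLSM itself.

```python
from bisect import bisect_right

def partition_by_guards(sorted_keys, next_level_guards):
    """Split merge-sorted keys into one fragment per guard of the next level.

    next_level_guards must be sorted. Keys smaller than the first guard go
    to the sentinel guard, represented here by None.
    """
    parts = {}
    for key in sorted_keys:
        idx = bisect_right(next_level_guards, key) - 1
        guard = next_level_guards[idx] if idx >= 0 else None
        parts.setdefault(guard, []).append(key)
    return parts

# The example from the notes: guards 1, 40, and 200 in the next level.
fragments = partition_by_guards([1, 20, 45, 101, 245], [1, 40, 200])
# fragments == {1: [1, 20], 40: [45, 101], 200: [245]}
```

Each fragment becomes a new sstable that is simply appended to its guard, so no existing sstable in the next level has to be read or rewritten.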

Limitations

Since get() and range query operations need to examine all sstables within a guard, the latency of these operations is increased in comparison to LSM.

Building PebblesDB over FLSM

Due to this limitation of FLSM, several existing techniques are applied to improve read performance (i.e., get() and range query operations).

Improving Read Performance

Cause: get() in FLSM causes all the sstables of one guard in each level to be examined. In contrast, in LSM, exactly one sstable per level needs to be examined.
Improvement technique:
- Sstable Bloom Filters:
  - A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is present in a given set in constant time
  - A Bloom filter can produce false positives (the filter says a key may be present when it is not), but not false negatives (the key is in the sstable but the filter says no)
  - PebblesDB attaches a Bloom filter to each sstable to efficiently detect if a given key could be present in the sstable.
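To make the idea concrete, here is a toy Bloom filter and a guard lookup that consults it before touching an sstable. This is a sketch under simplifying assumptions (one byte per bit, SHA-256 with a seed prefix as the hash family, dicts as sstables); it is not the filter implementation PebblesDB ships.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)      # one byte per bit, for clarity

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

def get_from_guard(sstables, key):
    """sstables: list of (bloom_filter, table_dict) pairs, newest first."""
    for bloom, table in sstables:
        if not bloom.might_contain(key):
            continue                         # skip the sstable read entirely
        if key in table:                     # may still miss: false positive
            return table[key]
    return None
```

With a filter per sstable, a get() usually reads only the sstable that actually holds the key, even though the guard may contain several.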

Improving Range Query Performance

Cause: range queries in FLSM require examining all the sstables of a guard. Since LSM stores examine only one sstable per level, FLSM stores have significant overhead for range queries.
Improvement technique:
- Seek-Based Compaction
- Parallel Seek

Remarks

Log-structured Merge Tree (LSM) is a data structure that provides good write performance by leveraging a log organization. In detail, a write to an LSM-based storage system first goes to an in-memory log called the MemTable, by appending the corresponding key-value pair at the end of the log. Writing by appending is the key differentiator from B-tree-based storage systems: we do sequential writes instead of random updates, which can reduce write amplification and improve write performance.

To improve read performance and scale to large datasets, an LSM-based storage system organizes its logs into levels. The first layer (numbered 0), which holds the MemTable, is stored in memory. All the logs (sstables) in the remaining levels are written to persistent storage devices. Acceptable read performance is maintained by keeping the number of logs low. This is done by compaction, which merges the logs in the upper level into the lower level (e.g., merging sstables from level 1 into level 2). However, the merge has a cost: we need to read the logs from both the upper level and the lower level into memory, merge them, and write the result back. This mechanism naturally introduces write amplification, which decreases write performance.

To combat write amplification, Fragmented Log-Structured Merge Trees (FLSM) and its implementation PebblesDB are proposed. The key idea is shown in the picture above. FLSM uses guards; a guard can be thought of as the key for a collection of logs. The keys in the collection of logs have to be in the range between the current guard's key and the previous guard's key. During compaction, the log in the upper level is split across the guard keys in the lower level. In the example shown above, keys 10 and 80 from level 1 are split so that 10 goes to guard 60 and 80 goes to guard 100 in level 2. As one can see, this split-and-append avoids reading the data from the lower level (e.g., level 2), which reduces write amplification and improves write performance. Read performance remains acceptable thanks to the guards: we can first search for the guard key and then search the logs within the guard to locate the value for the given key.

Reference

"BLEU: a Method for Automatic Evaluation of Machine Translation"

2018-03-28T01:24:00+08:00

Introduction
The Baseline BLEU Metric
- Modified n-gram precision
Remarks on BLEU
Reference

Introduction

How does one measure translation performance? The central idea: the closer a machine translation is to a professional human translation, the better it is. This requires:
- a numerical “translation closeness” metric
- a corpus of good quality human reference translations
Fashion the closeness metric after word error rate metric used by the speech recognition community
The main idea is to use a weighted average of variable length phrase matches against the reference translations.

The Baseline BLEU Metric

Example 1:
- Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
- Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
- Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
- Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
- Reference 3: It is the practical guide for the army always to heed the directions of the party.
Candidate 1 is better than Candidate 2 because Candidate 1 shares many words and phrases with these three reference translations, while Candidate 2 does not.
- Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization). In contrast, Candidate 2 exhibits far fewer matches, and their extent is less.
Rank Candidate 1 higher than Candidate 2 simply by comparing n-gram matches between each candidate translation and the reference translations.
The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

Modified n-gram precision

We compute unigram matches to illustrate the idea.

Example 2:
- Candidate: the the the the the the the.
- Reference 1: The cat is on the mat.
- Reference 2: There is a cat on the mat.
Precision (# candidate translation words (unigrams) which occur in any reference translation / the total number of words in the candidate translation) doesn't work: MT systems can overgenerate “reasonable” words, resulting in improbable, but high-precision, translations like Example 2.
Intuition: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision.
How to compute modified unigram precision:
1. Count the maximum number of times a word occurs in any single reference translation.
2. Clip the total count of each candidate word by its maximum reference count.
3. Add these clipped counts (i.e., $\text{Count}_{clip}$) up, and divide by the total (unclipped) number of candidate words.

Note

$\text{Count}_{clip} = \min(\text{Count}, \text{Max_Ref_Count})$. In other words, one truncates each word’s count, if necessary, to not exceed the largest count observed in any single reference for that word.

Examples on modified unigram precision calculation:
- Example 2: modified unigram precision is $2/7$, even though its standard unigram precision is $7/7$.
- Example 1: Candidate 1 achieves a modified unigram precision of $17/18$; whereas Candidate 2 achieves a modified unigram precision of $8/14$
  
  Let's calculate $17/18$. The counts are shown below. The word "the" appears 3 times in Candidate 1, and its Max_Ref_Count of 4 comes from max(# "the" in ref1, # "the" in ref2, # "the" in ref3) = max(1,4,4).

| word          | it | is | a | guide | to | action | which | ensures | that | the | military | always | obeys | commands | of | party | SUM |
|---------------|----|----|---|-------|----|--------|-------|---------|------|-----|----------|--------|-------|----------|----|-------|-----|
| count         | 1  | 1  | 1 | 1     | 1  | 1      | 1     | 1       | 1    | 3   | 1        | 1      | 1     | 1        | 1  | 1     | 18  |
| Max_Ref_Count | 1  | 1  | 1 | 1     | 1  | 1      | 1     | 1       | 2    | 4   | 1        | 1      | 0     | 1        | 1  | 1     |     |
| Count_clip    | 1  | 1  | 1 | 1     | 1  | 1      | 1     | 1       | 1    | 3   | 1        | 1      | 0     | 1        | 1  | 1     | 17  |

How to compute Modified n-gram precision (computed similarly for any n):
1. all candidate n-gram counts and their corresponding maximum reference counts are collected.
2. The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams.
Examples on modified bigram precision calculation:
- Example 1: Candidate 1 achieves a modified bigram precision of $10/17$, whereas the lower quality Candidate 2 achieves a modified bigram precision of $1/13$.
- Example 2: the (implausible) candidate achieves a modified bigram precision of 0.
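The clipping procedure is short enough to write out. The sketch below assumes a simple tokenizer (lowercase, letters only), which is an illustrative choice and not something the paper prescribes; it returns the numerator and denominator separately so the fractions can be compared with the examples above.

```python
import re
from collections import Counter

def ngrams(sentence, n):
    """Count the n-grams of a sentence (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Return (sum of clipped counts, total candidate n-grams)."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)    # Max_Ref_Count
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

refs = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
print(modified_precision(cand1, refs, 1))   # (17, 18)
print(modified_precision(cand1, refs, 2))   # (10, 17)
```

Running the same function on Example 2 gives `(2, 7)` for unigrams and `(0, 6)` for bigrams, matching the numbers quoted above.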
Modified n-gram precision on corpus
- We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, $p_n$, for the entire test corpus.
We want to combine the corpus-level n-gram precisions (i.e., $n=1,2,3,4$) into a single-number metric. The modified n-gram precision decays roughly exponentially with n: the modified unigram precision is much larger than the modified bigram precision, which in turn is much bigger than the modified trigram precision. To account for this decay, BLEU uses the geometric mean of the modified n-gram precisions.
Modified n-gram precision alone fails to enforce the proper translation length. Candidate translations longer than their references are already penalized by the modified n-gram precision measure. However, candidate translations shorter than their references are not penalized.
- Example 3:
  - Candidate: of the
  - Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
  - Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  - Reference 3: It is the practical guide for the army always to heed the directions of the party.
- Because this candidate is so short compared to the proper length, one expects to find inflated precisions: the modified unigram precision is 2/2, and the modified bigram precision is 1/1.
Traditional recall is not a good measure to enforce proper length translation:
- Example 4:
  - Candidate 1: I always invariably perpetually do.
  - Candidate 2: I always do.
  - Reference 1: I always do.
  - Reference 2: I invariably do.
  - Reference 3: I perpetually do.
- The first candidate recalls more words from the references, but is obviously a poorer translation than the second candidate
Sentence brevity penalty
- We wish to make the brevity penalty 1.0 when the candidate’s length is the same as any reference translation’s length.
- For example, if there are three references with lengths 12, 15, and 17 words and the candidate translation is a terse 12 words, we want the brevity penalty to be 1.
- We call the closest reference sentence length the “best match length.”
- We compute the brevity penalty over the entire corpus to allow some freedom at the sentence level.
Compute sentence brevity penalty
1. Compute the test corpus’ effective reference length, $r$, by summing the best match lengths for each candidate sentence in the corpus.
2. We choose the brevity penalty to be a decaying exponential in $r/c$, where $c$ is the total length of the candidate translation corpus.
Put everything together to calculate BLEU

We take the geometric mean of the test corpus’ modified precision scores and then multiply the result by an exponential brevity penalty factor. We first compute the geometric average of the modified n-gram precisions, $p_n$, using n-grams up to length $N$ and positive weights $w_n$ summing to one. Next, let $c$ be the length of the candidate translation and $r$ be the effective reference corpus length. We compute the brevity penalty BP,

$$
\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}
$$

Then,

$$
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
$$
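As a sketch of how the pieces combine, the functions below use the standard definition from the BLEU paper: BP is 1 when $c > r$ and $e^{1 - r/c}$ otherwise, and the score is BP times the weighted geometric mean of the $p_n$. The function names, the uniform default weights, and the tie-breaking rule in `best_match_length` are assumptions made for illustration.

```python
import math

def best_match_length(candidate_len, reference_lens):
    """Reference length closest to the candidate length.

    Ties are broken toward the shorter reference (an assumption).
    """
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))

def brevity_penalty(c, r):
    """c: total candidate corpus length (> 0), r: effective reference length."""
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(precisions, c, r, weights=None):
    """precisions: corpus-level modified precisions p_1..p_N as floats."""
    n = len(precisions)
    weights = weights or [1 / n] * n         # uniform weights by default
    if min(precisions) == 0:
        return 0.0                           # log(0) is undefined; score is 0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_mean)
```

For the 12-word candidate with references of 12, 15, and 17 words, `best_match_length(12, [12, 15, 17])` picks 12 and the penalty is 1, as the notes require.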

Remarks on BLEU

This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency:

A translation using the same words (1-grams) as in the references tends to satisfy adequacy. The longer n-gram matches account for fluency.
BLEU only needs to match human judgment when averaged over a test corpus; scores on individual sentences will often vary from human judgments. (For example, a system which produces the fluent phrase “East Asian economy” is penalized heavily on the longer n-gram precisions if all the references happen to read “economy of East Asia.”)
The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation.
The more reference translations per sentence there are, the higher the score is.
We may use a big test corpus with a single reference translation, provided that the translations are not all from the same translator.
BLEU has shown good performance for corpus-level comparisons over which a high number of n-gram matches exist. However, at a sentence-level the n-gram matches for higher n rarely occur. As a result, BLEU performs poorly when comparing individual sentences.
Mooney's slides on MT have a nice illustration of the modified bigram precision calculation.

Reference

"Existential Consistency: Measuring and Understanding Consistency at Facebook"

2018-03-08T20:24:00+08:00

Overview
Facebook’s Replicated Storage
Consistency Models
Reference

Overview

Facebook Study
- Analyzed a small portion of the Facebook traffic to the TAO graph system
- Analyzed what consistency models hold
- Analyzed when readers get anomalous results

Facebook’s Replicated Storage

Facebook Data Model
- Graph Data Model
- Vertex: unique ID + data
- Edges: between two vertexes, contains data, indexed by source vertex
Database
- Horizontally (i.e., row) sharded, geo-replicated database
- Each region has a full copy
- Each shard has a master which asynchronously updates the other regions
Two-Level Cache
- Root cache sits in front of the database
- Leaf caches sit in front of the root caches
- Write-through caches
- Reads:
  - progress down the stack in their local region on cache misses from leaf cache to root cache, and then to local database. The cache-hit ratios are very high, so reads are typically served by the leaf caches.
- Writes:
  - They are synchronously routed through their leaf cache (1) to their local root cache (2) to the root-master cache (3), and to the master database shard (4) and back (5–8).
  - Each of those caches applies the write when it forwards the database’s acknowledgment back towards the client.
  - The root caches in the master ($6'$) and originating regions ($7'$) both asynchronously invalidate the other leaf caches in their region
  - The database master asynchronously replicates the write to the slave regions ($5'$). When a slave database in a region that did not originate the write receives it, the database asynchronously invalidates its root cache ($6''$) that in turn asynchronously invalidates all its leaf caches ($7''$).
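A heavily simplified, single-region sketch of this stack may help: reads fall through on a miss, writes are forwarded synchronously and applied on the way back, and other leaf caches stay stale until an (asynchronous) invalidation reaches them. The class names are hypothetical and none of this is TAO's actual code.

```python
class Database:
    def __init__(self):
        self.rows = {}

    def read(self, key):
        return self.rows.get(key)

    def write(self, key, value):
        self.rows[key] = value

class Cache:
    """Write-through cache in front of another cache or the database."""
    def __init__(self, backing):
        self.backing = backing
        self.data = {}

    def read(self, key):
        if key not in self.data:             # miss: go one level down
            self.data[key] = self.backing.read(key)
        return self.data[key]

    def write(self, key, value):
        self.backing.write(key, value)       # forward synchronously first
        self.data[key] = value               # apply when the ack comes back

    def invalidate(self, key):
        self.data.pop(key, None)             # next read refetches

db = Database()
root = Cache(db)
leaf_a, leaf_b = Cache(root), Cache(root)
db.write("x", 1)
leaf_b.read("x")                             # leaf_b now caches x = 1
leaf_a.write("x", 2)                         # leaf_a, root, and db see 2
stale = leaf_b.read("x")                     # still 1 until invalidated
leaf_b.invalidate("x")
fresh = leaf_b.read("x")                     # refetched from root: 2
```

The window between `leaf_a.write` and `leaf_b.invalidate` is where a session that spans several leaf caches can observe an anomaly.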

Consistency Models

Local Consistency Models
- Local: A consistency model, C, is local if the system as a whole provides C whenever each individual object provides C
Linearizability
- Linearizability is the strongest consistency model for non-transactional systems.
- Intuitively, linearizability ensures that each operation appears to take effect instantaneously at some point between when the client invokes the operation and it receives the response.
- More formally, linearizability dictates that there exists a total order over all operations in the system, and that this order is consistent with the real-time order of operations.
  - If operation A completes before operation B begins, then A will be ordered before B.
- Linearizability avoids anomalies by ensuring that writes take effect in some sequential order consistent with real time, and that reads always see the results of the most recently completed write.
Per-Object Sequential Consistency
- Per-object sequential consistency requires that there exists a legal, total order over all requests to each object that is consistent with each client’s order.
- Intuitively, there is one logical version of each object that progresses forward in time.
- Clients always see a newer version of an object as they interact with it.
- Different clients, however, may see different versions of the object.
  - One client may be on version 100 of an object, while another client may see version 105.
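One way to read "clients always see a newer version" is as a check over a trace: for every (client, object) pair, the observed versions must never go backwards. The sketch below tests only this necessary condition, not the full definition, and the trace format is a hypothetical one chosen for the example.

```python
def sees_version_go_backwards(observations):
    """observations: iterable of (client, object_id, version) tuples, listed
    in the order each client observed them.

    Returns True if some client saw an older version of an object after a
    newer one, which per-object sequential consistency forbids.
    """
    last_seen = {}
    for client, obj, version in observations:
        key = (client, obj)
        if version < last_seen.get(key, version):
            return True
        last_seen[key] = version
    return False
```

Two clients sitting at versions 100 and 105 of the same object pass this check; a single client going from 105 back to 100 does not.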
Read-After-Write Consistency
- When a write request has committed, all following read requests to that cache always reflect this write or later writes.
- Region read-after-write consistency applies the constraint for reads in the same region as a write. Global read-after-write consistency applies the constraint for all reads.
Eventual Consistency
- Eventual consistency requires that replicas “eventually” agree on a value of an object, i.e., when they all have received the same set of writes, they will have the same value.
- Eventual consistency allows replicas to answer reads immediately using their current version of the data, while writes are asynchronously propagated in the background. While writes are propagating between replicas, different replicas may return different results for reads.
- A client can update any replica of an object and all updates to an object will eventually be applied, but potentially in different orders at different replicas.
Facebook’s Consistency
- per-object sequential consistency (per-cache) + read-after-write (per-cache) + eventual consistency (across caches)
- User sessions are typically handled exclusively by one leaf cache, and thus we expect most of them to receive per-object sequential and read-after-write consistency.
- User sessions spread across multiple leaf caches receive eventual consistency.

Reference

https://www.cs.utexas.edu/~vijay/cs380D-s18/feb6-fb.pdf
https://www.allthingsdistributed.com/2008/12/eventually_consistent.html

"PNUTS: Yahoo!’s Hosted Data Serving Platform"

2018-03-08T20:24:00+08:00

Introduction
Data and Query Model
Consistency Model
System Architecture
Further Readings
Reference

Introduction

PNUTS is a massive-scale, hosted database system that supports Yahoo!’s web applications. Its focus is on data serving for web applications, rather than complex queries, e.g., offline analysis of web crawls.
PNUTS Goals
- Scalability
- Low latency, predictable latency
- Must handle attacks: flash crowds, denial of service
- High Availability
- Eventual Consistency
Design Purpose
- Our system is designed primarily for online serving workloads that consist mostly of queries that read and write single records or small groups of records
Application scenarios
- User database
- Social Applications
- Metadata for file systems
- Listings Management
- Session Data
PNUTS Overview
- Data Model and Features
  - presents a simple relational model to users, and supports single-table scans with predicates
  - Features:
    - scatter-gather operations
    - a facility for asynchronous notification of clients
    - a facility for bulk loading.
- Fault Tolerance
  - employs redundancy at multiple levels (data, metadata, serving components, etc.) and leverages our consistency model to support highly-available reads and writes even after a failure or partition
- Pub-Sub Message System
  - We chose pub/sub over other asynchronous protocols (such as gossip) because it can be optimized for geographically distant replicas and because replicas do not need to know the location of other replicas.
- Record-level Mastering
  - To meet response-time goals, PNUTS cannot use write-all replication protocols that are employed by systems deployed in localized clusters (e.g. GFS)
  - Not every read of the data necessarily needs to see the most current version. We have therefore chosen to make all high latency operations asynchronous, and to support record-level mastering.
  - Asynchrony allows us to satisfy latency budget (50-100 ms) despite geographic distribution, while record-level mastering allows most requests, including writes, to be satisfied locally.
- Hosting
  - PNUTS is a hosted, centrally-managed database service shared by multiple applications
  - Significantly reduces application development time
  - Consolidating multiple applications onto a single service allows us to amortize operations costs over multiple applications, and apply the same best practices to the data management of many different applications
  - having a shared service allows us to keep resources (servers, disks, etc.) in reserve and quickly assign them to applications experiencing a sudden upsurge in popularity

Data and Query Model

Data is organized into tables of records with attributes
Each row has a primary key
Rows can have binary blobs
The query language of PNUTS supports selection and projection from a single table
- single-table queries in fact provide very flexible access compared to distributed hash or ordered data stores, and present opportunities for future optimization by the system
Queries:
- Point access: A user may update her own record, resulting in point access
- Range access: Another user may scan a set of friends in order by name, resulting in range access
PNUTS does not enforce constraints
PNUTS does not support complex ad hoc queries (joins, group-by, etc.)

Consistency Model

per-record timeline consistency:
- all replicas of a given record apply all updates to the record in the same order
- same as per-object sequential consistency
- implementation:
  - One of the replicas is designated as the master, independently for each record, and all updates to that record are forwarded to the master.
  - the replica receiving the majority of write requests for a particular record becomes the master for that record
API Calls
- Different levels of consistency guarantee
- Read-any: Returns a possibly stale version of the record
- Read-critical(required_version): Returns a version of the record that is strictly newer than, or the same as the $required\_version$
- Read-latest: Returns the latest copy of the record that reflects all writes that have succeeded
- Write: This call gives the same ACID guarantees as a transaction with a single write operation in it (useful for blind writes, e.g., a user updating his status on his profile)
- Test-and-set-write(required_version): This call performs the requested write to the record if and only if the present version of the record is the same as required_version
  - The test-and-set write ensures that two concurrent read-modify-write transactions (e.g., two increments of the same counter) are properly serialized
  - allows us to implement single-row transactions without any locks
PNUTS can provide serializability on a per-record basis
- no guarantees as to consistency for multi-record transactions
- if an application reads or writes the same record multiple times in the same “transaction,” the application must use record versions to validate its own reads and writes to ensure serializability for the “transaction.”
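The versioned calls can be illustrated with a small in-memory record. This is a sketch of the idea only: the `Record` class, the exception, and the retry loop are hypothetical and do not reflect the real PNUTS client API.

```python
class VersionMismatch(Exception):
    pass

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 1

    def read_critical(self, required_version):
        # Succeeds only if this copy is at least as new as required_version.
        if self.version < required_version:
            raise VersionMismatch("replica is too stale")
        return self.value, self.version

    def test_and_set_write(self, required_version, new_value):
        # Apply the write only if nobody has written since we read.
        if self.version != required_version:
            raise VersionMismatch("record changed since it was read")
        self.value = new_value
        self.version += 1
        return self.version

def increment(record):
    """Lock-free read-modify-write: retry until the versions line up."""
    while True:
        seen_value, seen_version = record.value, record.version
        try:
            return record.test_and_set_write(seen_version, seen_value + 1)
        except VersionMismatch:
            continue                         # someone else won; re-read
```

If two clients read version 1 and both try to write, only the first `test_and_set_write` succeeds; the second raises, re-reads, and retries, so no increment is lost.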

System Architecture

The system is divided into regions, where each region contains a full complement of system components and a complete copy of each table
- Regions are typically, but not necessarily, geographically distributed
Our system does not have a traditional database log or archive data
- use of a pub/sub mechanism for both reliability and replication
- we rely on the guaranteed delivery pub/sub mechanism to act as our redo log, replaying updates that are lost before being applied to disk due to failure

Data Storage and Retrieval

How the components within a region provide data storage and retrieval.

Data tables are horizontally partitioned into groups of records called tablets
- Tablets are scattered across many servers
- each tablet is stored on a single server within a region
- Each server has 100s-1000s of tablets
- Tablet size: 100s of MB or a few GBs
Three components in architecture are primarily responsible for managing and providing access to data tablets: the storage unit, the router, and the tablet controller.
- Storage Unit: get(), scan(), set()
- Updates are committed by first writing them to the message broker
- Router: identifies which tablet and server contain data
  - implemented using interval mapping
  - Ordered data: key range sharded into tablets
  - Unordered data: do the same with hash(key)
  - Mapping information stored in memory
  - Contains only a cached copy of the interval mapping (True source of mapping info: tablet controller)
- The tablet controller determines
  - when it is time to move a tablet between storage units for load balancing or recovery
  - when a large tablet must be split.
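The router's interval mapping amounts to a binary search over tablet boundaries. The sketch below assumes each tablet is identified by its inclusive lower bound, with the first bound being the minimum possible key; the `Router` class, the storage unit names, and the hash space size are made up for the example.

```python
import hashlib
from bisect import bisect_right

class Router:
    def __init__(self, boundaries, servers):
        """boundaries[i] is the inclusive lower bound of tablet i (sorted);
        servers[i] is the storage unit currently holding tablet i."""
        if len(boundaries) != len(servers):
            raise ValueError("one server entry per tablet is required")
        self.boundaries = boundaries
        self.servers = servers

    def lookup(self, key):
        tablet = bisect_right(self.boundaries, key) - 1
        return tablet, self.servers[tablet]

def hashed_key(key, space=2**16):
    """For unordered tables: route on hash(key) instead of the key itself."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % space

ordered = Router(["", "g", "p"], ["su1", "su2", "su1"])
print(ordered.lookup("grape"))               # (1, 'su2')

unordered = Router([0, 2**15], ["su1", "su2"])
tablet, server = unordered.lookup(hashed_key("user42"))
```

Because the router only holds a cached copy of this mapping, a stale entry would send the request to the wrong storage unit, at which point the mapping has to be refreshed from the tablet controller.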

Replication and Consistency

We use the Yahoo! message broker, a publish/subscribe system developed at Yahoo!, both as our replacement for a redo log and as our replication mechanism.

YMB

Received messages are logged and replicated
- Data updates are considered “committed” when they have been published to YMB.
- At some point after being committed, the update will be asynchronously propagated to different regions and applied to their replicas.
When update has been applied to all replicas, log is pruned
YMB servers are present in different regions
Cross-region traffic is limited to YMB
Messages are ordered within a YMB region
Across regions, different ordering is possible

Consistency via YMB and mastership

Per-record timeline consistency is provided by designating one copy of a record as the master, and directing all updates to the master copy.
In this record-level mastering mechanism, mastership is assigned on a record-by-record basis, and different records in the same table can be mastered in different clusters.
Update considered “committed” once YMB acks it
A committed update may not be visible to other replicas
Master replica for a given record is stored inside that record
In order to enforce primary key constraints, we must send inserts of records with the same primary key to the same storage unit; this storage unit will arbitrate and decide which insert came first and reject the others. Thus, we have to designate one copy of each tablet as the tablet master, and send all inserts into a given tablet to the tablet master
Tablet master can be different from record master
Tablet master serializes updates to record
Record master is the “true” copy of the data
- Update is considered “committed” once record master gets it

Recovery

the tablet controller requests a copy from a particular remote replica (the “source tablet”)
a “checkpoint message” is published to YMB, to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet
the source tablet is copied to the destination region

Other Database System Functionality

Query Processing

Scatter-gather engine is used
Server has the engine, not the client
- Done to reduce network connections to the server
- Allows optimization over the whole scatter-gather call
Range queries are broken up
Clients keep a continuation object to continue the range query

Notifications

User can subscribe to notifications
Built on top of pub/sub architecture
Accomplished by talking to the YMB
Each tablet has a topic that users subscribe to
Whenever tablet is updated or split, notifications can be sent out

Reference

https://www.cs.utexas.edu/~vijay/cs380D-s18/feb8-pnuts-voting.pdf

Cache, Lease, Consistency, Invalidation

2018-03-07T20:24:00+08:00

Concepts
- Cache
- Consistency model
Reference

Concepts

Cache

Cache is built on client side
Write-through:
- Writes go to the server
- No modified caches
Write-back:
- Writes go to cache
- Dirty cache written to server when necessary
Invalidations:
- Track where data is cached
- When doing a write, invalidate all (other) locations
- Data can live in multiple caches for reading
Write-through invalidations:
- Track all reading caches
- On a write:
  - Server sends invalidations to all caches
  - Each cache invalidates, responds
  - Server waits for all invalidations, then does the update
  - Server returns
- Reads can proceed:
  - If there is a cached copy and if no write is waiting at the server
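The write-through invalidation protocol can be sketched with synchronous method calls standing in for network messages. All names here are illustrative, and the sketch ignores concurrency: a real server would also hold back reads while a write is waiting.

```python
class Server:
    def __init__(self):
        self.data = {}
        self.readers = {}                    # key -> caches holding a copy

    def read(self, key, cache):
        self.readers.setdefault(key, set()).add(cache)   # track the reader
        return self.data.get(key)

    def write(self, key, value):
        # Invalidate every cached copy and wait for each acknowledgment
        # (the synchronous call returning plays the role of the response).
        for cache in self.readers.pop(key, set()):
            cache.invalidate(key)
        self.data[key] = value               # only then apply the update

class ClientCache:
    def __init__(self, server):
        self.server = server
        self.data = {}

    def read(self, key):
        if key not in self.data:             # no cached copy: ask the server
            self.data[key] = self.server.read(key, self)
        return self.data[key]

    def write(self, key, value):
        self.server.write(key, value)        # write-through: no dirty cache

    def invalidate(self, key):
        self.data.pop(key, None)
```

After one client writes a key, every other client's next read misses locally and fetches the new value, so no cache can serve data older than the last completed write.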
Write-back invalidations:
- Track all reading and writing caches
- On a write:
  - Server sends invalidations to all caches
  - Each cache invalidates, responds (possibly with updated data)
  - Wait for all invalidations
  - Return
- Reads can proceed when there is a local copy
- Order requests carefully at server
  - Enforce processor order, avoid deadlock
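  - Write-through invalidation does not need to order requests at the server: all writes are applied directly at the server, so it can simply apply them in the order the requests arrive. Write-back invalidation does need ordering at the server: writes land in the caches, so the order of the write requests is lost, and when the caches return their updated data the server has to re-establish the order.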
Leases:
- Permission to serve data for some time period
- Wait until lease expires before applying updates
- Must account for clock skew!
Strong Leases:
- The term "Lease" referred in Jim Gray's paper
- Read request: key, TTL (time to live)
  - When the server returns, the server won't accept writes to the key for TTL seconds after the reply is sent
  - Client invalidates its cache after TTL seconds from when the request was sent
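A minimal server-side sketch of a strong read lease follows. The clock is injected so the behavior can be shown deterministically; the class and method names are hypothetical, and the sketch deliberately leaves out starvation handling (new reads keep extending the lease).

```python
class LeaseServer:
    def __init__(self, clock):
        self.clock = clock                   # callable: current time in seconds
        self.data = {}
        self.lease_expiry = {}               # key -> time writes are blocked until
        self.pending = {}                    # key -> queued writes, in order

    def read(self, key, ttl):
        now = self.clock()
        # Promise: no write to this key is applied before now + ttl.
        self.lease_expiry[key] = max(self.lease_expiry.get(key, now), now + ttl)
        return self.data.get(key)

    def write(self, key, value):
        if self.clock() < self.lease_expiry.get(key, 0):
            self.pending.setdefault(key, []).append(value)   # queue it
            return False
        self.data[key] = value
        return True

    def apply_expired(self):
        """Apply queued writes whose leases have run out."""
        now = self.clock()
        for key in list(self.pending):
            if now >= self.lease_expiry.get(key, 0):
                for value in self.pending.pop(key):
                    self.data[key] = value
```

The client counts its TTL from when it sent the request, while the server counts from when it replied, so the client always drops its copy no later than the server's lease ends; any remaining margin has to absorb clock skew.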
Write-through strong leases:
- Server queues writes until all leases expire (after all leases expire, the cache got invalidated and server then can write)
- Avoid starvation: don’t accept new reads
Write-back strong leases:
- Cache can get a write lease (exclusive)
- Server queues read requests until lease expires
Strong leases vs. Invalidations
- Strong leases potentially slower
- What if a cache fails when it has a key? Strong leases provide better availability
- Can combine techniques:
  - Short lease on entire cache, periodically revalidated
  - All keys invalidated on failure (after lease)
Weak leases
- Cache values until lease expires
- Allow writes, other reads simultaneously
- Advantages:
  - Stateless at server (don’t care who is caching)
  - Reads, writes always processed immediately
- Disadvantages:
  - Consistency model
  - Overhead of revalidations
  - Synchronized revalidations

Note

The key idea is that a cache can become stale, so we need a policy for validating a cached data item before using it. Thus, we have invalidations and leases to answer the question: if we cache data, how do we make sure it reflects the writes of other nodes while maintaining performance? The answer determines how we implement consistency. For example, to ensure sequential consistency, we need to make sure all operations on a single key are serialized (as if all the operations go to a single copy), which is done with the help of invalidations / leases.

Consistency model

Anomaly: some sequence of operations (reads and writes) that “shouldn’t” be allowed
Consistency model: which anomalies are possible
- Linearizability (Strict Consistency, Strong Consistency):
  - matches the ideal system
  - Talks about single operations on single objects
  - Literally means: “did the operations happen in a straight line (one after the other)?”
  - Reads always reflect latest write (i.e. Once a read returns value V1, all reads have to return V1 or later values)
  - Concurrent operations can be executed in any order
- Serializability (Sequential Consistency):
  - Execution always equivalent to some interleaving
  - Each node’s operations done in order
  - Guarantees execution of a set of operations (usually each a transaction) is equivalent to some serial execution order
  - Given operations A1, and A2 serializability only demands that the execution order is A1 followed by A2 or A2 followed by A1
  - Serializability makes it seem as if there are no concurrent operations, everything happened one after another
  - Relaxation of linearizability
  - Instead of conforming to a real-time partial order, we use a client-observed partial order
Note

“The result of any execution is the same as if the operations of all the processes were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program” (Lamport, 1979). That is, there is a single order over the operations of all processes, and within that order each process's operations appear in the order its program issued them.
- Strict Serializability:
  - Combines linearizability and serializability
  - Transactions need to happen in real-time order
  - T1 and T2 are executing concurrently
    - T1 writes object A, and later T2 reads object A
    - Strict Serializability: T1 before T2
    - Serializability: T2 before T1 also valid (In this case, T2 will read old value of object A)
- Weaker models (could have anomalies):
  - Read Your Writes + Eventual Consistency (anomalies are “temporary”)
    - Facebook model, approximately
    - Clients will always see their own writes
    - Clients will eventually see everyone’s writes
    - Eventually the order will be consistent
  - Causal consistency
    - Causal order (Lamport happens-before) observed everywhere
    - Concurrent events can have arbitrary and inconsistent order
  - Transactional models (e.g. Snapshot reads)
    - Some other consistency model + atomicity of transactions

Note

Another angle to look at consistency model is: a contract between the data store and its clients that specifies the results that clients can expect to obtain when accessing the data store.
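
To make the "contract" view concrete, here is a small brute-force checker for Lamport's definition of sequential consistency quoted earlier. It is only a sketch for tiny histories: the function name `sequentially_consistent` and the operation encoding (`("w", key, value)` for a write, `("r", key, value_seen)` for a read) are assumptions made for this example. It searches for one interleaving that keeps each process's program order and in which every read returns the most recent write.

```python
def sequentially_consistent(histories, initial=None):
    """Return True if some interleaving of the per-process histories is legal.

    histories: one list of operations per process, in program order.
    """

    def search(positions, store):
        if all(pos == len(h) for pos, h in zip(positions, histories)):
            return True  # every operation has been placed in the order
        for i, history in enumerate(histories):
            pos = positions[i]
            if pos == len(history):
                continue
            kind, key, value = history[pos]
            advanced = positions[:i] + (pos + 1,) + positions[i + 1:]
            if kind == "w":
                if search(advanced, {**store, key: value}):
                    return True
            elif store.get(key, initial) == value:
                # A read is legal only if it observes the latest written value.
                if search(advanced, store):
                    return True
        return False

    return search((0,) * len(histories), {})
```

For example, a process that reads `x = 1` and afterwards reads the initial value of `x` cannot be explained by any single order if `x` was written only once, so the checker rejects that history.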

Why different models?
- Tradeoff between:
  - Performance: consistency requires sync
  - Availability: want to operate when disconnected
  - Programmability: weaker consistency makes applications harder to write (i.e., harder to provide app-level guarantees)
- If you want availability while the network is partitioned, you must give up strong consistency (by CAP: consistency, availability, partition tolerance)

Reference

https://courses.cs.washington.edu/courses/cse452/17sp/slides/Caching.pdf (Examples on different consistency models)
https://courses.cs.washington.edu/courses/cse452/17sp/slides/ImplementingCaches1
https://www.cs.utexas.edu/~vijay/cs380D-s18/feb6-fb.pdf
CS439 Alison's slide "Other File Systems"

State Machine Replication Approach

2018-03-07T20:24:00+08:00

State Machine Replication
Applications

State Machine Replication

State Machine Replication system properties:
- Available
- Fault tolerant
- Appear to behave like a single machine
t fault tolerant: A system consisting of a set of distinct components is t fault tolerant if it satisfies its specification provided that no more than t of those components become faulty during some interval of interest.
t fault-tolerant state machine implementation: replicating that state machine and running a replica on each of the processors in a distributed system. Provided each replica being run by a nonfaulty processor starts in the same initial state and executes the same requests in the same order, then each will do the same thing and produce the same output.
When processors can experience Byzantine failures, an ensemble implementing a t fault-tolerant state machine must have at least $2t + 1$ replicas, and the output of the ensemble is the output produced by the majority of the replicas. (Because Byzantine failures can produce wrong results, a majority of the replica outputs must be correct. Since we are t fault-tolerant, up to t replicas may suffer Byzantine failures, so we need another $t + 1$ replicas that produce correct results, for $2t + 1$ replicas in total.)
If processors experience only fail-stop failures, then an ensemble containing $t + 1$ replicas suffices, and the output of the ensemble can be the output produced by any of its members. (With fail-stop failures, a replica that fails simply stops working, so $t + 1$ replicas in total suffice: we only need to guarantee that one replica is still working.)
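
The replica counts above can be written down as a short sketch. The helper names `replicas_needed` and `ensemble_output` are hypothetical; the second one assumes it is handed the list of outputs collected from the replicas that responded.

```python
from collections import Counter


def replicas_needed(t, byzantine):
    """Minimum ensemble size for a t fault-tolerant state machine."""
    return 2 * t + 1 if byzantine else t + 1


def ensemble_output(outputs, t, byzantine):
    """Pick the ensemble's output from the outputs of individual replicas."""
    if not byzantine:
        # Fail-stop: any replica that answered is correct.
        return outputs[0]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < t + 1:
        # With at most t faulty replicas, t + 1 matching votes are required.
        raise ValueError("no output is backed by t + 1 replicas")
    return value
```

With `t = 1` and Byzantine failures, three replicas are needed and `ensemble_output([42, 42, 7], t=1, byzantine=True)` returns the majority value 42.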
Implementing Replication:
- Agreement: Every nonfaulty state machine replica receives every request.
  - Implemented by clients
  - When a client makes a request, it broadcasts the request to all servers in the system
- Order: Every nonfaulty state machine replica processes the requests it receives in the same relative order.
  - Implemented by servers
  - Define a total order of requests in the system and execute requests in that order
  - Process a request with the lowest timestamp that has been received by that replica
  - Stability: The replica can never receive an event with a lower timestamp
    - Implementing stability: Receive requests from a client in increasing order (Given by FIFO channels and logical clocks)
    - A request is stable once a request has been received from each client with a greater timestamp
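
The order and stability rules above can be sketched as follows. This is an illustrative toy, assuming a fixed, known set of clients, FIFO channels, and unique `(timestamp, client_id)` pairs; the class name `Replica` and its methods are hypothetical.

```python
import heapq


class Replica:
    """Executes requests in timestamp order once they are stable."""

    def __init__(self, client_ids):
        self.latest = {c: 0 for c in client_ids}  # highest timestamp seen per client
        self.queue = []                           # min-heap of (timestamp, client, request)
        self.executed = []

    def receive(self, client_id, timestamp, request):
        # FIFO channels plus logical clocks: each client's timestamps only grow.
        if timestamp <= self.latest[client_id]:
            raise ValueError("timestamps from one client must increase")
        self.latest[client_id] = timestamp
        heapq.heappush(self.queue, (timestamp, client_id, request))
        self._execute_stable()

    def _execute_stable(self):
        while self.queue:
            ts, sender, request = self.queue[0]
            # Stable: every other client has already sent something newer,
            # so no request with a lower timestamp can still arrive.
            others = (seen for c, seen in self.latest.items() if c != sender)
            if not all(seen > ts for seen in others):
                break
            heapq.heappop(self.queue)
            self.executed.append(request)
```

Two replicas that receive the same requests in different arrival orders end up with the same `executed` list, which is the property the order requirement asks for.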

Applications

Lamport Clocks, Vector Clocks

2018-03-06T20:24:00+08:00

Lamport Clocks

In a distributed system, there is no global time and no global state $\implies$ the clock of different nodes in a distributed system can have different values.
Happened-before Relationship:
- Some events in a distributed system happened before other events and others are concurrent
- Happened-before is a partial ordering on events in a distributed system:
  
  Given events $E1, E2, E3$ where $E1$ happens before $E2$ and $E1$ happens before $E3$ (and nothing else is known), we have $E1 < E2$ and $E1 < E3$, while $E2$ and $E3$ are concurrent.
$\rightarrow$ relation satisfies the following conditions:

1) If $a$ and $b$ are events in the same process, and $a$ comes before $b$, then $a \rightarrow b$

2) If $a$ is the sending of a message by one process and $b$ is the receipt of the same message by another process, then $a \rightarrow b$

3) If $a \rightarrow b$ and $b \rightarrow c$, then $a \rightarrow c$
Two distinct events $a$ and $b$ are said to be concurrent if $a \not\rightarrow b$ and $b \not\rightarrow a$
Logical Clocks:
- Assigns a monotonically increasing number $C(x)$ for each event $x$ in a process
- If event $x$ happens before event $y$, $C(x) < C(y)$ (Note, $C(x) < C(y) \not\implies x < y$)
- If $x$ and $y$ are in the same process, and $x < y$, then $C(x) < C(y)$
- If $x$ is sending of message, and $y$ receipt of the message, then $x < y$ and $C(x) < C(y)$
Implementing Logical Clocks:
- Within a process $X$, increment $C(x)$ every time an event happens
- When process $X$ receives a message with timestamp $T$, $C(x) = \max(T, C(x)) + 1$
How do we break ties between concurrent events and achieve a total ordering of the events in the system:
- If $x$ and $y$ are in the same process, and $x < y$, then $C(x) < C(y)$
- If $x$ and $y$ are concurrent and $C(x) = C(y)$, order them by process ID: $x$ comes before $y$ if $P(x) < P(y)$ ($P(\cdot)$ means process ID)
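
A minimal sketch of these rules in Python follows. The class name `LamportClock` and its method names are made up for illustration; ties between equal counters are broken by comparing `(time, pid)` tuples.

```python
class LamportClock:
    """Logical clock for one process, identified by an integer pid."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        # Local event (including sending a message): advance the counter.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receipt, jump past both the local clock and the message timestamp.
        self.time = max(msg_time, self.time) + 1
        return self.time

    def stamp(self):
        # (time, pid) tuples compare lexicographically, giving a total order.
        return (self.time, self.pid)
```

If process 1 sends a message at time 1, the receiver's clock moves to 2. If process 1 then has another local event, both processes hold counter 2, and the tuples `(2, 1)` and `(2, 2)` place process 1's event first.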

Vector Clocks

Limitation of Lamport Clocks:
- If $C(x) < C(y)$, we cannot tell whether $x < y$
- We can only say if $x < y$, then $C(x) < C(y)$
Goal: to enable each process to have an approximation of global time at all processes (Every message propagates info about state of whole system)
Each process has a vector of clocks:
- Clock $C_i$ is time for process $i$ as seen by the owner of the vector
- $C_i$ in two different vectors may not be the same
Implementing Vector Clocks:
- Each process $P_i$ updates its component $C_i$ in its vector clock (This update happens for each internal event (e.g. on receiving a message))
- Each message has a vector clock time stamp
- On getting the message, for each field $x$ in the vector: $C[x] = \max(C[x], message\_time\_stamp[x])$
Comparing Vector Timestamps:
- Timestamp $X \le Y$ if all components of $X \le$ corresponding components in $Y$
- Timestamp $X < Y$ if $X \le Y$ and at least one component of $X$ is strictly less than the corresponding component of $Y$
- If neither $X \le Y$ nor $Y \le X$, then $X$ and $Y$ are concurrent
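
The update and comparison rules can be sketched as below. The names `VectorClock`, `leq`, `less`, and `concurrent` are hypothetical, processes are assumed to be numbered `0..n-1`, and a receive is treated as an event of the receiving process (so it merges and then increments its own component).

```python
class VectorClock:
    """Vector clock owned by process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n

    def tick(self):
        # Each event at this process increments its own component.
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, msg_clock):
        # Component-wise maximum with the message's timestamp...
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        # ...then count the receive itself as an event.
        return self.tick()


def leq(x, y):
    return all(a <= b for a, b in zip(x, y))


def less(x, y):
    return leq(x, y) and x != y


def concurrent(x, y):
    return not leq(x, y) and not leq(y, x)
```

Here `less(x, y)` means `x` happened before `y`, while `concurrent(x, y)` holds exactly when neither timestamp dominates the other, which is the information a Lamport clock cannot provide.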

Distributed System Reference Guide

2018-03-06T16:24:00+08:00

This post is a reference guide that points to the concepts, system design principles, and system concepts mentioned in my posts.

System Concepts
Protocol
System Designs
System Principles

System Concepts

Logical Clocks, Vector Clocks
State Machine: A process whose state depends entirely on the starting state and sequence of operations
Replication: All servers exhibit the same behavior
Sharding: Different data on different servers; Partitioned via some function on keys
Clock Skew: the same sourced clock signal arrives at different components at different times

Protocol

Pub/Sub Mechanism from PNUTS

System Designs

State Machine Replication

System Principles

Scalability: sharding + replication
- Usually, shard then replicate
- Each piece of data lives on one replicated shard
Stronger consistency models are easier to reason about (and program for), but more expensive to obtain
Weaker consistency models provide more performance, but hard to understand and program for

"Why do computers stop and what can be done about it?"

2018-01-21T20:24:00+08:00

Availability
Study of Failures
Lessons from Tandem Study
Software Fault Tolerance

Note

Gray, J. Why do computers stop and what can be done about it?, 1985.

Availability

Terminology

Mean Time Between Failures (MTBF)
Mean Time to Repair (MTTR)
Availability: percentage of time the system is operational
- $99.37\%$ availability over 10 days translates to about 1.5 hours of outage every 10 days on average (i.e. $(1 - 99.37\%) \times 10 \times 24 = 1.51$)
- Availability = MTBF / (MTBF + MTTR) = $\frac{10*24}{(10*24 + 1.5)} = 0.9937$
- If $90\%$ of servers are available $90\%$ of the time, overall availability could be $81\%$ (could be higher when using certain techniques)
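
The arithmetic above is easy to reproduce. The helper names below are hypothetical, and `serial_availability` reflects one reading of the 81% figure: the service needs two independent parts that are each up 90% of the time, so their availabilities multiply.

```python
def availability(mtbf, mttr):
    """Fraction of time the system is operational (same time unit for both)."""
    return mtbf / (mtbf + mttr)


def downtime_hours(avail, window_hours):
    """Expected outage within a window, given an availability fraction."""
    return (1 - avail) * window_hours


def serial_availability(*parts):
    """Availability when every part must be up (independence assumed)."""
    result = 1.0
    for a in parts:
        result *= a
    return result
```

With an MTBF of 240 hours and an MTTR of 1.5 hours, `availability(240, 1.5)` is roughly 0.9938, and `downtime_hours(0.9937, 240)` is about 1.5 hours.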

Key to Availability

If MTTR is zero, then Availability = MTBF / (MTBF + 0) = 1
We need to give the illusion of instantaneous repair
Key idea: Modularize the system so that modules can be repaired “instantly”
How to provide instant repair? Have a “hot” spare that can take over instantly
We can analyze schemes to increase availability along several dimensions:
- CAPEX (one time capital expense)
- OPEX (on-going operating expenses)
- Increase in latency?
- Reduction in throughput?

Achieving High Availability

Key ideas: modularity and redundancy
Modularity: a failure within a module affects only that module
- von Neumann’s system required 20K replicas to achieve an MTBF of 100 years
- Why? No modularity
- Large combinations of modules were replicated
Jim Gray’s algorithm (can give the system an MTBF of decades or centuries)
- Hierarchically decompose the system into modules
- Design each module to have MTBF > 1 year
- Make each module fail-fast
- Have a heart-beat message for each module so you know when it fails
- Have spare modules which pick up job of failed module. Failover to spare module should be quick.
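
The heartbeat and failover steps of this recipe can be sketched as a tiny supervisor. Everything here is hypothetical (the `Supervisor` class, the single spare, the explicit `now` argument standing in for a real clock); it only illustrates how a missed heartbeat triggers a switch to the spare.

```python
class Supervisor:
    """Watches one active module and fails over to a spare on missed heartbeats."""

    def __init__(self, primary, spare, timeout):
        self.active = primary
        self.spare = spare
        self.timeout = timeout
        self.last_beat = 0.0

    def heartbeat(self, module, now):
        # Only heartbeats from the currently active module count.
        if module == self.active:
            self.last_beat = now

    def check(self, now):
        # No heartbeat within the timeout: assume the module failed (fail-fast)
        # and promote the spare, if one is still available.
        if now - self.last_beat > self.timeout and self.spare is not None:
            self.active, self.spare = self.spare, None
            self.last_beat = now
        return self.active
```

The shorter the timeout, the closer failover gets to the "instant repair" illusion, at the cost of more false alarms when heartbeats are merely delayed.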

Study of Failures

Analyzed cause of failures over 7 months
Study covers 2000 systems, 10M system hours
166 failures reported in this period
59 of these failures are “infant” failures - faulty hardware or new
42% of failures caused by system administration
- Includes software and hardware maintenance: 25%
- Operations: 9%, configuration: 8%
25% software failures, 18% hardware failures
14% of failures caused by environmental failures
- 9% power failures, 5% communication and facilities

Lessons from Tandem Study

Key to high availability: tolerating human errors and operations failures
Need to design systems to have:
- Minimal configuration
- Minimal maintenance
- Simple, consistent interfaces
New systems often have higher failure rate
- Need time to work out these bugs
- Do not deploy systems until they become stable
Jim Gray suggests:
- Do regular hardware maintenance
- Delay software upgrades as long as possible, allow them time to become mature
- Only patch a bug if it is causing outages

Software Fault Tolerance

Applying lessons from before:
- Software modularity through processes and messages
- Fail-fast software modules
- Process-pairs to handle transient faults
- Transactions
Underlying assumption: software faults are transient
- Why? The hard software faults would have been removed in testing and quality assurance checks

Containing Software Faults

Two main approaches:
- Static checking checks the code before it is even run
  - Conservative checking
  - May throw up lots of false positives
- Dynamic checking checks code that is executed
  - Has lower false positives
  - Might not catch all bugs, especially in rarely run code paths

Fail Fast Software

In today’s terms, lots of assert conditions in the code
- Linux kernel is filled with PANIC calls. If something goes wrong, print the stack trace and kill the kernel.
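
As a small illustration of the fail-fast style, the sketch below checks its inputs and its own invariant and stops immediately when either is violated. The `FailFast` exception and the `withdraw` function are invented for this example; an explicit exception is used instead of `assert` because Python strips `assert` statements when run with `-O`.

```python
class FailFast(Exception):
    """Raised as soon as the module detects that something is wrong."""


def withdraw(balance, amount):
    # Reject bad inputs at the boundary instead of computing with them.
    if amount <= 0:
        raise FailFast(f"invalid amount: {amount}")
    if amount > balance:
        raise FailFast(f"insufficient funds: {balance} < {amount}")
    new_balance = balance - amount
    # Check the module's own invariant before handing the result on.
    if new_balance < 0:
        raise FailFast("invariant violated: negative balance")
    return new_balance
```

The point is that a faulty module stops at the first sign of trouble rather than continuing and spreading corrupted state to other modules.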

Process Pairs

When one process fails, the other process takes over
Types of process pairs:
- Lockstep: both execute every instruction
- Checkpointing: primary occasionally checkpoints its state, which is copied over to backup
  - Variants: Delta Checkpointing, Kernel Checkpointing
- Persistence: backup gets all its knowledge from persistent storage
  - Need to ensure persistent storage is not inconsistent
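
A checkpointing process pair can be sketched as follows. The `Primary` and `Backup` classes and the counter state are hypothetical; the sketch only shows that the backup resumes from the last checkpoint, so work done after that checkpoint is lost unless it is redone.

```python
import copy


class Backup:
    def __init__(self):
        self.state = {"count": 0}

    def take_over(self):
        # Resume from the most recent checkpoint received from the primary.
        return self.state


class Primary:
    def __init__(self, backup):
        self.state = {"count": 0}
        self.backup = backup

    def handle(self, delta):
        self.state["count"] += delta

    def checkpoint(self):
        # Copy the state so later changes on the primary do not leak into the backup.
        self.backup.state = copy.deepcopy(self.state)
```

If the primary handles a request, checkpoints, handles another request, and then fails, the backup takes over with the state as of the checkpoint.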

Transactions

Provide the ACID properties: atomicity, consistency, isolation, durability
Jim Gray argues for persistent process pairs combined with transactions
- Implemented in the Encompass system

Fault-Tolerant Communication

Key idea: sessions and sequence numbers
Same idea used in TCP
Sequence numbers used to identify duplicate and lost messages
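
A receiver that uses sequence numbers this way might look like the sketch below. The `Receiver` class and its return strings are hypothetical, and a real protocol would also buffer out-of-order messages and request retransmission.

```python
class Receiver:
    """Delivers messages of one session in order, using sequence numbers."""

    def __init__(self):
        self.expected = 0    # next sequence number we are waiting for
        self.delivered = []

    def on_message(self, seq, payload):
        if seq < self.expected:
            return "duplicate"  # already delivered: drop it
        if seq > self.expected:
            return "gap"        # something in between was lost or delayed
        self.delivered.append(payload)
        self.expected += 1
        return "delivered"
```

A retransmitted message shows up as `"duplicate"`, while a jump in sequence numbers shows up as `"gap"`, telling the receiver that an earlier message is missing.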

"Introduction to Distributed System Design"

2018-01-17T16:24:00+08:00

The Basics
So How Is It Done?
Remote Procedure Calls
Distributed Design Principles

Note

O. Tatebe, "Introduction to Distributed System Design"

The Basics

What is a distributed system?

Here is a "cascading" definition of a distributed system:

A program: is the code you write.
A process: is what you get when you run it.
A message: is used to communicate between processes.
A packet: is a fragment of a message that might travel on a wire.
A protocol: is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
A network: is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
A component: can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
A distributed system: is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks.

Why build a distributed system?

There are lots of advantages including the ability to connect remote users with remote resources in an open and scalable way.

open: each component is continually open to interaction with other components.
scalable: the system can easily be altered to accommodate changes in the number of users, resources and computing entities.

How can a distributed system be reliable?

For a distributed system to be useful, it must be reliable. To be truly reliable, a distributed system must have the following characteristics:

Fault-Tolerant: It can recover from component failures without performing incorrect actions.
Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
Predictable Performance: The ability to provide desired responsiveness in a timely manner.
Secure: The system authenticates access to data and services.

Category of failures in a distributed system

When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern. Failures fall into two obvious categories: hardware and software.

Hardware failures: today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures.
Software failures: residual bugs in mature systems can be classified into two main categories
- Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode.
- Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions.

Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.

The types of failures that can occur in a distributed system:

Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
Network failures: A network link breaks.
Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
Timing failures: A temporal property of the system is violated. For example, clocks on different computers which are used to coordinate processes are not synchronized; when a message is delayed longer than a threshold period, etc.
Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc.

To design for failure, we must be careful to not make any assumptions about the reliability of the components of a system. Everyone, when they first build a distributed system, makes the following eight assumptions, which are referred to as the "8 Fallacies":

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Note

Latency: the time between initiating a request for data and the beginning of the actual data transfer.
Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed.
Homogeneous network: A network running a single network protocol.

So How Is It Done?

We will focus on a particular type of distributed systems design, one that uses a client-server model with mostly standard protocols. It turns out that these standard protocols provide considerable help with the low-level details of reliable network communications, which makes our job easier.

In client-server applications, the server provides some service, such as processing database queries or sending out current stock prices.
The client uses the service provided by the server, either displaying database query results to the user or making stock purchase recommendations to an investor.

The communication that occurs between the client and the server must be reliable. That is, no data can be dropped and it must arrive on the client side in the same order in which the server sent it. Types of server are:

File servers manage disk storage units on which file systems reside.
Database servers house databases and make them available to clients.
Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service.

Some key terms:

Service is used to denote a set of servers of a particular type.
A binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service.

There are many binding policies that define how a particular server is chosen. For example, the policy could be based on locality (a Unix NIS client starts by looking first for a server on its own machine); or it could be based on load balance (a CICS client is bound in such a way that uniform responsiveness for all clients is attempted).
A distributed service may employ data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed.
Caching is a related concept and very common in distributed systems. We say a process has cached data if it maintains a copy of the data locally, for quick access if it is needed again.
A cache hit is when a request is satisfied from cached data, rather than from the primary service.

For example, browsers use document caching to speed up access to frequently used documents.

Note

The difference between caching and replication: Caching is similar to replication, but cached data can become stale. Thus, there may need to be a policy for validating a cached data item before using it. If a cache is actively refreshed by the primary service, caching is identical to replication.
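
One possible validation policy is to compare a version number with the primary service before using a cached item. The sketch below assumes this version-check policy; the `Origin` and `ValidatingCache` classes are hypothetical and ignore failures and concurrency.

```python
class Origin:
    """Primary service: stores a (version, value) pair for each key."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)

    def version(self, key):
        return self.store[key][0]

    def get(self, key):
        return self.store[key]


class ValidatingCache:
    """Uses a cached item only if its version still matches the origin's."""

    def __init__(self, origin):
        self.origin = origin
        self.entries = {}
        self.hits = 0

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[0] == self.origin.version(key):
            self.hits += 1  # cache hit: the cached copy is still valid
            return entry[1]
        # Miss or stale copy: fetch the data again from the primary service.
        self.entries[key] = self.origin.get(key)
        return self.entries[key][1]
```

The version check is cheaper than fetching the whole item, but it still costs a round trip; leases and invalidations are ways to avoid that round trip.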

The Internet Protocol (IP) suite is the set of communication protocols that allow for communication on the Internet and most commercial networks. The Transmission Control Protocol (TCP) is one of the core protocols of this suite. Using TCP, clients and servers can create connections to one another, over which they can exchange data in packets. The protocol guarantees reliable and in-order delivery of data from sender to receiver.

Note

The IP suite can be viewed as a set of layers, each layer having the property that it only uses the functions of the layer below, and only exports functionality to the layer above. A system that implements protocol behavior consisting of layers is known as a protocol stack. Protocol stacks can be implemented either in hardware or software, or a mixture of both. Typically, only the lower layers are implemented in hardware, with the higher layers being implemented in software.

There are four layers in the IP suite (top-down):

Application Layer : The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of applications are HTTP, FTP or Telnet.
Transport Layer : The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP):
- TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet typically three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.
- UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't provide any sort of guarantee that the receiver will receive the packets that are sent in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.

Note

Although TCP is more reliable than UDP, the protocol is still at risk of failing in many ways. TCP uses acknowledgements and retransmission to detect and repair loss. But it cannot overcome longer communication outages that disconnect the sender and receiver for long enough to defeat the retransmission strategy. The normal maximum disconnection time is between 30 and 90 seconds. TCP could signal a failure and give up when both end-points are fine. This is just one example of how TCP can fail, even though it does provide some mitigating strategies.

Network Layer : As originally defined, the Network layer solves the problem of getting packets across a single network. With the advent of the concept of internetworking, additional functionality was added to this layer, namely getting data from a source network to a destination network. This generally involves routing the packet across a network of networks, e.g. the Internet. IP performs the basic task of getting packets of data from source to destination.
Link Layer : The link layer deals with the physical transmission of data, and usually involves placing frame headers and trailers on packets for travelling over the physical network and dealing with physical components along the way.
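
The layering described above can be pictured as each layer wrapping the data handed down from the layer above. The sketch below uses made-up textual headers (`TCP|`, `IP|`, `ETH|`) and a made-up trailer (`|FCS`) purely for illustration; real headers are binary structures with many fields.

```python
# Headers added on the way down: transport, then network, then link.
LAYERS = [b"TCP|", b"IP|", b"ETH|"]
TRAILER = b"|FCS"  # stand-in for the link layer's frame trailer


def encapsulate(app_data):
    """Wrap application data as it passes down the stack."""
    unit = app_data
    for header in LAYERS:
        unit = header + unit
    return unit + TRAILER


def decapsulate(frame):
    """Strip the trailer and headers as the frame passes up the stack."""
    if not frame.endswith(TRAILER):
        raise ValueError("missing frame trailer")
    unit = frame[: -len(TRAILER)]
    for header in reversed(LAYERS):
        if not unit.startswith(header):
            raise ValueError("unexpected header")
        unit = unit[len(header):]
    return unit
```

Each layer only adds or removes its own header, which mirrors the rule that a layer uses only the functions of the layer directly below it.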

Remote Procedure Calls

Over time, an efficient method for clients to interact with servers evolved called RPC, which means remote procedure call. It is a powerful technique based on extending the notion of local procedure calling, so that the called procedure may not exist in the same address space as the calling procedure. The two processes may be on the same system, or they may be on different systems with a network connecting them.

An RPC is similar to a function call. Like a function call, when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned. The client makes a procedure call that sends a request to the server. The client process waits until either a reply is received, or it times out. When the request arrives at the server, it calls a dispatch routine that performs the requested service, and sends the reply to the client. After the RPC call is completed, the client process continues.

Threads are common in RPC-based distributed systems. Each incoming request to a server typically spawns a new thread. A thread in the client typically issues an RPC and then blocks (waits). When the reply is received, the client thread resumes execution. A programmer writing RPC-based code does three things:

Specifies the protocol for client-server communication
Develops the client program
Develops the server program

The communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked.

The client and server programs must communicate via the procedures and data types specified in the protocol. The server side registers the procedures that may be called by the client and receives and returns data required for processing. The client side calls the remote procedure, passes any required data and receives the returned data.

Thus, an RPC application uses classes generated by the stub generator to execute an RPC and wait for it to finish. The programmer needs to supply classes on the server side that provide the logic for handling an RPC request.
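
The division of labour between stub, dispatch routine, and server logic can be sketched without a network. In the toy below, the `Server` and `ClientStub` classes are hand-written stand-ins for what a stub generator would produce, JSON stands in for the wire format, and the "transport" is simply a function call; all of these are assumptions made for illustration.

```python
import json


class Server:
    """Holds registered procedures and dispatches incoming requests to them."""

    def __init__(self):
        self.procedures = {}

    def register(self, name, fn):
        self.procedures[name] = fn

    def dispatch(self, request_bytes):
        request = json.loads(request_bytes)
        fn = self.procedures.get(request["method"])
        if fn is None:
            return json.dumps({"error": "unknown procedure"}).encode()
        result = fn(*request["args"])
        return json.dumps({"result": result}).encode()


class ClientStub:
    """Marshals a call into a message, sends it, and unmarshals the reply."""

    def __init__(self, transport):
        self.transport = transport  # callable: request bytes -> reply bytes

    def call(self, method, *args):
        request = json.dumps({"method": method, "args": list(args)}).encode()
        reply = json.loads(self.transport(request))
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]
```

After `server.register("add", lambda a, b: a + b)`, a stub created with `ClientStub(server.dispatch)` lets the client write `stub.call("add", 2, 3)` as if it were a local call. Swapping the transport for a real network connection is where timeouts and binding errors enter.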

RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, network problem, or a problem on a client computer.

Some RPC applications view these types of errors as unrecoverable. Fault-tolerant systems, however, have alternate sources for critical services and fail-over from a primary server to a backup server.

A challenging error-handling case occurs when a client needs to know the outcome of a request in order to take the next step, after failure of a server. This can sometimes result in incorrect actions and results. For example, suppose a client process requests a ticket-selling server to check for a seat in the orchestra section of Carnegie Hall. If it's available, the server records the request and the sale. But the request fails by timing out. Was the seat available and the sale recorded? Even if there is a backup server to which the request can be re-issued, there is a risk that the client will be sold two tickets, which is an expensive mistake in Carnegie Hall.

Here are some common error conditions that need to be handled:

Network data loss resulting in retransmit

Often, a system tries to achieve 'at most once' delivery. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.
Server process crashes during RPC operation

If a server process crashes before it completes its task, the system usually recovers correctly because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.
Client process crashes before receiving response

Client is restarted. Server discards response data.
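
One common way to limit the damage of duplicate requests is to have the client attach a unique request ID and have the server remember the reply for each ID. The sketch below applies this to the ticket example; the `TicketServer` class and the request ID scheme are hypothetical, and the reply table would itself have to survive crashes (or be shared with a backup) for this to help after a failover.

```python
class TicketServer:
    """Sells each seat at most once, even when clients retry a request."""

    def __init__(self, seats):
        self.available = set(seats)
        self.replies = {}  # request_id -> reply already sent for that request

    def buy(self, request_id, seat):
        if request_id in self.replies:
            # A retry of a request we already processed: replay the old reply
            # instead of executing the sale a second time.
            return self.replies[request_id]
        if seat in self.available:
            self.available.remove(seat)
            reply = "sold"
        else:
            reply = "unavailable"
        self.replies[request_id] = reply
        return reply
```

A client that times out can safely resend the same request ID: it learns the outcome of the original attempt without being sold a second ticket.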

Distributed Design Principles

As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state. A classic error scenario is for a process to send data to a process running on a second machine. The process on the first machine receives some data back and processes it, and then sends the results back to the second machine assuming it is ready to receive. Any number of things could have failed in the interim and the sending process must anticipate these possible failures.
Explicitly define failure scenarios and identify how likely each one might occur. Make sure your code is thoroughly covered for the most likely ones.
Both clients and servers must be able to deal with unresponsive senders/receivers.
Think carefully about how much data you send over the network. Minimize traffic as much as possible.
Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise.
Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. If you must be sure, do checksums or validity checks on data to verify that the data has not changed.
Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it. If a process stores information that can't be reconstructed, then problems arise. One possible question is, "Are you now a single point of failure?" I have to talk to you now - I can't talk to anyone else. So what happens if you go down? To deal with this issue, you could be replicated. Replication strategies are also useful in mitigating the risks of maintaining state. But there are challenges here too: What if I talk to one replicant and modify some data, then I talk to another? Is that modification guaranteed to have already arrived at the other? What happens if the network gets partitioned and the replicants can't talk to each other? Can anybody proceed? There are a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.
Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more. Talk to your colleagues about these alternatives and your results, and decide on the best solution.
Acks are expensive and tend to be avoided in distributed systems wherever possible.
Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.
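
The advice above about verifying data with checksums can be sketched as follows. This is a minimal illustration, not a wire protocol: pack and unpack are hypothetical helper names, and SHA-256 is just one possible choice of digest.

```python
import hashlib


def pack(payload: bytes) -> bytes:
    """Prefix the payload with its 32-byte SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unpack(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the digest does not match."""
    digest, payload = message[:32], message[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload


message = pack(b"seat=A1;section=orchestra")
# Simulate corruption in transit by flipping one bit of the last byte.
corrupted = message[:-1] + bytes([message[-1] ^ 0x01])
```

The receiver calls unpack on every message and treats a ValueError like any other failed transfer, for example by requesting a retransmission.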

How to write binary search correctly

2018-01-12T16:24:00+08:00

Introduction
Term explanation
Binary search
- Problem statement
- Implementation
  - Implementation 1
  - Implementation 2
Binary search variations
- Example 1
- Example 2
Conclusion
Reference

Introduction

Binary search is a straightforward algorithm to understand, but it is hard to code correctly. This is especially true because binary search can be implemented in many forms. In addition, many problems can be solved by binary search with a slight adjustment. Thus, it is not feasible to simply memorize an implementation template without understanding it. In this post, we illustrate how we can use the loop invariant in combination with the precondition and postcondition to code binary search correctly. One thing to note is that this post is written for practical purposes and may not be theoretically rigorous.

Term explanation

Invariant

Mike's post has an excellent description of invariant, which I directly paste below:

An invariant is a property that remains true throughout the execution of a piece of code. It’s a statement about the state of a program — primarily the values of variables — that is not allowed to become false. (If it does become false, then the code is wrong.) Choosing the correct invariant — one that properly expresses the intent of an algorithm — is a key part of the design of code (as opposed to the design of APIs); and ensuring that the invariant remains true is a key part of the actual coding. Roughly speaking, if your invariant properly expresses the intent of the algorithm, and if your code properly maintains the invariant, then that is enough for you to be confident that the code, if it terminates, yields the correct answer. (Ensuring that it does in fact terminate is the job of the bound function.)

In short, an invariant is a condition that can be relied upon to be true during execution of a program, or during some portion of it. In practice, we formalize that invariant in terms of specific variables and values that appear in the algorithm.

Bound function

Again, Mike's post has a nice explanation of bound function:

The bound function of a loop is defined as an upper bound on the number of iterations still to perform. More generally, you can think of it as an expression whose value decreases monotonically as the loop progresses. When it reaches zero (or drops below), you exit the loop.

Precondition and postcondition

We borrow the definition from Mike's post once again:

When you invoke a function — or, all right, a method — you have a sense of what needs to be true when you invoke it, and what it guarantees to be true when it returns. For example, when you call a function oneMoreThan(val), you undertake to ensure that val is an integer, and the function undertakes to ensure that the value it returns is one more than the one you passed in. These two promises — the precondition and postcondition — constitute the contract of the function. So:

The precondition is the promise that you make before running a bit of code;

The postcondition is the promise that the code makes after it’s been run.
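
As a small illustration, the contract of oneMoreThan from the quote can be written down with assertions. This is only a sketch: the name one_more_than and the use of assert to express the contract are choices made here, not something taken from Mike's post.

```python
def one_more_than(val):
    # Precondition: the caller promises that val is an integer.
    assert isinstance(val, int), "precondition violated: val must be an int"
    result = val + 1
    # Postcondition: the function promises that the result is one more than val.
    assert result == val + 1, "postcondition violated"
    return result
```

Calling one_more_than(41) returns 42, while calling it with a string trips the precondition check before any work is done.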

Binary search

Problem statement

Before we actually implement our algorithm, we first need to make sure we understand the problem correctly. By understanding, we mean knowing the given input and the expected output. Those will be translated into the precondition and postcondition for our algorithm. The precondition helps us design the algorithm, and the postcondition helps us make sure we get the intended result.

The binary search problem is the following: given an integer $X$ and integers $A_0, A_1, \dots, A_{N-1}$, which are presorted in ascending order, find $i$ such that $A_i = X$, or return $i = -1$ if $X$ is not in the input. There might be multiple values of $i$ with $A_i = X$.

Implementation

The invariant for our binary search algorithm is: "if the target value $X$ is present in the array, then the target value is present in the current range." As mentioned above, the invariant is formalized using specific variables and values. Here, we need to decide on the representation of the "current range". This is where the many ways of implementing binary search come from. We use low and high to define the range and we use n to denote the length of the array. There are several popular formalizations of the "current range":

1. A[low] <  A[i] <  A[high]
2. A[low] <= A[i] <  A[high]
3. A[low] <  A[i] <= A[high]
4. A[low] <= A[i] <= A[high]

Numbers 2 and 4 are the invariants behind the two most popular implementations of binary search you can find on the internet.

Implementation 1

For 2, the equation means that $i \in [\text{low}, \text{high})$. Thus, low is initialized to 0 and high is initialized to n. The invariant for this is "If $X$ is at any position $i$ in $A$ (i.e., $A_i = X$) then low <= i < high". The implementation that represents this invariant is below:

def binarySearch(A, X):
    low, high = 0, len(A)
    while low < high:
        i = low + (high - low) // 2
        if X == A[i]:
            return i
        elif X > A[i]:
            low = i + 1
        else: # X < A[i]
            high = i
    return -1

In order to show the correctness of the above implementation, we need to make sure that we maintain the invariant through the execution of the function:

The first thing is to establish the invariant that’s going to hold true for the rest of the function, so we set the variables low and high to appropriate values (i.e. the lowest and one past the highest indexes of the whole $A$).
We have ensured that the invariant is true when we first enter the loop. To show that it stays true throughout the running of the function, we need to show that whenever it’s true at the top of the loop, it’s also true at the bottom. Here, our invariant is also the loop invariant.
The first statement of the loop (assigning to i) does not affect any of the variables referenced by the invariant, so it can’t possibly cause the invariant to stop being true.
What follows is a three-way if statement: we need to show that each of the three branches maintains the invariant.
- The first branch (i.e., X == A[i]) covers the case where we have found the target. At this point, we’re returning from the function (and therefore breaking out of the loop) so we don’t really care about the invariant any more; but for what it’s worth, it remains true, as we don’t change the values of low or high.
- The second branch (i.e., X > A[i]) is the first time we need to use non-trivial reasoning. If we’re in this branch, we know that the condition guarding it was true (i.e., X > A[i]). But because A is sorted in ascending order, we know that for all $j < i, A[j] <= A[i]$. This means that $X$ > all A[j] with $j <= i$. Thus, the lowest position the target can be at is i+1. In addition, since low is the inclusive bound of our invariant $i \in [low, high)$, we can set low to i+1.
- The third branch follows the same form as the second: since we know that X < A[i] and that $A[j] >= A[i] \forall j > i$, we know the highest position the target can be at is i-1. However, our invariant insists that $i \in [low, high)$ with high being the exclusive bound. Thus, we cannot set high to i-1 and instead, we set it to i. Doing so, we maintain our invariant unchanged.
Since we’ve verified that all three branches of the if maintain the invariant, we know that the invariant holds on exiting that if. That means the invariant is true at the bottom of the loop, which means it will be true at the start of the next time around the loop. And by induction we deduce that it always remains true.
Finally, we break out of the loop when low < high is false, which means that the candidate range is empty (i.e. low == high). At this point, we know that the condition of the invariant (“If $X$ is at any position $i$ in $A$”) does not hold, so the invariant is trivially true; and we return the out-of-band value -1. ¹
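
The argument above can also be checked mechanically while debugging. The sketch below repeats Implementation 1 and asserts the invariant at the top of every iteration by brute force. The name binary_search_checked is hypothetical, and the linear-time checks defeat the purpose of binary search, so they are meant for testing only.

```python
def binary_search_checked(A, X):
    low, high = 0, len(A)
    while low < high:
        # Invariant: every position holding X lies in [low, high).
        assert all(low <= j < high for j, a in enumerate(A) if a == X)
        i = low + (high - low) // 2
        if X == A[i]:
            return i
        elif X > A[i]:
            low = i + 1
        else:  # X < A[i]
            high = i
    # On exit the range is empty, so the invariant implies X is absent.
    assert X not in A
    return -1
```

Running it over a handful of sorted arrays and targets, including an empty array and arrays with duplicates, exercises every branch without tripping an assertion.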

Implementation 2

For 4, the equation means that $i \in [\text{low}, \text{high}]$. Thus, low is initialized to 0 and high is initialized to n-1. The invariant for this is "If $X$ is at any position $i$ in $A$ (i.e., $A_i = X$) then low <= i <= high". The implementation that represents this invariant is below:

def binarySearch(A, X):
    low, high = 0, len(A) - 1
    while low <= high:
        i = low + (high-low) // 2
        if X == A[i]:
            return i
        elif X > A[i]:
            low = i + 1
        else:
            high = i - 1
    return -1

Mike's post has already done a similar analysis showing that the invariant is maintained, so I'll skip it for this implementation. Until now, we haven't touched on the concepts of "bound function" and "postcondition" in our analysis. However, that doesn't mean these two concepts are not important. Usually, the "bound function" is used to prove that our loop terminates (Mike's post talks about how to use the "bound function" to show that the above implementation terminates; the TopCoder link gives an example of why we need to show that an algorithm actually terminates). In the next section, we'll see an example where checking the postcondition is important to make sure we return the correct result.
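
To make the bound function concrete, here is a sketch of Implementation 2 that asserts the bound shrinks on every iteration. Taking high - low + 1, the number of candidates left in [low, high], as the bound function is a choice made for this illustration, and binary_search_bounded is a hypothetical name.

```python
def binary_search_bounded(A, X):
    low, high = 0, len(A) - 1
    while low <= high:
        bound = high - low + 1  # candidates still in [low, high]
        i = low + (high - low) // 2
        if X == A[i]:
            return i
        elif X > A[i]:
            low = i + 1
        else:  # X < A[i]
            high = i - 1
        # The bound must strictly decrease on every iteration that does not return.
        assert high - low + 1 < bound
    return -1
```

Because low <= i <= high inside the loop, both updates remove at least the position i from the range, so the bound drops by at least one each time and the loop must terminate.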

Binary search variations

In this section, we'll talk about two examples that use binary search and see how we can implement the binary search correctly if we are able to maintain the invariant.

Example 1

The first example is LC153. Find Minimum in Rotated Sorted Array. Here, we are asked to find the minimum element of a rotated sorted array. Suppose we have the array [0,1,2,4,5,6,7]; it can be rotated in seven ways:

0,1,2,4,5,6,7
7,0,1,2,4,5,6
6,7,0,1,2,4,5
5,6,7,0,1,2,4
4,5,6,7,0,1,2
2,4,5,6,7,0,1
1,2,4,5,6,7,0

The key observation is that, when the current range is actually rotated, if the middle value is greater than the left value, then the minimum value appears on the righthand-side of the middle value (e.g., for [4,5,6,7,0,1,2], the middle value is 7 and the minimum value 0 appears on the righthand-side of 7). Otherwise, the minimum value is the middle value itself or appears on the lefthand-side of it. Like in the previous section, we need to define left and right and then our invariant. Here, we define left as index 0 of the array and right as the last index of the array. Then, our invariant can be formulated as: "the index of the minimum value (i.e., $i$) is always contained in the subarray denoted by left and right (i.e., $i \in [\text{left}, \text{right}]$)". In addition, our precondition is: an array sorted in ascending order is rotated at some unknown pivot (LC153 guarantees that the elements are unique). Our postcondition is: the function returns the minimum value. Then, our implementation is:

from typing import List

def findMin(nums: List[int]) -> int:
    left = 0
    right = len(nums) - 1
    while left < right:
        if nums[left] < nums[right]:
            return nums[left]
        mid = left + (right - left) // 2
        if nums[mid] >= nums[left]:
            left = mid + 1
        else: # nums[mid] < nums[left]
            right = mid
    return nums[left]

We verify that our invariant is maintained as follows:

The invariant is unchanged before the first if. When nums[left] < nums[right], we know that the current range is not rotated (e.g., [0,1,2,4,5,6,7]) and we return nums[left], which keeps the invariant and satisfies the postcondition.
For the second if: when nums[mid] >= nums[left], by our observation, we know that the minimum value appears on the righthand-side of the middle value. The mid value cannot be the minimum because nums[mid] >= nums[left]. Thus, we can set left to mid + 1 and still maintain our invariant.
For the case when nums[mid] < nums[left], we know that the minimum value appears on the lefthand-side of the middle value. However, the mid value can itself be the minimum value and thus we set right to mid to maintain the invariant.
The loop exits when left == right. Our invariant is $i \in [\text{left}, \text{right}]$, which differs from what the postcondition requires in two ways: 1. the postcondition asks us to return the actual minimum value instead of the index 2. we still haven't found the minimum value yet. The invariant states that $i \in [\text{left}, \text{right}]$, which is $i \in [\text{left}, \text{left}]$ given left == right. Thus, the minimum value can only appear at the index pointed to by left and, to meet the postcondition requirement, we return nums[left].

Usually, for an iterative algorithm, the invariant is the same as the loop invariant. However, this example also shows the power of the postcondition. In the basic binary search above, maintaining the invariant naturally gives us a result that meets the postcondition requirement. For this example, however, our invariant alone doesn't give us the required result. By checking the postcondition, we know what the expected return value is, and it also indicates how we can find it.
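
The postcondition can be tested directly by comparing against min on every rotation of the example array. The sketch below restates findMin as a plain function so that the block runs on its own; the names base, rotations, and results are introduced only for this check, and the input is assumed to contain distinct values.

```python
from typing import List


def findMin(nums: List[int]) -> int:
    left = 0
    right = len(nums) - 1
    while left < right:
        if nums[left] < nums[right]:
            # The current range is not rotated, so its first element is the minimum.
            return nums[left]
        mid = left + (right - left) // 2
        if nums[mid] >= nums[left]:
            left = mid + 1
        else:  # nums[mid] < nums[left]
            right = mid
    return nums[left]


base = [0, 1, 2, 4, 5, 6, 7]
# All seven rotations of the example array.
rotations = [base[k:] + base[:k] for k in range(len(base))]
results = [findMin(r) for r in rotations]
```

Every entry of results should equal 0, the minimum of base, regardless of where the rotation pivot falls.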

Example 2

In the second example, we are asked to find the index of the first number in the array that is greater than the given target number. Like in the basic binary search problem, the array is sorted in ascending order. As always, let's consider some examples for this problem. Suppose we are given the array [0,1,5,7,8,10,12,15]. If the target number is 3, we should return 2, which is the index of the first number that is greater than 3 (i.e., 5). What if the target number is 16? In this case, no number in the array is greater than the target number, and we should return -1.

What's the invariant for this problem? Similar to the other binary search problems, the invariant is "the index $i$ of the first number that is greater than the given target number in the array is in $[\text{low}, \text{high}]$". Thus, we can initialize low to 0 and high to n. Note that we set high to n instead of n-1 because of the need to maintain the invariant: when no number in the array is greater than the target number, we can pretend that the first number greater than the target sits one past the last index. Then, by our invariant, we need to include that position. Thus, we set high to n instead of n-1. The implementation is the following:

def findFirstGreaterTo(nums, target):
    low, high = 0, len(nums)
    while low < high:
        mid = low + (high - low) // 2
        if nums[mid] <= target:
            low = mid + 1
        else: # nums[mid] > target
            high = mid
    return -1 if high == len(nums) else high

The invariant is maintained as follows:

The invariant is unchanged until the first if: nums[mid] <= target. There are two cases here. When nums[mid] < target, since the array is sorted in ascending order and we are looking for a number that is greater than the target number, we can set low to mid + 1 without breaking the invariant. When nums[mid] == target, we are again looking for the first number that is greater than the target number, so we can also set low to mid + 1.
When nums[mid] > target, mid itself may be the index we are looking for, so we set high to mid to maintain our invariant.
The loop exits when low == high. Our invariant says that the index of the first number greater than the target number is in $[\text{low}, \text{high}]$, which is $[\text{high}, \text{high}]$ in this case, so our invariant still holds.

Note

Here is the reasoning for why low cannot be greater than high when the loop exits. Inside the loop we have low < high, so mid = low + (high - low) // 2 satisfies $\text{low} \le \text{mid} < \text{high}$. An iteration either sets low to mid + 1, which is at most high because $\text{mid} < \text{high}$, or sets high to mid, which is at least low because $\text{mid} \ge \text{low}$. In both cases $\text{low} \le \text{high}$ still holds after the update, so the loop can only exit with low == high.

Once we exit the loop, we need to check our postcondition once again. Our postcondition asks us to return the index of the first number that is greater than the target number if it is in the array, and -1 otherwise. However, when initializing high, we let n represent the case when no such number exists so that the invariant is still satisfied. Thus, before returning the result, we need to check whether high is within the index range of the given array to satisfy the postcondition.
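
As a cross-check, Python's standard bisect module offers bisect_right, which returns the insertion point to the right of any entries equal to the target, that is, the index of the first element greater than the target, or len(nums) if there is none. The sketch below restates findFirstGreaterTo so that the block runs on its own; find_first_greater_bisect is a hypothetical wrapper name introduced here.

```python
import bisect


def findFirstGreaterTo(nums, target):
    low, high = 0, len(nums)
    while low < high:
        mid = low + (high - low) // 2
        if nums[mid] <= target:
            low = mid + 1
        else:  # nums[mid] > target
            high = mid
    return -1 if high == len(nums) else high


def find_first_greater_bisect(nums, target):
    # bisect_right gives the index of the first element > target, or len(nums).
    idx = bisect.bisect_right(nums, target)
    # Same postcondition adjustment as above: map "one past the end" to -1.
    return -1 if idx == len(nums) else idx


nums = [0, 1, 5, 7, 8, 10, 12, 15]
answer_for_3 = findFirstGreaterTo(nums, 3)
answer_for_16 = findFirstGreaterTo(nums, 16)
```

Both functions apply the same final adjustment, which is exactly the postcondition check discussed above.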

Conclusion

In this post, we take a look at the technique that helps us implement binary search correctly: maintaining the invariant. We also emphasize the importance of the postcondition in helping us return the correct result. We haven't emphasized the importance of the bound function in this post, but we should consider it as well. There are two ways to check that the loop indeed terminates: one is through reasoning similar to what we have done in the invariant analysis (Mike's post and Paul's post give us examples of how to do that); another way is by considering some special cases, such as an array with only two elements, as suggested by the TopCoder article below.² Some details are left out of this post: 1. why we use mid = low + (high - low)//2 instead of mid = (low + high) // 2 ³ 2. why using A[low] <= A[i] < A[high] as an invariant is better than A[low] <= A[i] <= A[high]. ⁴

Reference

StackOverflow - What are the pitfalls in implementing binary search? (has a short explanation of the importance of invariant with an example)
StackOverflow - Binary search and invariant relation (has a nice example)
TopCoder - Binary Search (think about binary search through predicate inquiry)
Loop Invariants and Binary Search (Nice illustration (i.e., "safe place") between precondition, loop invariant, postcondition)
Invariants and Proofs of correctness (Gives relative formalization of our invariant analysis in the post. Essentially, invariant is the same as the inductive hypothesis.)
Binary search and loop invariants (Slide 6 is good: it tells us how to set up the termination condition of the loop. The loop terminates when the invariant looks like the postcondition. This slide also gives an example of writing the termination condition as while(low != high - 1)).
Loop invariants

A rigorous analysis of this binary search implementation can be found in Frank's lecture notes on binary search. He encodes the contract directly into the implementation, which displays the loop invariant, precondition, and postcondition directly. ↩
The TopCoder article offers an equivalent way of thinking about the invariant: predicate inquiry. For example, the invariant of our last example can be translated into the predicate: "Is nums[i] greater than the target number?" Then, each element is tagged with either "yes" or "no", and the problem asks us to find the first element tagged "yes". The two-element special case shown in the article is the "no,yes" array. See the article for details. ↩
The explanation can be found on Rosettacode page and Frank's lecture notes on binary search on page L6.12. ↩
A comment on Mike's post sheds some insights. ↩

Crowdsourcing readings

2017-12-30T12:30:00+08:00

Intro
"The Human Processing Unit (HPU)"
"Soylent: A Word Processor with a Crowd Inside"
"Crowd-based Fact Checking"
"Improving Twitter Search with Real-Time Human Computation"
"Platemate: crowdsourcing nutritional analysis from food photographs"
"An introduction to crowdsourcing for language and multimedia technology research"
"ImageNet Large Scale Visual Recognition Challenge"
"Visual Dialog"
"VQA: Visual Question Answering"
"Zooniverse: observing the world's largest citizen science platform"
"Practical Lessons for Gathering Quality Labels at Scale"
"Crowdsourcing user studies with Mechanical Turk"
"Automan: A platform for integrating human-based and digital computation"
"Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms"
"Instrumenting the crowd: using implicit behavioral measures to predict task performance"
"MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk"
"AMAZON'S TURKER CROWD HAS HAD ENOUGH"
"The Future of Work: Caring for the Crowdworker Going It Alone"
"The Good, The Bad and the Ugly: Why Crowdsourcing Needs Ethics"
"Information Extraction and Manipulation Threats in Crowd-Powered Systems"
"Exploitation in Human Computation Systems"
"Dirty Deeds Done Dirt Cheap: A Darker Side to Crowdsourcing"

Intro

This post contains reflections on some of the papers I have read in Prof. Matt Lease's Human Computation course. To get a full list of papers, see the course schedule.

"The Human Processing Unit (HPU)"

Note

The Human Processing Unit (HPU) Davis, J. et al. (2010). Computer Vision & Pattern Recognition (CVPR) Workshop on Advancing Computer Vision with Humans in the Loop (ACVHL). 8 pages.

This paper looks interesting because it tries to develop a hybrid framework that, at least conceptually, allows integration between manpower (i.e., HPU) and computer power (i.e., CPU). I think the biggest accomplishment of this paper is providing a new perspective from which to evaluate old problems. By directly comparing humans with computers, the authors essentially take a retrospective view on the development of the term "computer", which started out as a way to describe an occupation and then gradually evolved into a term for a specific type of machine. They suggest that it is time to view the "computer" as a human-integrated electronic device in order to solve the problems that cannot be solved perfectly by the CPU-driven computer alone. Their idea of applying old terms in a new context really makes me wonder whether this paradigm can be applied to other fields of research.

However, there are several concerns I want to raise after reading through the paper. One of the contributions claimed in the paper is "characterizing the HPU as an architectural platform". I think the statement is too aggressive. For example, when the authors use the color labeling task to compare the accuracy of the HPU and the CPU, I find that essentially they outsource a task that should be done by machine learning algorithms to humans, and the CPU just performs some basic statistical work. It seems that the paper suggests that we abandon the use of machine learning algorithms for certain tasks and let the HPU do the work. I think the HPU is a way to improve machine learning algorithms from 90% accuracy to 100% accuracy. We still want the CPU-based algorithms to play the major role in the system because CPU-based algorithms have proven to be stable and low-latency in a well-tuned production system. In addition, some characterizations of the HPU cannot be generalized, which prevents people from benchmarking the HPU against the CPU in a straightforward way. For example, the paper shows an empirical study of cost versus accuracy on a specific task, which cannot be fully generalized to other scenarios. This makes the authors' claim about crowdsourcing as a new architecture for production systems vulnerable because there is no clear way to estimate the performance of the HPU. Furthermore, many critical questions related to security and performance need to be addressed before we can use the HPU in a production system. For example, what would happen if the task sent to the HPU contains confidential information and this piece of information is critical for people finishing the task? How do we design the task to work around this problem? How do we handle the problem that the HPU can take several minutes, several hours or even several days to finish a certain task? How can we ensure the quality of the HPU's computation results?

"Soylent: A Word Processor with a Crowd Inside"

Note

Soylent: A Word Processor with a Crowd Inside Bernstein, M. et al., UIST 2010. Best Student Paper award. Reprinted in Communications of the ACM, August 2015.

Overall, I think this paper can be treated as a concrete example to support the HPU paper's idea because Soylent uses crowdsourcing to carry out a complex but meaningful task: editing, which goes beyond the commonly-seen crowdsourcing task: labeling the training data for machine learning algorithms. The tool shows an example of how powerful crowdsourcing can become once we get the HPU and CPU (i.e., word processor) fully integrated. One example I really like is about crowdsourced proofreading. Unlike the clueless Microsoft Word message "Fragment; consider revising", with the help from the crowd, we can get the meaningful explanation of the mistakes for different errors we make in the writing. This example also surprises me because I'm wondering how many crowd workers will take much effort writing out the explanation of the errors. Unlike usual crowdsourcing task, which is about clicking several buttons for the survey, writing the explanation can be much more demanding. In addition, I really like the "related work" section of the paper because it lists several crowdsourcing examples and I actually want to try some of them: for instance, the HelpMeOut tool for debugging code through collecting traces.

There are a couple of questions and thoughts I want to list out when I read through this paper. One is that I'm wondering how effective the Crowdproof will be if we do not pay out any money at all? In HPU paper, the authors use shirt color task as an example to show that there is no strong correlation between how much you pay for the crowd and the accuracy you can get from the task. I'm wondering if this statement will hold under crowdsourced proofreading setting. In addition, I want to learn more about The Human Macro because one of the design challenges, as pointed out in the paper, is to define the task scope for crowd worker. However, from the paper, it seems that all of the responsibility falls on the user's shoulder. Is there any way from the system-side that can help the user better tailor their task for the crowd worker? When the authors talk about how to prevent the worker from being lazy on the task, they cite a paper by Kittur et al. that says adding "clearly verifiable, quantitative questions". I am wondering how can they do that in their system because if they use this methodology, then they must use a way to automate the question generation because once the writer triggers the Soylent, the crowdsourcing tasks should be triggered automatically, which requires the question gets automatically generated. Question generation can be hard because it needs some level of text comprehension and I am really curious what is the type of method the authors use in their system. Also, similar to HPU paper, this paper does not dive into the details of how we can tackle the privacy (security) and the latency issues in order to make the system robust in real production.

"Crowd-based Fact Checking"

Note

Crowd-based Fact Checking. An T. Nguyen.

The big picture of this paper is clear to me. The author wants to automate the process of determining the correctness of claims, which is referred to as fact-checking in the field. Initially, I was confused about how fact-checking works in reality. However, after browsing through some websites listed in the paper, for example, Politifact, the goal that the author tries to achieve became clear to me. In addition, I can tell how the author knits crowdsourcing together with machine learning algorithms to develop a hybrid method. Basically, there are two sets of training data that the author leverages: one has journalist labels and the other has labels from crowdsourcing. The data with journalist labels is for the offline scenario, which does not require the machine learning algorithm to give real-time fact-checking results. Crowdsourced data is used as a way to approximate the journalist "gold standard" in the online scenario, where we trade some level of accuracy for the speed of fact-checking. This paper also links to the Soylent paper in the sense that it also discusses how to prevent the "lazy worker" scenario from happening. Specifically, in "Crowdsourced labels collection", the author requires workers to give an explanation for their label.

My questions about this paper mainly come from the technical side. I have some basic understanding of PGM and BN but clearly, that is not enough for this paper. The EM algorithm, Gibbs sampling, variational inference, and softmax are the concepts that confuse me the most. In addition, my unfamiliarity with the field makes me wonder what exactly the "independent models" are that the author refers to when he talks about the baseline for his new model. Those questions lead to a bigger and more generic question regarding research and this course: how should we approach a mathematically dense paper like this in the early phase of graduate study (i.e., the first semester of graduate school)? Hopefully, during the lecture this week, we can have some time to talk about this question. In addition to those technical questions, I'm wondering how well the variational method works. As mentioned in the "Results" section, the difference between the variational method and the baseline diminishes as more crowd labels get collected. This makes me wonder if the new model is really as good as the author claims. Are we paying too high a price (i.e., time and computational cost) to pursue a complicated model with mediocre gains when a simple model can deliver similar performance?

"Improving Twitter Search with Real-Time Human Computation"

Note

Improving Twitter Search with Real-Time Human Computation. Edwin Chen and Alpa Jain. Twitter Engineering Blog. January 8, 2013.

This article is interesting because it offers a real-world example of how we can integrate crowdsourcing into a production system. The problems associated with crowdsourcing are usually related to performance and latency. Performance often refers to the accuracy of the tasks that crowd workers finish, and latency usually measures the amount of time a task takes from start to finish. In the papers I have read so far, researchers rarely come up with good solutions to tackle these two issues and thus the architecture or the product that they come up with cannot be directly applied in the real world. That's why this article looks interesting: Twitter actually uses crowdsourcing in their production system. The way that Twitter handles these two issues is centered around people. Quite often, when there is a crowdsourcing task, people immediately think about Amazon Mechanical Turk or Crowdflower. However, Twitter uses these third-party platforms as backups and mainly relies on a "custom pool", which contains a group of crowd workers (or "judges") that are highly specialized in Twitter product scenarios. This solution may look expensive initially because "for many of them, this is a full-time job" and thus I doubt that Twitter pays only around 0.07 dollars for each task these people finish. However, I think this solution saves a lot of economic cost. For example, as pointed out in the article, those judges are recruited to handle short-term search query spikes and annotate new trends in search queries. This means latency is the key here: it is not acceptable for a crowdsourcing task to take several hours or days to finish, which is common for standard crowdsourcing tasks on those third-party platforms.
Furthermore, even if the crowd workers finish quickly, the accuracy of the task result can hardly be guaranteed because crowd workers may misunderstand the meaning of the query due to the sudden appearance of the trend. From the quality control perspective, we devote a lot of statistical methods or human intervention to improving the quality of crowdsourcing jobs in a standard setting, which may be unnecessary in Twitter's setting because the people in the pool are highly trained professionals. From Twitter's perspective, any mistake has the potential to cost millions in advertisement revenue and thus it is not hard to imagine why Twitter chooses to use their own in-house "turkers".

Note

There is also an article called Moving Beyond CTR: Better Recommendations Through Human Evaluation, written by one of the authors of the article above, which is also worth checking out.

"Platemate: crowdsourcing nutritional analysis from food photographs"

Note

Platemate: crowdsourcing nutritional analysis from food photographs Noronha, Jon, Eric Hysen, Haoqi Zhang, & Krzysztof Gajos. UIST 2011 pp. 1-12.

The paper is interesting from several perspectives. First, the problem described in the paper is important to tackle. There are plenty of food tracking applications online but many of them require tedious manual logging, which takes a fair amount of effort from the user. Can we make the whole process easier for people? In addition, many HIT design tricks are mentioned in the paper. For instance, when we ask the crowd workers to identify food items in a photo, we may want to provide several examples to guide their work. Another trick mentioned is that we need to pay attention to the subtlety of the task design in the sense that we want to break the task into its atomic form. For example, when the authors ask the workers to identify the food inside the database, the workers mentally have two tasks: identify the food and locate the food in the database. We want these two tasks carried out separately by different groups of Turkers. One trick that amused me is disabling keyboard quick selection, which is quite important for preventing the "lazy worker" problem but easy to forget during task design.

There are also several questions I want to ask. Latency is still a big issue for human computation. Specifically for this paper, the nutrition estimates are returned to the user within a few hours. In the evaluation section, the average time it takes to finish the analysis is 94.14 minutes, which is quite long. In addition, this service costs $1.40 per photo, which adds up to $1533 per year (i.e., three meals per day for 365 days). Given the cost and performance of the tool, I can hardly imagine this application becoming popular with a wide audience. This leads to the problem caused by the research methodology. This paper puts heavy weight on human computation and less on the computer-based algorithmic approach. This is confirmed by the authors in the discussion section of the paper. To me, Kitamura et al. really get close to solving the problem: they can successfully identify whether the photo contains food and the categories of food. The major pieces left out are identifying the specific foods and the actual intake. I think the former can be done with a computer-based approach as well and the latter may invoke crowdsourcing. Doing it this way may improve the performance of the whole application and reduce the cost of invoking too many crowdsourcing tasks. Furthermore, in "Step 1: Tag", the authors mention that "a combination of machine and human computation is used to select the better box group" without actually describing the exact methodology they use. I'm wondering what exactly the method is. In addition, the paper has a limitation rooted in Amazon Mechanical Turk. The problem is that only Americans can use this platform and thus, inevitably, a certain bias will be introduced into the research. In particular, this paper states that "We chose to require American Turkers due to the unique cultural context required for most elements of the process."
In other words, PlateMate is only applicable to food that is well understood in American culture, which is partially confirmed by the evaluation photos that the authors use. All those photos contain food that is commonly seen in the United States. What about food from other countries with a dramatically different cultural background? Can the components of the food still be easily understood by the American-based crowd? In my opinion, the answer is probably no, and the accuracy of the nutrition estimates may drop significantly if we use the tool in different parts of the world. The limits of Amazon Mechanical Turk, which seems to be the de facto standard for crowdsourcing research nowadays, pose constraints on the research results as well. How we accommodate this issue is worth thinking about.
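The yearly cost quoted in this note can be reproduced with a few lines of arithmetic. This is only a sketch of my own back-of-the-envelope calculation: the $1.40 per photo comes from the paper, while the three photos per day and the 365-day year are my assumptions, and the function name is made up for illustration.

```python
COST_PER_PHOTO = 1.40   # dollars per photo, as reported in the paper
MEALS_PER_DAY = 3       # assumption: one photo per meal, three meals a day
DAYS_PER_YEAR = 365     # assumption: the tool is used every day of the year


def yearly_cost(cost_per_photo, meals_per_day, days=DAYS_PER_YEAR):
    """Total yearly spend if every meal is photographed and analyzed."""
    return cost_per_photo * meals_per_day * days


total = yearly_cost(COST_PER_PHOTO, MEALS_PER_DAY)
print(f"${total:.2f} per year")
```

Changing MEALS_PER_DAY shows how quickly the bill grows once snacks are logged too.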

"An introduction to crowdsourcing for language and multimedia technology research"

Note

An introduction to crowdsourcing for language and multimedia technology research. Gareth Jones. PROMISE Winter School 2012. Springer, pp. 132-154.

This paper is centered around using crowdsourcing as a way of collecting data. Specifically, it targets language and multimedia technology research, which mainly involve natural language processing and computer vision respectively. The paper provides extensive examples of how crowdsourcing can be utilized as a way of gathering data. There are several good points made in this paper. First of all, the author provides examples that clarify the definition of crowdsourcing. Crowdsourcing can be applied in various fields. Quite often, I have a hard time coming up with examples that do not belong to crowdsourcing. The example provided by the author is crowd management at a sports event, in which recruiting more members from the crowd is not ideal. The paper also shows the recurring principle in crowdsourcing task design: "identify an activity which is amenable to being broken into small elemental tasks". Lastly, the paper provides many pointers to crowdsourcing resources and to papers that focus on specific areas of crowdsourcing task design (e.g., payment and incentives), which are good for future in-depth study.

There are several questions I want to ask after reading through the paper. I'm still confused about the exact mechanism of quality control for a crowdsourcing task. In the paper, the author states that "Once the quality of the work has been checked, the requester then has the option to accept the work and make payment to the worker, or to reject it, in which case payment is not made." I'm wondering if the requester can exploit this check-before-payment mechanism to gather the data without paying out the money. Since the work can be checked, the requester can copy the work result and then reject the work. Certainly, this will damage the requester's reputation in the long run, but the requester can use this mechanism as a way to do budget control. Another question regarding quality control is how we can check the quality of the work without going through every submission. The paper does not show how the RSR task handles this issue. One way the author suggests to do quality control is to come up with "honey pot" questions, which have answers known to the requester. I'm wondering what fraction of the work that contains "honey pot" questions will cause false positives. Based on my experience with CrowdFlower, I feel some "honey pot" questions are too hard to get right. In that situation, how can we distinguish between spammers and the workers who actually put effort into the task?
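One way I can imagine approaching the question above is to score the honey pots themselves before scoring the workers. The sketch below is my own heuristic, not something from the paper: every name, the toy data, and both thresholds are assumptions. It drops any honey pot that almost nobody answers correctly, then flags workers whose accuracy on the remaining honey pots is low.

```python
from collections import defaultdict


def accuracy_tables(responses, gold):
    """Return (accuracy per worker, accuracy per question) on gold questions.

    responses: list of (worker_id, question_id, answer) tuples.
    gold: dict mapping honey pot question_id to its known answer.
    """
    worker_hits = defaultdict(lambda: [0, 0])    # worker -> [correct, total]
    question_hits = defaultdict(lambda: [0, 0])  # question -> [correct, total]
    for worker, question, answer in responses:
        if question not in gold:
            continue  # ordinary task question, no known answer
        correct = int(answer == gold[question])
        for table, key in ((worker_hits, worker), (question_hits, question)):
            table[key][0] += correct
            table[key][1] += 1
    worker_acc = {w: c / n for w, (c, n) in worker_hits.items()}
    question_acc = {q: c / n for q, (c, n) in question_hits.items()}
    return worker_acc, question_acc


def flag_workers(responses, gold, min_worker_acc=0.7, min_question_acc=0.3):
    """Flag workers with low accuracy, ignoring honey pots that look too hard."""
    _, question_acc = accuracy_tables(responses, gold)
    usable = {q: a for q, a in gold.items()
              if question_acc.get(q, 0.0) >= min_question_acc}
    worker_acc, _ = accuracy_tables(responses, usable)
    return sorted(w for w, acc in worker_acc.items() if acc < min_worker_acc)


# Toy data: honey pot g3 is so hard that nobody gets it right.
gold = {"g1": "A", "g2": "B", "g3": "D"}
responses = [
    ("alice", "g1", "A"), ("alice", "g2", "B"), ("alice", "g3", "C"),
    ("bob", "g1", "A"), ("bob", "g2", "B"), ("bob", "g3", "A"),
    ("spam", "g1", "C"), ("spam", "g2", "C"), ("spam", "g3", "C"),
]
naive_acc, _ = accuracy_tables(responses, gold)
flagged = flag_workers(responses, gold)
print(naive_acc)
print(flagged)
```

With the naive scoring, the two careful workers fall below a 0.7 cutoff only because of the one unfair honey pot; after that honey pot is dropped, only the worker who always answers "C" is flagged. The thresholds here are arbitrary and would need tuning on real data.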

"ImageNet Large Scale Visual Recognition Challenge"

Note

ImageNet Large Scale Visual Recognition Challenge.

Latency is a big problem in crowdsourcing. Usually, crowdsourcing tasks take several days or weeks to finish. Is there any way to speed up the whole process and reduce the latency of the response without sacrificing much of the quality of the tasks? One idea is to build a cache between the application and the crowdsourcing platform, which uses machine learning techniques to identify the similarity between two given tasks and optionally uses the crowd to verify that the two tasks are indeed similar or even the same. Then, we can reuse the result from the previous task to speed up the whole crowdsourcing process.
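To make the cache idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the class name, the threshold, and the verify hook are my own inventions, and I use a plain word-overlap (Jaccard) similarity as a stand-in for whatever machine learning model would actually measure task similarity.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words in a task description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a learned model."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class TaskCache:
    """Sits between the application and the crowdsourcing platform."""

    def __init__(self, threshold=0.8, verify=None):
        self.threshold = threshold
        # Hypothetical hook: a callable that asks the crowd whether two
        # tasks are really the same; None skips the optional verification.
        self.verify = verify
        self.entries = []  # list of (task_text, result)

    def put(self, task_text, result):
        self.entries.append((task_text, result))

    def get(self, task_text):
        """Return a cached result for a similar enough task, else None."""
        best = max(self.entries,
                   key=lambda entry: jaccard(task_text, entry[0]),
                   default=None)
        if best is None or jaccard(task_text, best[0]) < self.threshold:
            return None  # cache miss: the task must go to the crowd
        if self.verify is not None and not self.verify(task_text, best[0]):
            return None  # the crowd says the two tasks differ
        return best[1]


cache = TaskCache(threshold=0.8)
cache.put("Is there a cat in this photo", "yes")
hit = cache.get("is there a cat in this photo")
miss = cache.get("Transcribe this receipt")
print(hit, miss)
```

The verification step still costs a crowd round trip, so the cache only saves time if checking that two tasks match is much faster than redoing the task.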

ImageNet is a legendary example of crowdsourcing. There is plenty of media coverage of this challenge. The paper shows how the team from Stanford composed this benchmark dataset and how the dataset changed the landscape of computer vision. The essential requirement for a benchmark dataset is that it has to be sufficiently accurate so that researchers can use it to train and evaluate their learning algorithms. This necessarily poses a big challenge to the design of the crowdsourcing task: how do we collect 1,461,406 images and correctly annotate them for different computer vision tasks?

One principle is to design crowdsourcing tasks that target specific goals. There are three goals for this dataset: image classification, single-object localization, and object detection. For image classification dataset annotation, we can utilize a voting system for the crowdsourcing task. However, for single-object localization annotation, we may want to apply different crowdsourcing principles by making the tasks “as simple as possible” and ensuring each “has a fixed and predictable amount of work”. In addition, insights from the goal may help us design the crowdsourcing task better. One example is that the authors find that “different categories require different levels of consensus among users.” For example, the number of crowd workers required to verify cat images is less than the number required for Burmese cat images. This can save the researchers a decent amount of the budget for crowdsourcing tasks. Another example is the hierarchical algorithm they developed for multi-class annotation. In addition to those details, I found some interesting papers for my future reading on this topic: “Crowdsourcing annotations for visual object detection” and “Scalable multi-label annotation”. Lastly, the authors compare the machine-based algorithm with human annotators and show that humans can still beat the computer in computer vision tasks. I think it is strong evidence of how good the HPU can be.

For a survey paper like this, some details get omitted but are interesting to ask about from the crowdsourcing perspective. When the authors evaluate the image classification dataset annotation, they “manually checked 1500 ILSVRC 2012-2014 image classification test set images”. My question is how they sampled those 1500 images. How do they translate 5 annotation errors in those images into “99.7% precision”? In addition, they “visually” check the accuracy of the bounding boxes for the single-object localization dataset, and I’m wondering if this checking procedure is rigorous enough. Computers are known for a level of accuracy that humans cannot match. A bounding box may look good to human eyes but may not be accurate from the computer’s perspective.
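My best guess at the arithmetic behind the quoted precision is simply one minus the error rate on the checked sample. The sketch below works that out and, as my own addition rather than anything the paper reports, attaches a Wilson score interval to show how much uncertainty a sample of 1500 leaves. It assumes the 1500 images were a simple random sample, which is exactly the detail I could not find.

```python
import math


def precision_from_errors(n_checked, n_errors):
    """Fraction of checked images whose annotation was correct."""
    return 1 - n_errors / n_checked


def wilson_interval(n_checked, n_errors, z=1.96):
    """Wilson score interval for the precision; z=1.96 gives roughly 95%.

    Assumes the checked images are a simple random sample.
    """
    p = precision_from_errors(n_checked, n_errors)
    denom = 1 + z * z / n_checked
    center = (p + z * z / (2 * n_checked)) / denom
    half = z * math.sqrt(p * (1 - p) / n_checked
                         + z * z / (4 * n_checked ** 2)) / denom
    return center - half, center + half


precision = precision_from_errors(1500, 5)
low, high = wilson_interval(1500, 5)
print(f"precision = {precision:.1%}")
print(f"approx. 95% interval = [{low:.2%}, {high:.2%}]")
```

If the sample was not random (for example, drawn only from easy categories), the interval says nothing about the rest of the dataset, which is why the sampling question matters.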

"Visual Dialog"

Note

Visual Dialog.

Visual Dialog is extremely similar to a usual chatbot except that the chat is centered around the images provided. Unlike VQA, Visual Dialog focuses on the dialogue, which requires a chat history about the picture that both the bot and the human talk about. I played around with the live demo of this paper online, and I found that the bot is extremely good at image captioning. I uploaded a picture of a cat and the bot immediately captioned it with the title “a cat is standing in a window sill”. However, in terms of the details of the image, the bot does not seem very good. One question I asked was “what’s inside the sill?” and the bot replied “orange and white”. Then, I asked “How many cats are there?” and the bot replied “2”. To me, the bot has difficulty doing object detection correctly and it is really hard for me to keep the conversation going. One thing I find really cool about this paper is their AMT task design. Before reading this paper, my idea of what kinds of tasks can be performed on Amazon Mechanical Turk was limited. I had never thought of “hosting a live two-person chat on AMT”, and the authors even built their own “backend messaging and data-storage infrastructure based on Redis messaging queues and Node.js”. The interface they designed is quite clean and meets their goal. However, I think they could turn this into a game to motivate people to actively get involved in the conversation. Besides the task design, I got a sense of what we should talk about when we collect data for a paper. Basically, we need to give the statistics and analysis of the dataset by listing its components, the distribution of the questions we asked or collected, the answers we got, and so on.

"VQA: Visual Question Answering"

Note

VQA: Visual Question Answering

VQA is an AI task that combines Computer Vision (CV) with Natural Language Processing (NLP). The user can ask questions that are best answered based on the image provided. The paper has several interesting points. The first is the crowdsourced task design. The researchers try to pose the questions in a way that can “elicit the most interesting and diverse questions”. One sentence I really like in their “smart robot” interface is “Your task is to stump this smart robot!” I certainly want to come up with tricky questions that can beat the researchers’ “evil” robot. From this, I learned that the instruction text also affects the quality of crowdsourcing tasks. The goal of the crowdsourcing in this task is to get as many diverse questions as possible. Carefully crafting the instruction text is one way, and the other way is to better design the questions that appear in the survey. One trick the researchers use is “when writing a question, the subjects were shown the previous questions already asked for that image to increase the question diversity”. One important idea I learned from this paper is doing question analysis on the dataset. In the paper, when the researchers come up with the candidate answers for the crowdsourcing task, they employ machine learning techniques to compose their “18 candidate answers”. For example, they “gather additional answers from nearest neighbor questions using a bag-of-words model”. In addition to the question analysis, the researchers also study the impact of the task interface design on the dataset they collected. In Appendix I, the researchers study the spatial relationship between the text and the image. These design practices make me wonder if there is a way to decide on a good strategy for task design. Statistically speaking, what is an effective way to design a survey (i.e., a crowdsourcing task)?

"Zooniverse: observing the world's largest citizen science platform"

Note

Zooniverse: observing the world's largest citizen science platform. Robert Simpson et al., WWW 2014 companion, pp. 1049-1054.

The Zooniverse project looks both similar to and different from the crowdsourcing platforms that we have seen so far in the semester. The similarity is that both Zooniverse and Amazon Mechanical Turk, for example, involve data collection. On both platforms, the crowd can “identify, classify, mark, and label” the data. However, Zooniverse is different from AMT in the sense that they “brand” their platform as a place to perform “citizen science”, which can draw much more potential from the crowd than AMT, which is perceived as a place to do a job and get paid. One example is the “Mutual Muses” project, in which the crowd is asked to “transcribe Correspondence by Critic Lawrence Alloway and Artist Sylvia Sleigh”. Surely, on AMT, requesters can still post tasks that are the same as this. But branding the data collection process as a citizen science project can make the crowd much more attentive to the work they do. Unlike other crowdsourcing platforms, Zooniverse takes a holistic approach to the crowd. The projects I have browsed so far do not have a “quiz mode” but do have well-crafted tutorials. The interface design is much more modern and the projects themselves are “cooler” than the tasks found on AMT or Crowdflower. One thing I notice is that on Zooniverse, people are told about the mission of the project, but on AMT, the crowd is barely informed about what the data is used for, which makes the platform feel like a place to earn some extra cash and not a place to help with research. The architecture introduction of the Zooniverse platform is uninteresting from the crowd's perspective but is definitely worth reading for people who want to build a platform for large-scale crowdsourcing tasks. One point I really like is that the authors are well aware that “The creation of engaging user experiences is essential to getting the best from volunteers online”, which I appreciate much more after working with the Crowdflower and AMT platforms.
Zooniverse’s smooth working process and the mission of the projects make me forget that most of the work I do on the platform is actually unpaid. Near the end of the paper, the authors discuss the potential of data visualization. I think it is a really great idea, especially when the platform treats the crowd not as “workers” but as “researchers”.

"Practical Lessons for Gathering Quality Labels at Scale"

Note

Alonso, O. (2015). Practical Lessons for Gathering Quality Labels at Scale. 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1089-1092).

Alonso (2015) gathers practical lessons for designing large-scale crowdsourcing tasks. Many good tips mentioned in the paper are invaluable resources for crowdsourcing task design. For example, besides the instruction rules, showing examples is a good way to make crowd workers productive and their results meaningful. Also, the HIDDEN structure proposed in the paper answers my doubts about how to check the performance of the workers if the gold set is missing. In addition, the “honey pots” strategy is mentioned in the paper and the author uses it as a way to remove incapable workers at the beginning of the task and to perform random checks during task execution. Some questions are also worth asking. For example, I am very glad to see that the bias problem gets mentioned by the author in his “data-worker-task” debugging framework. However, how we can handle those biases is a totally different issue and the author does not dive into the details. I think that may be due to the complexity of the issue, and a detailed discussion may not fit into the whole framework. However, without some concrete suggestions on how to handle the bias, the proposed framework does not look concrete to me. Another question comes from the comparison between the inter-rater statistic and percentage agreement. Specifically, I’m wondering what the drawback of using the percentage agreement statistic is. How can we measure whether an inter-rater statistic is good or bad? Overall, I really like this paper as it provides a very useful overview of crowdsourcing task design and it gives a list of questions that we researchers may want to ask during task design.
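A small experiment helps me see one drawback of percentage agreement. The sketch below compares it with Cohen's kappa, a common inter-rater statistic that corrects for agreement expected by chance. Treating kappa as the statistic the paper has in mind is my assumption, and the label data is a made-up toy example.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters pick the same label independently.
    p_chance = sum(count_a[label] * count_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy data: both raters call almost everything "rel" (relevant),
# and they never agree on which item is "non" (non-relevant).
rater_a = ["rel"] * 9 + ["non"]
rater_b = ["rel"] * 8 + ["non", "rel"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
```

Here the raters agree on 8 of 10 items, yet kappa comes out slightly below zero because two raters who label almost everything "rel" would agree that often by chance. This suggests that percentage agreement can look good on skewed labels without telling us much about the raters.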

"Crowdsourcing user studies with Mechanical Turk"

Note

Kittur, A., Chi, E. H., & Suh, B. (2008, April). Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 453-456).

Kittur et al. (2008) perform an actual user study on crowdsourcing. As in Alonso’s paper, many tips for crowdsourcing task design are given in the paper: integrating verifiable questions, minimizing the effort gap between spammers and good workers, and various ways of detecting suspect responses. However, the authors of “Crowdsourcing user studies with Mechanical Turk” also apply the tips they list in two experiments. I'm really surprised to learn how significant an effect the crowdsourcing task design can have on the end result. In addition, the authors provide some measurements of the crowd we are facing on the platform. Specifically, contrary to the commonly believed “widespread gaming”, only a small fraction of people are actually spammers. The rest of the crowd have no incentive to finish the task carelessly at the very beginning but might eventually become careless due to a poorly designed crowdsourcing task. This observation about the crowd further confirms the importance of crowdsourcing task design.

"Automan: A platform for integrating human-based and digital computation"

Note

Automan: A platform for integrating human-based and digital computation. Barowy, Daniel W., Charlie Curtsinger, Emery D. Berger, and Andrew McGregor. Communications of the ACM 59, no. 6 (2016): 102-109.

This paper takes a different angle on crowdsourcing task design: it designs a programming language that wraps the details of crowdsourcing tasks in function calls. The greatest benefit of doing so is that it provides a unified interface to the programmer so that the programmer does not need to worry too much about the underlying crowdsourcing task design, which makes the whole program portable. In other words, we can tune the configuration of AUTOMAN to make the whole application work for different goals using different crowdsourcing platforms. Another benefit provided by the crowdsourcing programming language abstraction is a better task design experience. Normally, without the abstraction, we need to design the task content and also come up with the mechanism to perform quality control. However, now, with the function call implementation of crowdsourcing, we can put all our effort into the design of the crowdsourcing task content (i.e., the specific questions for the crowd) instead of designing the whole workflow from beginning to end, which can save people a fair amount of time. There are still some questions worth asking about the programming language implementation. For example, for free-text input, the current implementation uses pattern matching to verify the worker input and performs probability analysis for quality. However, there seems to be no semantic analysis of the response provided by the worker. This reflects one of the shortcomings of the paper’s programming language implementation for crowdsourcing tasks: some parts of the crowdsourcing tasks cannot be fully automated as part of the system. In other words, for free-text input, people still need to go into the actual text to see the semantic meaning of the response. This is especially important when the programming language supports a quality control mechanism.
Without understanding the meaning of the response, one can hardly assess its quality.

"Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms"

Note

Donna Vakharia and Matthew Lease. Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms. In Proceedings of the iConference, 2015.

The paper provides a detailed comparison among seven crowdsourcing platforms. One noticeable phenomenon I observed in the papers we have read so far is that all of them perform their work on AMT. That makes AMT a de facto standard for crowdsourcing and human computation research. One possible motivation for researchers to use AMT for their work is that AMT is well understood by the research community. The limitations and functionality of the platform are known to the community and researchers do not have to go through an introduction to the platform before jumping into their actual contribution. However, as suggested in the paper, using AMT alone can be problematic. One consequence is that the diversity of the research will be limited due to the constraints posed by the platform. Thus, in order to encourage the community to adopt various crowdsourcing platforms for research, a survey of existing crowdsourcing platforms is a must and I’m very happy that I can read a paper like this one. A survey of the existing platforms can also provide a practical guide for researchers to pick the platform that is best suited to their goals. For example, we do not have to reinvent the wheel by building some fancy infrastructure for complex tasks on AMT if other platforms like WorkFusion and CloudFactory have already built tools that can be used out of the box. However, I think the paper could encourage researchers to try out different platforms more if it provided some typical usage scenarios that are best suited to each platform. Some platforms like CloudFactory can be quite different from AMT in the sense that CloudFactory puts more emphasis on enterprise users than on individual requesters.

"Instrumenting the crowd: using implicit behavioral measures to predict task performance"

Note

Instrumenting the crowd: using implicit behavioral measures to predict task performance. Jeffrey Rzeszotarski and Aniket Kittur. ACM UIST Conference, pp. 13-22, 2011.

In the paper, the authors talk about how we can utilize the meta information about MTurk tasks to predict the quality of the work done by the crowd. This is very useful for quality control because we always want to remove lazy workers from our workforce. That is why we study different quality control techniques throughout the semester. In the paper, the authors think that by logging interface data (e.g., mouse movements, keystrokes, response time), we can predict the quality of the work using machine learning techniques. There are a couple of questions about this method. First, there is a natural latency in making predictions for quality control purposes. In the paper, the logging data is uploaded via an opt-in button near the end of the task and then all the data mining work is performed on the backend. However, spammers are unlikely to submit their logging data to the remote server and the data mining work takes time to finish. During that time, the good workers may switch to other tasks. In the paper, the authors think that a prediction model trained for one task can be applied to similar tasks. I think this can remove the latency to some degree, but it is hard for requesters to tell when the direct application of models works and when the pre-trained models introduce unseen errors. Second, we cannot remove bad workers solely based on the prediction result. There is a false positive risk that we may remove good workers due to the data inaccuracy caused by the “cross-browser compatibility” issue. So, I think the prediction model still belongs to the post-hoc class, and it seems ineffective to me that instead of targeting the work the workers have done, we focus on studying how each worker behaves. However, I think the models proposed in the paper are useful when we handle the data generated by workers that fall on the borderline between good and bad work.

"MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk"

Note

MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk. Brandon Dang, Miles Hutson, and Matthew Lease. In 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP): Works-in-Progress Track, 2016. 3 pages. arXiv:1609.00945.

In the second paper, the authors propose a framework called MmmTurkey that is built on top of Mechanical Turk, which allows requesters to easily customize their tasks and at the same time log the workers’ behaviors. I find the framework particularly promising, especially for requesters who want to deeply customize their HITs while keeping track of the workers’ behaviors to perform quality control. I’m curious how exactly the framework works under the hood. Can it replace AMT completely in the sense that I can post HITs from the MmmTurkey interface and wait for the result? The paper seems to agree with my thinking but I’m wondering if there are any corner cases I need to be aware of.

"AMAZON'S TURKER CROWD HAS HAD ENOUGH"

Note

MIRANDA KATZ. AMAZON'S TURKER CROWD HAS HAD ENOUGH. WIRED: BACKCHANNEL. August 23, 2017.

In the first article, the author describes MTurk and crowdsourcing in general from the workers’ perspective. The idea is simple: the workers will deliver their best work given good compensation and transparent communication. The message characterizes the future of the crowdsourcing industry in that MTurk still has its advantages and the platform will dominate the industry if it can take much better care of the workers and make its communication clearer. In addition, the article argues that simply replicating MTurk with some minor additions to its functionality will not last long in the industry. I agree with the author that building a reliable platform has to be organization and community driven. This is not to say that an academic effort like Daemo is worthless. Academia is very good at building innovative prototypes. However, in order for the platform to be scalable and robust, some large community or organizational effort has to come in. From this perspective, MTurk has a clear edge over its competitors.

"The Future of Work: Caring for the Crowdworker Going It Alone"

Note

The Future of Work: Caring for the Crowdworker Going It Alone (blog post). Mary Gray (Microsoft Research), Pacific Standard, August 21, 2015.

The second article shares the big picture with the previous one. On one hand, they focus on the traditional crowdsourcing platforms with microtasks. On the other hand, they expand the term crowdsourcing more broadly to the on-demand sharing economy, which includes Uber and Airbnb as well. This article specifically describes the future of the crowdsourcing platform, which should center around the workers. Compensation is not the only factor that future crowdsourcing platforms should take care of. A third-party registry that allows workers to build their resumes, as well as healthcare, should also be valued, just as they are for regular 40-hour employees.

"The Good, The Bad and the Ugly: Why Crowdsourcing Needs Ethics"

Note

The Good, The Bad and the Ugly: Why Crowdsourcing Needs Ethics. Schmidt, F. A. (2013, September). In Cloud and Green Computing (CGC), 2013 Third International Conference on (pp. 531-535). IEEE.

This paper reminds me of the HPU paper we read very early in the semester. HPU tends to think of the crowd as machines that can work like a CPU. However, framing the crowd like this inevitably gives the impression that the crowd can be treated cheaply. This impression is exactly what the paper tries to address. The paper categorizes crowdsourcing behaviors based on the crowd’s incentives. Then, the paper studies some platforms and shows how they try to exploit their workers’ incentives and treat them cheaply. These three readings make me think that we cannot treat crowdsourcing merely as a way to get tasks done cheaply. We also need to take care of the people who make all those great contributions to the advancement of civilization.

"Information Extraction and Manipulation Threats in Crowd-Powered Systems"

Note

Lasecki, Walter S., Jaime Teevan, and Ece Kamar. Information Extraction and Manipulation Threats in Crowd-Powered Systems. CSCW 2014.

In this paper, the authors talk about different forms of threats, specifically extracting information from crowd-based systems and manipulating the systems’ outcomes. One example of information extraction is that workers can extract sensitive information (e.g., a credit card number) from a given picture in an image labeling task. Another example is that if workers collaboratively mislabel images, a machine learning algorithm trained on the crowdsourced data can produce the wrong output. In the paper, the authors present several ways of preventing data leakage. One approach is the division of a task into micro-tasks. The idea is that we do not want each worker to see too much information. However, a consequence of this approach is that workers may be manipulated by requesters into doing things against their will. One example mentioned by Caverlee is that Iran’s leaders use workers to cross-reference the faces of citizens with those of photographed demonstrators. The authors further subdivide the information extraction threats into exposure, exploitation, and reconstruction. In addition, they classify answer manipulation into classic manipulation, disruption, and corruption. Those threats are viewed from the requesters’ perspective. This paper makes me appreciate the importance of quality control more, and we need to pay extra attention to sensitive information present in images. I think we somehow need to run a preliminary check on the data that are put onto the crowdsourcing platform.

"Exploitation in Human Computation Systems"

Note

Caverlee, James. Exploitation in Human Computation Systems. In Springer Handbook of Human Computation, pp. 837-845, 2013.

In this paper, the author surveys the exploitation of human computation systems from three perspectives: workers, requesters, and the system itself. There are many surprising points mentioned in the paper. For example, the divide-and-conquer strategy for a task can be used to manipulate workers against their will because each worker cannot see much of the context of the task they work on. In some cases, this can prevent workers from leaking sensitive information from the task. For example, if we only allow each worker to see a three-digit number, they cannot infer that they are actually working on credit card information. However, as mentioned in the paper, people can end up helping the government perform surveillance tasks that are against the workers’ will. In addition, I’m surprised to learn how cleverly people can utilize a crowdsourcing platform. One example is using the crowdsourcing platform to manipulate political views. Another example is that workers can be hired to work collaboratively to manipulate certain task results in ways that hurt their employer’s competitors. Another important issue is also related to the exploitation of the crowdsourcing system. For example, people can organize workers to post fake news on social media to manipulate public opinion. People spend a lot of effort on detecting fake news and, thanks to this paper, I have started to think about how fake news can spread so quickly and so widely across different social media platforms.

"Dirty Deeds Done Dirt Cheap: A Darker Side to Crowdsourcing"

Note

Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side to Crowdsourcing. In Privacy, security, risk and trust (passat), 2011, pp. 1314-1317. IEEE.

In this paper, the author talks about the potential for crowdsourcing systems to be used for unethical purposes. One interesting issue first posed by the paper is how to define unethical. As shown in the paper, different demographic groups view unethical behavior differently. This poses challenges for policing ethical behavior on the internet. However, I’m very interested to know more about how the author might want to model the different unethical behaviors. Another interesting point made by the author is the mention of social engineering. The motivations for social engineering mentioned in the paper are financial gain and identity theft. However, I think this is a big issue, especially if we consider that a government can utilize the crowdsourcing platform to perform surveillance against the nation’s citizens. This affects people’s privacy and human rights. That leads to another question: how can we prevent the crowdsourcing platform from being used to hurt people’s rights? One suggestion made in the paper is to use the law to make certain crowdsourcing behavior illegal. However, I think the platform needs to ban certain tasks from being performed and list those forbidden tasks in the user agreement. At the same time, some supervision needs to be performed by the platform's staff members.

Introduction to Conditional Random Fields

2017-09-22T10:20:00+08:00

Note

This is a repost of Edwin Chen's 2012 blog post: Introduction to Conditional Random Fields. I do this for three purposes: 1. Bookmark it for my course project reference 2. Fix the math rendering issues in the original post 3. Make small tweaks to the layout of math and notation to ease my understanding.

Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.

By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.

Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.

Part-of-Speech Tagging

Let’s go into some more detail, using the more common example of part-of-speech tagging.

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.

So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions $f_i$.

Feature Functions in a CRF

In a CRF, each feature function is a function that takes in:

a sentence s
the position i of a word in the sentence
the label $l_i$ of the current word
the label $l_{i-1}$ of the previous word

as input and outputs a real-valued number (though the numbers are often just either 0 or 1).

(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)

For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.

Features to Probabilities

Next, assign each feature function $f_j$ a weight $\lambda_j$ (I’ll talk below about how to learn these weights from the data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:

$$ \text{score}(l | s) = \sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1}) $$

(The first sum runs over each feature function $j$, and the inner sum runs over each position $i$ of the sentence.)

Finally, we can transform these scores into probabilities $p(l | s)$ between 0 and 1 by exponentiating and normalizing:

$$ p(l | s) = \frac{\exp[\text{score}(l|s)]}{\sum_{l'} \exp[\text{score}(l'|s)]} = \frac{\exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l'} \exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l'_i, l'_{i-1})]} $$

Example Feature Functions

So what do these feature functions look like? Examples of POS tagging features could include:

$f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the $i$th word ends in “-ly”; 0 otherwise.
- If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.
$f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise.
- Again, if the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.
$f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise.
- Again, a positive weight for this feature means that adjectives tend to be followed by nouns.
$f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION; 0 otherwise.
- A negative weight $\lambda_4$ for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.
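To make the scoring formula concrete, here is a minimal Python sketch of the four example features with a brute-force normalizer. The weights, the tag list, the toy sentence, and the 0-based positions (with `None` standing in for the missing previous label at the first word) are all assumptions made for illustration; they are not from the original post.

```python
import itertools
import math

TAGS = ["NOUN", "VERB", "ADVERB", "ADJECTIVE", "PREPOSITION"]

# Each feature takes (s, i, l_i, l_prev). Positions are 0-based here, and
# l_prev is None for the first word (an assumption of this sketch).
def f1(s, i, l_i, l_prev):
    return 1.0 if l_i == "ADVERB" and s[i].endswith("ly") else 0.0

def f2(s, i, l_i, l_prev):
    return 1.0 if i == 0 and l_i == "VERB" and s[-1] == "?" else 0.0

def f3(s, i, l_i, l_prev):
    return 1.0 if l_prev == "ADJECTIVE" and l_i == "NOUN" else 0.0

def f4(s, i, l_i, l_prev):
    return 1.0 if l_prev == "PREPOSITION" and l_i == "PREPOSITION" else 0.0

FEATURES = [f1, f2, f3, f4]
WEIGHTS = [2.0, 1.5, 1.0, -3.0]  # made-up lambda values, not learned

def score(labels, s):
    """Weighted sum of every feature over every position of the sentence."""
    total = 0.0
    for lam, f in zip(WEIGHTS, FEATURES):
        for i in range(len(s)):
            l_prev = labels[i - 1] if i > 0 else None
            total += lam * f(s, i, labels[i], l_prev)
    return total

def prob(labels, s):
    """Exponentiate and normalize over all labelings (exponential cost)."""
    z = sum(math.exp(score(c, s))
            for c in itertools.product(TAGS, repeat=len(s)))
    return math.exp(score(labels, s)) / z

sentence = ["He", "ran", "quickly"]
good = ("NOUN", "VERB", "ADVERB")
bad = ("NOUN", "VERB", "NOUN")
print(prob(good, sentence) > prob(bad, sentence))  # True
```

The normalizer enumerates every labeling, so this is only usable on toy inputs; it is meant to mirror the formula, not to be efficient.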

Now let’s step back and compare CRFs to some other common machine learning techniques.

Smells like Logistic Regression…

The form of the CRF probabilities $p(l | s) = \frac{\exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l'} \exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l'_i, l'_{i-1})]}$ might look familiar.

That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.

Looks like HMMs…

Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

$$ p(l,s) = p(l_1) \prod_i p(l_i | l_{i-1}) p(w_i | l_i) $$

where

$p(l_i | l_{i-1})$ are transition probabilities (e.g., the probability that a preposition is followed by a noun);
$p(w_i | l_i)$ are emission probabilities (e.g., the probability that a noun emits the word “dad”). Notice $w_i$ means the word $i$ in a sentence.

So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.

Note that the log of the HMM probability is $\log p(l,s) = \log p(l_1) + \sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i)$. This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.

That is, we can build a CRF equivalent to any HMM by…

For each HMM transition probability $p(l_i = y | l_{i-1} = x)$, define a set of CRF transition features of the form $f_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $l_i = y$ and $l_{i-1} = x$. Give each feature a weight of $w_{x,y} = \log p(l_i = y | l_{i-1} = x)$.
Similarly, for each HMM emission probability $p(w_i = z | l_{i} = x)$, define a set of CRF emission features of the form $g_{x,z}(s, i, l_i, l_{i-1}) = 1$ if $w_i = z$ and $l_i = x$. Give each feature a weight of $w_{x,z} = \log p(w_i = z | l_i = x)$. Again, $w_i$ represents the word $i$.

Thus, the score $p(l|s)$ computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.
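The construction can be checked numerically on a toy HMM. In the sketch below, the two tags, the two words, and all probability tables are invented for illustration, and the start probability $p(l_1)$ is treated as one more indicator feature (an assumption of this sketch). The CRF built from log-probability weights should give the same $p(l | s)$ as the HMM posterior $p(l, s) / p(s)$.

```python
import itertools
import math

TAGS = ["N", "V"]
# Hypothetical HMM parameters; each distribution sums to 1.
START = {"N": 0.7, "V": 0.3}                      # p(l_1)
TRANS = {("N", "N"): 0.4, ("N", "V"): 0.6,        # key: (previous tag, tag)
         ("V", "N"): 0.8, ("V", "V"): 0.2}
EMIT = {("N", "dogs"): 0.6, ("N", "bark"): 0.4,   # key: (tag, word)
        ("V", "dogs"): 0.1, ("V", "bark"): 0.9}

def hmm_joint(labels, words):
    """p(l, s) under the HMM."""
    p = START[labels[0]]
    for i in range(1, len(labels)):
        p *= TRANS[(labels[i - 1], labels[i])]
    for tag, word in zip(labels, words):
        p *= EMIT[(tag, word)]
    return p

def crf_score(labels, words):
    """Sum of indicator features, each weighted by a log-probability."""
    total = math.log(START[labels[0]])
    for i in range(1, len(labels)):
        total += math.log(TRANS[(labels[i - 1], labels[i])])
    for tag, word in zip(labels, words):
        total += math.log(EMIT[(tag, word)])
    return total

def all_labelings(words):
    return list(itertools.product(TAGS, repeat=len(words)))

def crf_prob(labels, words):
    z = sum(math.exp(crf_score(c, words)) for c in all_labelings(words))
    return math.exp(crf_score(labels, words)) / z

def hmm_posterior(labels, words):
    evidence = sum(hmm_joint(c, words) for c in all_labelings(words))
    return hmm_joint(labels, words) / evidence

words = ["dogs", "bark"]
for labels in all_labelings(words):
    print(labels, round(crf_prob(labels, words), 6),
          round(hmm_posterior(labels, words), 6))
```

The two printed columns agree for every labeling, which is the sense in which the CRF is equivalent to the HMM.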

However, CRFs can model a much richer set of label distributions as well, for two main reasons:

CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.
CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g., $0 \le p(w_i | l_i) \le 1$ and $\sum_w p(w_i = w | l_i) = 1$), the weights of a CRF are unrestricted (e.g., $\log p(w_i | l_i)$ can be anything it wants).

Learning Weights

Let’s go back to the question of how to learn the feature weights in a CRF. One way is (surprise) to use gradient ascent.

Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones, for each training example…

Go through each feature function $f_i$, and calculate the gradient of the log probability of the training example with respect to $\lambda_i$: $\frac{\partial}{\partial \lambda_i} \log p(l | s) = \sum_{j = 1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l'} p(l' | s) \sum_{j = 1}^n f_i(s, j, l'_j, l'_{j-1})$, where $j$ runs over the $n$ positions of the sentence.
Note that the first term in the gradient is the contribution of feature $f_i$ under the true label, and the second term in the gradient is the expected contribution of feature $f_i$ under the current model. This is exactly the form you’d expect gradient ascent to take.
Move $\lambda_i$ in the direction of the gradient: $\lambda_i = \lambda_i + \alpha [\sum_{j = 1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l'} p(l' | s) \sum_{j = 1}^n f_i(s, j, l'_j, l'_{j-1})]$ where $\alpha$ is some learning rate.
Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).

In other words, every step takes the difference between what we want the model to learn and the model’s current state, and moves $\lambda_i$ in the direction of this difference.
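The update rule can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the two features, the single training example, the learning rate, and the choice to start from zero weights (instead of random ones) so the run is reproducible. The expectation under the model is computed by brute-force enumeration, which only works for tiny tag sets and short sentences.

```python
import itertools
import math

TAGS = ["NOUN", "VERB"]

# Two hypothetical features; l_prev is None at the first (0-based) position.
def f_cap_noun(s, j, l_j, l_prev):
    return 1.0 if s[j][0].isupper() and l_j == "NOUN" else 0.0

def f_noun_verb(s, j, l_j, l_prev):
    return 1.0 if l_prev == "NOUN" and l_j == "VERB" else 0.0

FEATURES = [f_cap_noun, f_noun_verb]

def feature_total(f, s, labels):
    """Sum of one feature over all positions of the sentence."""
    return sum(f(s, j, labels[j], labels[j - 1] if j > 0 else None)
               for j in range(len(s)))

def score(weights, s, labels):
    return sum(w * feature_total(f, s, labels)
               for w, f in zip(weights, FEATURES))

def prob(weights, s, labels):
    z = sum(math.exp(score(weights, s, c))
            for c in itertools.product(TAGS, repeat=len(s)))
    return math.exp(score(weights, s, labels)) / z

def gradient(weights, s, labels):
    """Observed feature count minus expected count under the current model."""
    candidates = list(itertools.product(TAGS, repeat=len(s)))
    probs = [prob(weights, s, c) for c in candidates]
    grad = []
    for f in FEATURES:
        observed = feature_total(f, s, labels)
        expected = sum(p * feature_total(f, s, c)
                       for p, c in zip(probs, candidates))
        grad.append(observed - expected)
    return grad

def train(examples, steps=200, alpha=0.5):
    weights = [0.0] * len(FEATURES)
    for _ in range(steps):
        for s, labels in examples:
            g = gradient(weights, s, labels)
            weights = [w + alpha * g_i for w, g_i in zip(weights, g)]
    return weights

examples = [(["Bob", "drank"], ("NOUN", "VERB"))]
learned = train(examples)
print(prob([0.0, 0.0], *examples[0]), prob(learned, *examples[0]))
```

With zero weights all four labelings are equally likely (probability 0.25); after training, the probability of the true labeling is much higher.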

Finding the Optimal Labeling

Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we label it?

The naive way is to calculate $p(l | s)$ for every possible labeling l, and then choose the label that maximizes this probability. However, since there are $k^m$ possible labels for a tag set of size k and a sentence of length m, this approach would have to check an exponential number of labels.

A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal label, similar to the Viterbi algorithm for HMMs.
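As a rough sketch of that dynamic program, the Python below keeps, for every tag, the best score of any labeling of the prefix that ends in that tag, and then follows back-pointers. The tag list, features, weights, and sentence are made up for illustration, and the brute-force search is included only to check the result on a small input.

```python
import itertools

TAGS = ["NOUN", "VERB", "ADVERB", "ADJECTIVE", "PREPOSITION"]

# Hypothetical features; l_prev is None at the first (0-based) position.
def f_ly(s, i, l_i, l_prev):
    return 1.0 if l_i == "ADVERB" and s[i].endswith("ly") else 0.0

def f_adj_noun(s, i, l_i, l_prev):
    return 1.0 if l_prev == "ADJECTIVE" and l_i == "NOUN" else 0.0

def f_prep_prep(s, i, l_i, l_prev):
    return 1.0 if l_prev == "PREPOSITION" and l_i == "PREPOSITION" else 0.0

FEATURES = [f_ly, f_adj_noun, f_prep_prep]
WEIGHTS = [2.0, 1.0, -3.0]  # made-up values

def local_score(s, i, l_i, l_prev):
    """Weighted feature sum at a single position."""
    return sum(w * f(s, i, l_i, l_prev) for w, f in zip(WEIGHTS, FEATURES))

def total_score(s, labels):
    return sum(local_score(s, i, labels[i], labels[i - 1] if i > 0 else None)
               for i in range(len(s)))

def viterbi(s):
    """Best labeling of a non-empty sentence in O(n * k^2) local evaluations."""
    # best[t]: best score of any labeling of s[0..i] whose last tag is t.
    best = {t: local_score(s, 0, t, None) for t in TAGS}
    back = []  # back[i-1][t]: best previous tag when position i has tag t
    for i in range(1, len(s)):
        new_best, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[p] + local_score(s, i, t, p))
            new_best[t] = best[prev] + local_score(s, i, t, prev)
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # Walk the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def brute_force_best_score(s):
    return max(total_score(s, c)
               for c in itertools.product(TAGS, repeat=len(s)))

sentence = ["happy", "dogs", "bark", "loudly"]
labels = viterbi(sentence)
print(labels, total_score(sentence, labels))
```

Since the normalizer is the same for every labeling of a sentence, maximizing the score is enough to maximize $p(l | s)$. Several labelings can tie for the best score, so the check compares scores, not label sequences.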

Watching log of CMU Database Systems course

2017-09-08T12:30:00+08:00

Motivation

There is a habit I have always wanted to develop but never put into practice: watching lecture videos while I eat. The reason is that lecture videos are mostly not fun, especially when the content is entirely new to you. However, this semester I want to actually start developing this habit, partly because I'm missing the systems side of computer science. I miss the database knowledge I have picked up in the past three years and I don't want to lose touch with this field. So, I thought, why not start watching database lecture videos for fun when I eat? That leads to this post.

This post is a log of cool points I like when I watch CMU Database Group Database Systems lecture video.

Log

--- 09/07/2017 UPDATE ---

There are a bunch of data models besides the relational model: key/value, graph, document, column-family, array/matrix, hierarchical, network.
Thanks to Prof. Andy Pavlo, I finally understand the difference between relational algebra and relational calculus in terms of their purpose:

When we talk about using a data manipulation language (DML) to store and retrieve information from a database, there are two categories: procedural and non-procedural, which correspond to relational algebra and relational calculus respectively. For procedural languages, the query specifies the (high-level) strategy the DBMS should use to find the desired result. For non-procedural languages, the query specifies only what data is wanted and not how to find it. In fact, SQL is derived from relational calculus. In other words, relational calculus is used when we try to come up with a different query language to replace SQL.
The fundamental operators in relational algebra need to be implemented in the database system in order to manipulate tuples: $\sigma \text{(select)}, \pi \text{(projection)}, \cup \text{(union)}, \cap \text{(intersection)}, - \text{(difference)}, \times \text{(product)}, \bowtie \text{(join)}$.
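To get a feel for what "procedural" means here, a few of these operators can be sketched in Python over lists of dictionaries. This is only a toy model: the table contents and function names are made up, and a real DBMS implements these operators very differently.

```python
def select(rows, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """pi: keep only the given columns and drop duplicate tuples."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

def natural_join(left, right):
    """Join tuples that agree on every column name they share."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[c] == r[c] for c in shared):
                out.append({**l, **r})
    return out

# Hypothetical tables.
students = [{"sid": 1, "name": "Ann"}, {"sid": 2, "name": "Bob"}]
enrolled = [{"sid": 1, "course": "DB"},
            {"sid": 2, "course": "OS"},
            {"sid": 1, "course": "OS"}]

# "Names of students enrolled in OS", written as an explicit sequence of steps.
joined = natural_join(students, enrolled)
in_os = select(joined, lambda r: r["course"] == "OS")
print(project(in_os, ["name"]))  # [{'name': 'Ann'}, {'name': 'Bob'}]
```

The query spells out the steps (join, then select, then project), which is the procedural flavor of relational algebra; a non-procedural query would only state the condition the result must satisfy.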

--- 11/14/2018 UPDATE ---

Lock vs. Latch in database context:
- Lock is a high-level primitive on a logical component of a database: a lock on a database, a lock on a table, a lock on a record
- Latch is the lock from the OS perspective (a low-level primitive that works on a data structure): a latch on the page table, a latch on an index page

MAW Chapter 8: Disjoint set

2017-08-29T01:12:00+08:00

The Disjoint Set ADT is an efficient data structure for solving equivalence problems. It has wide applications: Kruskal's minimum spanning tree algorithm, least common ancestor, compiling equivalence statements in Fortran, Matlab's bwlabel() function in image processing, and so on. In this post, I'll walk through this data structure in order to be better prepared for the graph algorithms in the following chapter of MAW.

Equivalence relations
The dynamic equivalence problem
Quick-find
Quick-union
Improvements
- Smart union (weighted quick-union)
- Path compression
Remarks
Links to resources

Note

An implementation of quick-find, quick-union, and smart union with path compression in C can be seen here, along with an application that solves a problem in C++.

Equivalence relations

In order to better describe the dynamic equivalence problem, we need to first talk about the concept equivalence relation. A relation $R$ is defined on a set $S$ if for every pair of elements $(a,b)$, $a,b \in S$, $a$ $R$ $b$ is either true or false. If $a$ $R$ $b$ is true, then we say that $a$ is related to $b$. An equivalence relation is a relation $R$ that satisfies three properties:

(Reflexive) $a$ $R$ $a$, for all $a \in S$.
(Symmetric) $a$ $R$ $b$ iff $b$ $R$ $a$.
(Transitive) $a$ $R$ $b$ and $b$ $R$ $c$ implies that $a$ $R$ $c$.

Usually, we use $\sim$ to denote equivalence relation. Let's consider several examples:

The $\le$ relationship is NOT an equivalence relationship. Although it is reflexive (i.e., $a \le a$) and transitive (i.e., $a \le b$ and $b \le c$ implies $a \le c$), it is not symmetric, since $a \le b$ does not imply $b \le a$.
Electrical connectivity, where all connections are by metal wires, is an equivalence relation. The relation is clearly reflexive, as any component is connected to itself. If $a$ is electrically connected to $b$, then $b$ must be electrically connected to $a$, so the relation is symmetric. Finally, if $a$ is connected to $b$ and $b$ is connected to $c$, then $a$ is connected to $c$.
Two cities are related if they are in the same country. This is an equivalence relation.
Suppose town $a$ is related to $b$ if it is possible to travel from $a$ to $b$ by taking roads. This relation is an equivalence relation if all the roads are two-way.

We need to define another term, equivalence class, in order to talk about the dynamic equivalence problem. Suppose we are given a set $S$ of elements over which an equivalence relation is defined (e.g., for a set $\{a_1,a_2,a_3\}$, we might have $a_1 \sim a_2$). The equivalence class of an element $a \in S$ is the subset of $S$ that contains all the elements that are related to $a$. Notice that the equivalence classes form a partition of $S$: every member of $S$ appears in exactly one equivalence class.

The dynamic equivalence problem

The dynamic equivalence problem is essentially about supporting two operations on a set of elements over which an equivalence relation is defined:

find, which returns the name of the set (i.e., the equivalence class) containing a given element.
union, which merges the two equivalence classes containing $a$ and $b$ into a new equivalence class. From a set point of view, the result of union is to create a new set $S_k = S_i \cup S_j$, destroying the originals and preserving the disjointness of all the sets.

We can model the problem as follows: the input is initially a collection of $N$ sets, each with one element. This initial representation means that all relations (except reflexive relations) are false. Each set has a different element, so that $S_i \cap S_j = \emptyset$; this makes the sets disjoint. In addition, since we only care about which set each element belongs to, not about the elements' values, we can assume that all the elements have been numbered sequentially from $1$ to $N$. Thus, we have $S_i = \{i\}$ for $i = 1$ through $N$. Finally, we don't care what value the find operation returns as long as find(a) = find(b) iff $a$ and $b$ are in the same set.

Now, let's take a look at an example. Suppose we have a set of $10$ elements: $\{0,1,2,3,4,5,6,7,8,9\}$ and we perform the following union operations: $1 - 2, 3-4, 5-6, 7-8, 7-9, 2-8, 0-5, 1-9$. Then, we have three connected components (i.e. maximal set of objects that are mutually connected): $\{0,5,6\}, \{3,4\}, \{1,2,7,8,9\}$. find(5) should return the same value as find(6).

Quick-find

The first approach to solve the problem is called quick-find, which ensures that the find instruction can be executed in constant worst-case time. For the find operation to be fast, we could maintain, in an array, the name of the equivalence class for each element. Then find is just a simple $O(1)$ lookup:

In the above example, find(0) gives $0$; find(1) gives $1$; find(5) gives $0$. Thus, we know that $0 \sim 5$, $0 \nsim 1$, and $1 \nsim 5$. For the union(a,b) operation, suppose that $a$ is in equivalence class $i$ and $b$ is in equivalence class $j$. Then we scan down the array, changing all $i$'s to $j$.

In the above example, when we do union(6,1), we need to change all entries in the equivalence class of $6$ (i.e., the entries for $0,5,6$) into $1$'s. As you can see, the number of array accesses for a union operation is $O(N)$. Thus, a sequence of $N-1$ unions (the maximum, since then everything is in one set) would take $O(N^2)$ time.
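A minimal quick-find sketch in Python (the post's own implementation is in C; the class and attribute names here are my own) could look like this. It replays the union sequence from the earlier example on elements $0$ through $9$.

```python
class QuickFind:
    """id[i] stores the name of the equivalence class containing element i."""

    def __init__(self, n):
        self.id = list(range(n))  # every element starts in its own class

    def find(self, p):
        return self.id[p]  # O(1) lookup

    def union(self, p, q):
        pid, qid = self.id[p], self.id[q]
        if pid == qid:
            return
        # Scan the whole array, renaming class pid to qid: O(N).
        for i in range(len(self.id)):
            if self.id[i] == pid:
                self.id[i] = qid

qf = QuickFind(10)
for a, b in [(1, 2), (3, 4), (5, 6), (7, 8), (7, 9), (2, 8), (0, 5), (1, 9)]:
    qf.union(a, b)

print(qf.find(5) == qf.find(6))  # True
print(len(set(qf.id)))           # 3
```

The concrete class names produced by this sketch depend on the order of the unions, so they need not match the names in the figure; only the grouping matters.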

Quick-union

The second approach to solve the problem is to ensure that the union instruction can be executed in constant worst-case time, which is called "quick-union". One thing to note is that find and union cannot both be done in constant worst-case time simultaneously. Recall that the problem doesn't require that a find operation return any specific name as long as find on the elements in the same connected component returns the same value. Thus, we can use a tree to represent each component because each element in a tree has the same root. Thus, the root can be used to name the set. The structure looks like below:

Since only the name of the parent is required, we can assume that this tree is stored implicitly in an array: each entry $\text{id}[i]$ in the array represents the parent of element $i$. If $i$ is the root, then $\text{id}[i] = i$. A find(X) on element $X$ is performed by returning the root of the tree containing $X$. The time to perform this operation depends on the depth of the tree that represents the set containing $X$, which is $O(N)$ in the worst case because of the possibility of creating a tree of depth $N-1$. union(p,q) can be done by changing the root of the tree containing $p$ to point to the root of the tree containing $q$:

The step that changes the root value in union(p,q) is $O(1)$. However, we first need to find the roots of $p$ and $q$, which takes $O(N)$ in the worst case. Thus, the union operation takes $O(N)$.
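Here is a small Python sketch of quick-union, again with names of my own choosing. The `depth` helper is not part of the ADT; it is added only to show the worst case, where a chain of unions produces a tree of depth $N-1$.

```python
class QuickUnion:
    """id[i] is the parent of i; a root is its own parent."""

    def __init__(self, n):
        self.id = list(range(n))

    def find(self, p):
        # Follow parent links up to the root.
        while self.id[p] != p:
            p = self.id[p]
        return p

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p != root_q:
            self.id[root_p] = root_q  # the O(1) link step

    def depth(self, p):
        """Number of links from p to its root (illustration only)."""
        d = 0
        while self.id[p] != p:
            p = self.id[p]
            d += 1
        return d

n = 5
qu = QuickUnion(n)
for i in range(n - 1):
    qu.union(i, i + 1)  # builds the chain 0 -> 1 -> 2 -> 3 -> 4

print(qu.depth(0))  # 4
```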

Improvements

There are two major improvements we can make to quick-union: smart union improves the union operation and path compression improves the find operation. Their goal is to make the tree of each set shallow, which reduces the time we spend on find.

Smart union (weighted quick-union)

Smart union is a modification to quick-union that avoids tall trees. We keep track of the size (i.e., number of objects) of each tree and always link the root of the smaller tree to the root of the larger tree, breaking ties by any method. This approach is called union-by-size. In quick-union, we may make the larger tree a subtree of the smaller tree, which increases the depth of the new tree and, in turn, the find cost. The following picture demonstrates this point:

Another approach is called union-by-height, which tracks the height, instead of the size, of each tree and performs union by making the shallower tree a subtree of the deeper tree. The height of a tree increases only when two equally deep trees are joined (and then the height goes up by one). Thus, union-by-height is a trivial modification of union-by-size.

To find the running time of find and union, we need to find out the depth of any node $X$, which in this case is at most $\log N$. The proof is simple: the depth of $X$ increases only when its tree is linked under a tree that is at least as large, so the size of the tree containing $X$ at least doubles (e.g., when joining two equal-size trees). Since a tree has at most $N$ nodes, this doubling can happen at most $\log N$ times. Thus, the depth of any node is at most $\log N$. With this claim, the running time for find is $O(\log N)$ and the running time for union is $O(\log N)$ as well.
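The depth bound can be observed with a short union-by-size sketch in Python. The names are my own, and ties are broken by keeping the first root, which is just one of the allowed choices. The union sequence below is picked to always join equal-size trees, which is the case that makes trees as deep as possible.

```python
class WeightedQuickUnion:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n  # size[r] is meaningful only when r is a root

    def find(self, p):
        while self.parent[p] != p:
            p = self.parent[p]
        return p

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        # Link the root of the smaller tree under the root of the larger one.
        if self.size[root_p] < self.size[root_q]:
            self.parent[root_p] = root_q
            self.size[root_q] += self.size[root_p]
        else:
            self.parent[root_q] = root_p
            self.size[root_p] += self.size[root_q]

    def depth(self, p):
        """Number of links from p to its root (illustration only)."""
        d = 0
        while self.parent[p] != p:
            p = self.parent[p]
            d += 1
        return d

uf = WeightedQuickUnion(8)
for a, b in [(0, 1), (2, 3), (4, 5), (6, 7), (0, 2), (4, 6), (0, 4)]:
    uf.union(a, b)

print(max(uf.depth(i) for i in range(8)))  # 3, i.e. log2(8)
```

Running the chain of unions $(0,1), (1,2), \dots$ that was the worst case for plain quick-union gives a tree of depth 1 here, because the single new element is always linked under the larger tree.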

Path compression

Path compression is performed during a find operation and is independent of the strategy used to perform union. The effect of path compression is that every node on the path from $X$ to the root has its parent changed to the root. For example, suppose we call find(9) for the following tree representation of our disjoint set:

Then the following picture shows the end state of our tree after calling find(9). As you can see, on the path from $9$ to $0$ (root), we have $9, 6, 3, 1$. All of them have been directly connected to the root after the call is done:

This strategy may look familiar to you: we do the path compression in the hope that faster future accesses to these nodes (i.e., $9,6,3,1$) will pay off for the work we do now. This idea is exactly the same as splaying in a splay tree.

When unions are done arbitrarily, path compression is a good idea, because there is an abundance of deep nodes and these are brought near the root by path compression. Path compression is perfectly compatible with union-by-size, and thus both routines can be implemented at the same time. In fact, the combination of path compression and a smart union rule guarantees a very efficient algorithm in all cases. Path compression is not entirely compatible with union-by-height, because path compression can change the heights of the trees. We don't want to recompute all the heights, so in this case the heights stored for each tree become estimated heights (i.e., ranks), but in theory union-by-rank is as efficient as union-by-size.
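Union-by-size and path compression can be combined in a few lines. The Python sketch below uses a two-pass find: first locate the root, then re-point every node on the path at it. To mimic the find(9) example, the chain $9 \to 6 \to 3 \to 1 \to 0$ is written directly into the parent array; that shortcut bypasses the size bookkeeping and is only there to set up the demonstration, and the remaining nodes of the pictured tree are left out.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n  # size[r] is meaningful only when r is a root

    def find(self, x):
        # Pass 1: locate the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Pass 2: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a        # smaller tree goes under larger
        self.size[root_a] += self.size[root_b]

# Hand-built chain 9 -> 6 -> 3 -> 1 -> 0 (demonstration only).
uf = UnionFind(10)
for child, parent in [(9, 6), (6, 3), (3, 1), (1, 0)]:
    uf.parent[child] = parent

print(uf.find(9))                          # 0
print([uf.parent[i] for i in (9, 6, 3, 1)])  # [0, 0, 0, 0]
```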

If we analyze smart union with path compression, any sequence of $M$ union-find operations on $N$ objects makes $O(N + M\log^*N)$ ¹ array accesses.
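The iterated logarithm $\log^* N$ in this bound grows extremely slowly. A tiny Python sketch (assuming base-2 logarithms; the function name is my own) makes this easy to check:

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(65536))       # 4
print(log_star(2 ** 65536))  # 5
```

So for any input size that fits in a real machine, $\log^* N$ is at most 5, which is why the bound is often described as nearly linear in practice.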

The following table summarizes the running time for $M$ union-find operations on a set of $N$ objects (don't forget we need to spend $O(N)$ to initialize disjoint sets):

algorithm	worst-case time
quick-find	$MN$
quick-union	$MN$
smart union	$N + M\log N$
quick union + path compression	$N + M\log N$
smart union + path compression	$N + M\log^*N$

The running time of each operation for each algorithm is as follows:

algorithm	initialize	union	find
quick-find	N	N	1
quick-union	N	N	N
smart union	N	$\log N$	$\log N$
quick union + path compression	N	$\log N$	$\log N$
smart union + path compression	N	$\log^*N$	$\log^*N$

Remarks

Sedgewick's slides offer a view that may be helpful in modeling problems using the union-find data structure. Essentially, the union-find structure addresses the "dynamic connectivity problem":

Given a set of N objects, support two operations: 1. Connect two objects. 2. Is there a path connecting the two objects?

For example, given two points in a maze, we may ask "Is there a path connecting $p$ and $q$?" Objects can be:

Pixels in a digital photo.
Computers in a network.
Friends in a social network.
Transistors in a computer chip.
Elements in a mathematical set.
Variable names in a Fortran program.
Metallic sites in a composite system.

Sedgewick gives a list of union-find applications:

Percolation.
Games (Go, Hex).
Dynamic connectivity.
Least common ancestor.
Equivalence of finite state automata.
Hoshen-Kopelman algorithm in physics.
Hindley-Milner polymorphic type inference.
Kruskal's minimum spanning tree algorithm.
Compiling equivalence statements in Fortran.
Morphological attribute openings and closings.
Matlab's bwlabel() function in image processing.

Links to resources

Here are some of the resources I found helpful while preparing this article:

M. A. Weiss, Data Structures and Algorithm Analysis in C. (2nd ed.) Menlo Park, Calif: Addison-Wesley, 1997, ch. 8.
R. Sedgewick and K. Wayne, Algorithms. (4th ed.) Upper Saddle River, NJ: Addison-Wesley, 2011, ch. 1, sec. 5.

$\log^* N$ counts the number of times you have to take the $\log$ of $N$ to get one. This is also called iterated log function. For example, $\log^* 65536 = 4$ because $\log\log\log\log65536 = 1$. ↩

Knapsack problem

2017-08-14T16:45:00+08:00

I'm studying this problem on my way from Beijing to Austin to kill some time. This problem works pretty well for this purpose. There are three kinds of forms for the problem, which I'll illustrate below. This problem is a good example of dynamic programming paradigm.

Knapsack problem
A brief revisit to dynamic programming
0-1 knapsack problem
0-x knapsack problem
Fractional knapsack problem
Links to resources

Knapsack problem

Suppose we are planning a hiking trip; and we are, therefore, interested in filling a knapsack with items that are considered necessary for the trip. There are $N$ different item types that are deemed desirable; these could include bottle of water, apple, orange, sandwich, and so forth. Each item type has a given set of two attributes, namely a weight $W$ and a value that quantifies the level of importance associated with each unit of that type of item. Since the knapsack has a limited weight capacity, the problem of interest is to figure out how to load the knapsack with a combination of units of the specified types of items that yields the greatest total value.

A large variety of resource allocation problems can be cast in the framework of a knapsack problem. The general idea is to think of the capacity of the knapsack as the available amount of a resource and the item types as activities to which this resource can be allocated. Two quick examples are the allocation of an advertising budget to the promotions of individual products and the allocation of your effort to the preparation of final exams in different subjects.

A brief revisit to dynamic programming

Dynamic programming is an algorithm design strategy for problems that can be broken down into recurring smaller problems. Specifically, it is used when the solution can be described recursively in terms of solutions to subproblems; the algorithm finds solutions to subproblems and stores them for later use. This is better than "brute-force methods", which solve the same subproblem over and over again. It differs from the divide-and-conquer paradigm in that divide-and-conquer divides the problem into independent subproblems, solves them individually, and then combines them to form the final solution; unlike dynamic programming, the solution to one subproblem is never reused to solve another. From this perspective, dynamic programming is more like recursion + re-use ¹.

The general steps of the dynamic programming paradigm are:

Define the subproblems
Define the solution to subproblems, which can be reused by other subproblems (if you think of recursion, the solution to subproblem can be re-used by the subproblem one function call earlier).
Solve the problem bottom-up from "basic cases", building a table of solved subproblems that are used to solve larger ones
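To make the "recursion + re-use" idea concrete before turning to the knapsack, here is a minimal Python sketch. The Fibonacci numbers are an illustrative choice that is not part of the original discussion, and the function names are made up for this example. The first function is plain recursion that re-solves the same subproblems; the second follows the three steps above and fills a table bottom-up.

```python
def fib_recursive(n, calls):
    """Plain recursion: the same subproblem is solved many times."""
    calls[0] += 1                      # count how many calls are made
    if n < 2:
        return n
    return fib_recursive(n - 1, calls) + fib_recursive(n - 2, calls)


def fib_bottom_up(n):
    """Dynamic programming: solve each subproblem once and store it."""
    table = [0] * (n + 2)              # table[i] holds the i-th Fibonacci number
    table[1] = 1                       # base cases: table[0] = 0, table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]   # reuse two stored answers
    return table[n]


calls = [0]
print(fib_recursive(15, calls), "via recursion, using", calls[0], "calls")
print(fib_bottom_up(15), "via a single pass over the table")
```

Both functions return the same value; the difference is only in how often each subproblem gets solved.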

0-1 knapsack problem

The first type of knapsack problem has the following restriction on how items are picked: items are not divisible. In other words, you either take an item or not.

Let's first formulate this problem mathematically. We are given a knapsack with maximum capacity $W$ and a set $S$ consisting of $n$ items. Each item $i$ has some weight $w_i$ and benefit value $b_i$ (all $w_i$, $b_i$ and $W$ are integer values). The problem we try to solve is: how do we pack the knapsack to achieve the maximum total value of packed items? In other words, we look for a subset $T \subseteq S$ that attains $\max\sum_{i \in T} b_i \text{ subject to } \sum_{i \in T} w_i \le W$.

The first step in dynamic programming is to define the subproblems. In this problem, we can try to label the items from $1 \dots n$, and then a subproblem is to find an optimal solution for $S_i = \left\lbrace \text{items labeled } 1,2,\dots, i \right\rbrace$. Once we have a definition for the subproblem, we need to ask: can we describe the final solution (i.e., $S_n$) in terms of subproblems (i.e., $S_i$)? Let's take a look at an example:

Item	1	2	3	4	5
Weight $w_i$	2	3	4	5	9
Benefit $b_i$	3	4	5	8	10

Suppose $W = 20$. Then for $S_4$, the optimal solution is to put items $1,2,3,4$ into the knapsack because the total weight is $2 + 3 + 4 + 5 = 14$, which is less than $20$, and we get the maximum benefit: $20$. For $S_5$, the optimal solution is to put items $1,3,4,5$ into the knapsack, with total weight $2 + 4 + 5 + 9 = 20$ and maximum benefit $26$. However, as you can see, the solution for $S_4$ (i.e., $1,2,3,4$) is not part of the solution for $S_5$ (i.e., $1,3,4,5$). Thus, our definition of the subproblem is not right. Let's refine it by adding another parameter $w$ that bounds the total weight of each subset of items. Then, our subproblem is to find the best subset of $S_i$ whose total weight is at most $w$. The benefit of that best subset is denoted as $B[i, w]$.
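The claim about $S_4$ and $S_5$ is small enough to check by brute force. The Python sketch below simply tries every subset; the helper name best_subset is made up for this illustration, and exhaustive search is only practical for a handful of items.

```python
from itertools import combinations


def best_subset(weights, benefits, capacity):
    """Try every subset; return (best benefit, 1-based item labels)."""
    n = len(weights)
    best = (0, ())
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            weight = sum(weights[i] for i in combo)
            benefit = sum(benefits[i] for i in combo)
            if weight <= capacity and benefit > best[0]:
                best = (benefit, tuple(i + 1 for i in combo))
    return best


weights = [2, 3, 4, 5, 9]
benefits = [3, 4, 5, 8, 10]
s4 = best_subset(weights[:4], benefits[:4], 20)   # only items 1..4 allowed
s5 = best_subset(weights, benefits, 20)           # items 1..5 allowed
print("S_4:", s4)
print("S_5:", s5)
```

Item $2$ is in the best subset for $S_4$ but not in the best subset for $S_5$, which is exactly why the first subproblem definition breaks down.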

Note

The problem with the first subproblem definition is that we are effectively playing a greedy algorithm, in the sense that we always choose the optimal solution for $S_i$ given $W$. However, the problem with the greedy approach is that the optimal solution for $S_i$ may not be part of the solution for $S_n$. The refinement is that we allow the weight limit to vary instead of fixing it at $W$, so that it reflects the items we are considering. This idea will become clear once we walk through the solution to the problem. All in all, coming up with a good subproblem definition is the key to using dynamic programming, and it requires a lot of practice.

Once we have the subproblem definition, we can define our solution to the subproblem. In this case, our recursive formula for subproblems looks like the following:

$$ B[i,w]=\left\{ \begin{array}{ll} B[i-1,w] & \text{if } w_i > w \\ \max\{B[i-1, w],\ B[i-1, w-w_i] + b_i\} & \text{otherwise} \end{array} \right. $$

The idea behind the above recursive formula is that the best subset of $S_i$ with total weight at most $w$ either contains item $i$ or not. First case: $w_i > w$. Item $i$ can't be part of the solution because including it would make the total weight greater than $w$, which is unacceptable. Second case: $w_i \le w$. Item $i$ can be in the solution, and we choose the option with the greater value: not containing $i$ (i.e., $B[i-1, w]$) or containing $i$ (i.e., $B[i-1, w-w_i] + b_i$).

Note

As you can see from recursive formula, the solution $B[i,w]$ reuses the solution to $B[i-1, w]$ and $B[i-1, w-w_i]$, which is the signature of dynamic programming.

The algorithm is below:

for w = 0 to W 
    B[0,w] = 0
for i = 1 to n 
    B[i,0] = 0 
for i = 1 to n
    for w = 1 to W
        if w_i <=w  //item i can be part of the solution 
            if b_i + B[i-1,w-w_i ] > B[i-1,w]
                B[i,w] = b_i + B[i-1, w-w_i]
            else
                B[i,w] = B[i-1,w]
        else 
            B[i,w] = B[i-1,w] // w_i > w

The running time for this algorithm is $O(nW)$ because of the nested for loops. To understand this algorithm, let's take a look at an example:

Item	1	2	3	4
Weight $w_i$	2	3	4	5
Benefit $b_i$	3	4	5	6

Suppose $W = 5$. With the algorithm above, we essentially fill out a table like the following:

i \ w	1	2	3	4	5
0	0	0	0	0	0
1
2
3
4

Each cell in the above table represents $B[i,w]$, and the table looks like the above after we execute the first two for loops:

for w = 0 to W 
    B[0,w] = 0
for i = 1 to n 
    B[i,0] = 0

Now, let's take a look at $B[1,1]$. We have $i = 1, b_i = 3, w_i = 2, w = 1$. Since $w_i > w$, our algorithm sets $B[i,w] = B[i-1,w]$, which is $B[0,1]$. Thus, $B[1,1] = 0$ and the table becomes

i \ w	1	2	3	4	5
0	0	0	0	0	0
1	0
2
3
4

Filling out the rest of the table in the same way gives

i \ w	1	2	3	4	5
0	0	0	0	0	0
1	0	3	3	3	3
2	0	3	4	4	7
3	0	3	4	5	7
4	0	3	4	5	7
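The completed table can be reproduced with a short Python sketch of the table-filling algorithm. This is an illustration only (the function name knapsack_table is made up here, and it is not the C implementation mentioned later in the post); note that the Python lists are 0-based, so item $i$ lives at index $i-1$, and each printed row also includes the $w = 0$ column.

```python
def knapsack_table(weights, benefits, capacity):
    """Fill B[i][w]: best benefit using items 1..i within weight limit w."""
    n = len(weights)
    # Row 0 and column 0 are all zeros, matching the first two for loops.
    B = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w_i, b_i = weights[i - 1], benefits[i - 1]
        for w in range(1, capacity + 1):
            B[i][w] = B[i - 1][w]                      # leave item i out
            if w_i <= w:                               # item i fits
                B[i][w] = max(B[i][w], B[i - 1][w - w_i] + b_i)
    return B


B = knapsack_table([2, 3, 4, 5], [3, 4, 5, 6], 5)
for i, row in enumerate(B):
    print(i, row)          # row i, columns w = 0..5
```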

Now, we have the maximum benefit value given the capacity of the knapsack, which is $B[4,5] = 7$. The next question is to find out what items we should put inside the knapsack in order to have this maximum benefit (i.e., $7$). The algorithm is below:

Let i = n and k = W
while i > 0 and k > 0
    if B[i,k] != B[i-1,k], then
        mark the ith item as in the knapsack
        k = k - w_i
        i = i - 1
    else
        i = i - 1

Let's first walk through how this algorithm runs on our example and then look at the intuition behind it. We start with $i = 4, k = 5, w_i = 5, B[i,k] = 7, B[i-1,k] = 7$. Since the two values are equal, the algorithm decrements $i$ and moves on. The same situation happens when $i = 3$ because $B[i,k] = B[i-1,k]$. When $i = 2$, we have $B[i,k] = 7$ and $B[i-1,k] = 3$. Thus, we mark item $2$ as in the knapsack; this item has weight $w_2 = 3$ and benefit $b_2 = 4$. We then update $k$ by $k = k - w_i = 5 - 3 = 2$ and decrement $i$. Now, we are at the $B[1,2]$ cell. We do the same for $i = 1$: since $B[1,2] = 3 \ne B[0,2] = 0$, item $1$ should also be inside the knapsack.

You can probably tell by now that the idea behind the algorithm for finding the exact items inside the knapsack is that we only want to put in an item that brings a value gain. This is just like "marginal thinking" in Economics. For example, we don't want to put item $4$ inside the knapsack because from $i = 3$ to $i = 4$, there is no value gain (i.e., $B[4,5] = B[3,5] = 7$).
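The backward walk through the table can be sketched in Python as follows. Again this is only an illustration with a made-up function name; it rebuilds the table so the block is self-contained, and it subtracts the weight of the chosen item before moving on to the previous row.

```python
def knapsack_items(weights, benefits, capacity):
    """Return (maximum benefit, sorted 1-based labels of the chosen items)."""
    n = len(weights)
    B = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(1, capacity + 1):
            B[i][w] = B[i - 1][w]
            if weights[i - 1] <= w:
                B[i][w] = max(B[i][w],
                              B[i - 1][w - weights[i - 1]] + benefits[i - 1])

    chosen = []
    i, k = n, capacity
    while i > 0 and k > 0:
        if B[i][k] != B[i - 1][k]:     # value gain: item i must be included
            chosen.append(i)
            k -= weights[i - 1]        # use the weight of item i itself
        i -= 1                         # move to the previous row either way
    return B[n][capacity], sorted(chosen)


result = knapsack_items([2, 3, 4, 5], [3, 4, 5, 6], 5)
print(result)
```

For the running example this picks items $1$ and $2$ with total benefit $7$, matching the walk-through above.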

Note

The detailed implementation of the algorithm in C can be found in my code-for-blog repo.

0-x knapsack problem

The 0-x knapsack problem is a generalization of the 0-1 knapsack problem in the sense that we can load multiple units of item $i$ into our knapsack. Let's use $x_i$ to denote the number of units of the $i$th item that are loaded into the knapsack. One requirement on $x_i$ is that it has to be integer-valued.

The problem formulation and subproblem definition are just like those of the 0-1 knapsack problem. The only thing we need to adjust is the recursive solution to subproblems:

$$ B[i,w]=\left\{ \begin{array}{ll} B[i-1,w] & \text{if } w_i > w \\ \max_{0 \le x_i \le \lfloor\frac{w}{w_i}\rfloor}\{B[i-1, w],\ B[i-1, w-w_ix_i] + b_ix_i\} & \text{otherwise} \end{array} \right. $$

When $w_i > w$, we cannot include item $i$ at all. However, when $w_i \le w$, we have more options because we can include multiple units of item $i$. But we cannot assign a value greater than $w/w_i$ to $x_i$, and $w/w_i$ may not be an integer. So, $x_i$ should be in the range $0 \le x_i \le \lfloor w/w_i \rfloor$, where the notation $\lfloor x \rfloor$ is, for any given $x$, defined as the greatest integer less than or equal to $x$. The algorithm is shown below:

for w = 0 to W
    B[0,w] = 0
for i = 1 to n
    B[i,0] = 0
for i = 1 to n
    for w = 1 to W
        if w_i <= w
            cands_B = a list of (floor(w/w_i)+1) items // hold the possible B[i,w] value for each x_i
            for x_i = 0 to floor(w/w_i)
                cands_B[x_i] = B[i-1, w - w_i * x_i] + b_i * x_i
            B[i,w] = max(cands_B)
            x[i,w] = x_i that is associated with B[i,w]
        else
            B[i,w] = B[i-1,w]
            x[i,w] = 0 // item i does not fit

Let's walk through an example to better understand the algorithm idea:

Item	1	2	3
Weight $w_i$	3	8	5
Benefit $b_i$	4	6	5

$W = 8$ in our example. The algorithm essentially fills out two tables: a $B[i,w]$ table like the one in the 0-1 knapsack problem:

i \ w	3	4	5	6	7	8
0	0	0	0	0	0	0
1	4	4	4	8	8	8
2	4	4	4	8	8	8
3	4	4	5	8	8	9

and $x[i,w]$ table to keep track of the optimal value of $x_i$ for each $i$ and $w$

i \ w	3	4	5	6	7	8
1	1	1	1	2	2	2
2	0	0	0	0	0	0
3	0	0	1	0	0	1

For $i = 1$, we have $w_i = 3, b_i = 4$. Thus, when $w = 0,1,2$, we have $B[i,w] = 0$ and $x[i,w] = 0$ as well. When $w = 3$, we can include item $1$ inside the knapsack and we calculate the following:

$$ B[1,3] = \max \{B[0,3], B[0,3-3*1]+4*1\} = \max \{0, 0+4\} = 4 $$

So, we fill in $B[1,3] = 4$ and $x[1,3] = 1$. We can repeat this for the other values of $w$ and $i$. We can also look at the values from another perspective, as in the tables below:

B[1,w]	w	x_1*
0	0	0
0	1	0
0	2	0
4	3	1
4	4	1
4	5	1
8	6	2
8	7	2
8	8	2

B[2,w]	w	x_2*
0	0	0
0	1	0
0	2	0
4	3	0
4	4	0
4	5	0
8	6	0
8	7	0
8	8	0

B[3,w]	w	x_3*
0	0	0
0	1	0
0	2	0
4	3	0
4	4	0
5	5	1
8	6	0
8	7	0
9	8	1

The values are all the same but the data are organized differently. Clearly, two tables are better than three tables in terms of space. As you can see from either set of tables, the maximum benefit we can achieve is $9$, which is the entry $B[3,8]$. In order to find out the exact composition of the knapsack, we can work backwards from $i = 3$. With $i = 3$ and $w = 8$, $x_3 = 1$. Since $w_3 = 5$ and $W = 8$, we have $3$ left in the knapsack in terms of weight. For $i = 2$ and $w = 3$ (because we have only $3$ left in our knapsack weight capacity), $x_2 = 0$. Then, for $i = 1, w = 3$ (we still have a budget of $3$ left), $x_1 = 1$. Thus, our optimal choice for each item is $x_1 = 1, x_2 = 0, x_3 = 1$.
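Both tables and the backward pass fit in one short Python sketch. The function name knapsack_0x is made up for this illustration; weights are assumed to be positive integers, and when several values of $x_i$ give the same benefit the sketch keeps the smallest one, which is an arbitrary tie-breaking choice.

```python
def knapsack_0x(weights, benefits, capacity):
    """Return (maximum benefit, [x_1, ..., x_n]) for the 0-x knapsack."""
    n = len(weights)
    B = [[0] * (capacity + 1) for _ in range(n + 1)]   # best benefit
    X = [[0] * (capacity + 1) for _ in range(n + 1)]   # best x_i per cell
    for i in range(1, n + 1):
        w_i, b_i = weights[i - 1], benefits[i - 1]
        for w in range(1, capacity + 1):
            best, best_x = B[i - 1][w], 0              # x_i = 0
            for x in range(1, w // w_i + 1):           # x_i = 1 .. floor(w / w_i)
                cand = B[i - 1][w - w_i * x] + b_i * x
                if cand > best:
                    best, best_x = cand, x
            B[i][w], X[i][w] = best, best_x

    # Work backwards from row n, shrinking the remaining weight budget.
    counts = [0] * n
    w = capacity
    for i in range(n, 0, -1):
        counts[i - 1] = X[i][w]
        w -= weights[i - 1] * X[i][w]
    return B[n][capacity], counts


answer = knapsack_0x([3, 8, 5], [4, 6, 5], 8)
print(answer)
```

On the example above this reports a maximum benefit of $9$ with $x_1 = 1, x_2 = 0, x_3 = 1$.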

Fractional knapsack problem

Another type of knapsack problem is the fractional knapsack problem. In this setting, the items are divisible. In other words, we can take a fraction of an item. This problem can also be considered a generalization of the 0-x knapsack problem that no longer requires $x_i$ to be an integer. In this case, we actually use the greedy algorithm paradigm instead of the dynamic programming paradigm to solve the problem.

Let's use the same example as 0-x knapsack problem. In this case, we actually calculate the benefit per unit of weight first:

$$ \frac{b_1}{w_1} = \frac{4}{3}, \frac{b_2}{w_2} = \frac{6}{8}, \frac{b_3}{w_3} = \frac{5}{5} $$

As you can see, the first item gives the maximum benefit per unit of weight, so we want to load as much of this item as possible, which is $W/w_1 = 8/3$ units, for a total benefit of $\frac{8}{3} \cdot 4 = \frac{32}{3} \approx 10.67$. The reason we can use a greedy algorithm here is that we can take a fraction of an item. If we don't allow fractions of items, as in the 0-x knapsack problem, this approach gives a suboptimal solution: we can only have $2$ units of item $1$ (because $\lfloor\frac{8}{3}\rfloor = 2$), and the remaining weight capacity of $2$ is not big enough to fit any other item. Thus, the benefit we get in this case is $2*4 = 8$, which is smaller than $9$, the maximum benefit obtained from the dynamic programming approach.
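Here is a minimal Python sketch of the greedy idea, assuming (as in the 0-x section) that an unlimited number of units of each item is available. The function names are made up for this illustration. The first function allows fractional units; the second applies the same ratio-first rule with whole units only, to show where greedy falls short.

```python
from fractions import Fraction


def greedy_fractional(weights, benefits, capacity):
    """Fill the knapsack with the item of highest benefit per unit weight.

    Returns (1-based item label, units loaded, total benefit).
    """
    best = max(range(len(weights)),
               key=lambda i: Fraction(benefits[i], weights[i]))
    units = Fraction(capacity, weights[best])      # may be a fraction
    return best + 1, units, units * benefits[best]


def greedy_integer(weights, benefits, capacity):
    """Same ratio-first rule, but only whole units may be loaded."""
    order = sorted(range(len(weights)),
                   key=lambda i: Fraction(benefits[i], weights[i]),
                   reverse=True)
    total, remaining = 0, capacity
    for i in order:
        units = remaining // weights[i]
        total += units * benefits[i]
        remaining -= units * weights[i]
    return total


print(greedy_fractional([3, 8, 5], [4, 6, 5], 8))
print(greedy_integer([3, 8, 5], [4, 6, 5], 8))
```

Exact fractions are used instead of floats so that the $8/3$ units and the $32/3$ benefit come out without rounding.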

Links to resources

Here are some of the resources I found helpful while preparing this article:

See Difference between Divide and Conquer Algo and Dynamic Programming SO post ↩

Understanding how function call works

2017-07-30T00:21:00+08:00

Understanding assembly language is crucial for system programming. Some nasty defects of a system can only be solved by digging into the assembly level of the program. In this post, I'll revisit the call stack concept as a way to understand how function calls work under the covers of a high-level language. In addition, this post is part of the future work mentioned in my post back in January.

Addressing mode
Main course
Future works
Links to resources

Addressing mode

Before we jump into the actual material, I want to briefly revisit the various ways assembly language accesses data in memory (i.e., addressing modes). The following table is adapted from CSAPP (2nd edition):

Type	Form	Operand Value	Name
Immediate	$\$Imm$	$Imm$	Immediate
Register	$E_a$	$R[E_a]$	Register addressing
Memory	$Imm$	$M[Imm]$	Direct addressing
Memory	$(E_a)$	$M[R[E_a]]$	Indirect addressing
Memory	$Imm(E_b, E_i, s)$	$M[Imm+R[E_b]+(R[E_i]\cdot s)]$	Scaled indexed addressing ¹

In the above table,

$Imm$ refers to a constant value, e.g. $\mathtt{0x8048d8e}$ or $\mathtt{48}$
$E_x$ refers to a register, e.g. $\mathtt{\%eax}$
$R[E_x]$ refers to the value stored in register $E_x$
$M[x]$ refers to the value stored at memory address $x$
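The memory rows of the table all follow one formula, $M[Imm+R[E_b]+(R[E_i]\cdot s)]$, with some parts left out. The Python sketch below models $R$ and $M$ as dictionaries to show this; the register contents and memory contents are hypothetical values chosen only for the illustration.

```python
def memory_operand(R, M, imm=0, base=None, index=None, scale=1):
    """Value of the operand Imm(E_b, E_i, s); omitted parts count as 0."""
    address = imm
    if base is not None:
        address += R[base]
    if index is not None:
        address += R[index] * scale
    return M[address]


# Hypothetical machine state, for illustration only.
R = {"%eax": 0x100, "%ecx": 0x1, "%edx": 0x3}
M = {0x100: 0xFF, 0x104: 0xAB, 0x108: 0x13, 0x10C: 0x11}

direct = memory_operand(R, M, imm=0x108)                        # 0x108
indirect = memory_operand(R, M, base="%eax")                    # (%eax)
base_disp = memory_operand(R, M, imm=4, base="%eax")            # 4(%eax)
scaled = memory_operand(R, M, base="%eax", index="%edx", scale=4)  # (%eax,%edx,4)

print(hex(direct), hex(indirect), hex(base_disp), hex(scaled))
```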

Main course

Now, let's bring our main course onto the table: understanding how function calls work. I'll first clear up some terms we will use during the explanation. Then, we'll take a look at the stack and understand how it supports function calls. Lastly, we'll examine two assembly programs and understand the whole picture of function calls.

Some terms

Let's first consider the key elements we need in order to form a function:

function name

A function's name is a symbol that represents the address where the function's code starts.
function arguments

A function's arguments (aka. parameters) are the data items that are explicitly given to the function for processing. For example, in mathematics, there is a $\sin$ function. If you were to ask a computer to find the $\sin (2)$, $\sin$ would be the function's name, and $2$ would be the argument (or parameter).
local variables

Local variables are data storage that a function uses while processing that is thrown away when it returns. It's kind of like a scratch pad of paper. Functions get a new piece of paper every time they are activated, and they have to throw it away when they are finished processing.
return address

The return address is an "invisible" parameter in that it isn't directly used during the function. The return address is a parameter which tells the function where to resume executing after the function is completed. This is needed because functions can be called to do processing from many different parts of our program, and the function needs to be able to get back to wherever it was called from. In most programming languages, this parameter is passed automatically when the function is called. In assembly language, the call instruction handles passing the return address for you, and ret handles using that address to return back to where you called the function from.
return value

The return value is the main method of transferring data back to the main program. Most programming languages only allow a single return value for a function.

Note

The way that the variables are stored and the parameters and return values are transferred by the computer varies from language to language. This variance is known as a language's calling convention, because it describes how functions expect to get and receive data when they are called. In this post, I'll follow C programming language calling convention.

Stack

Each computer program that runs uses a region of memory called the stack to enable functions to work properly. The machine uses the stack to pass function arguments, to store return information, to save registers for later restoration, and for local variables. The portion of the stack allocated for a single function call is called a stack frame. In other words, for each function call, new space (i.e., a stack frame) is created on the stack.

The computer's stack lives at the very top addresses of memory. As the name suggests, it is a stack data structure, with the "top" of the stack growing from high addresses towards low addresses. We use $\mathtt{push \text{ } S}$ to push the source onto the stack, and we use $\mathtt{pop \text{ } D}$ to remove the top value from the stack and place it into a destination (i.e. a register or memory location). We use the stack pointer register, $\mathtt{\%esp}$, as a pointer to the top of the stack, which at the same time is the top of the topmost stack frame.

Note

Pointer here means that the stack register contains an address in memory instead of a regular value. Specifically, the stack register contains the address of the memory location that holds the top value of the stack. With this description, we can see that to access the value on the top of the stack without removing it, we can use (%esp), which is the indirect addressing mode.
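A tiny Python model makes the direction of growth visible: pushing moves the stack pointer to a lower address and popping moves it back up. The class name Stack32, the starting address 0x1000, and the 4-byte word size are assumptions made for this sketch (4 bytes matches the 32-bit setting of this post).

```python
class Stack32:
    """Toy model of a downward-growing stack with 4-byte slots."""

    def __init__(self, top=0x1000):
        self.esp = top          # models %esp; the start address is arbitrary
        self.mem = {}           # models memory: address -> value

    def push(self, value):      # push S
        self.esp -= 4           # the stack grows towards lower addresses
        self.mem[self.esp] = value

    def pop(self):              # pop D
        value = self.mem[self.esp]
        self.esp += 4           # the stack shrinks towards higher addresses
        return value

    def peek(self):             # (%esp): read the top without removing it
        return self.mem[self.esp]


s = Stack32()
s.push(3)
s.push(2)
print(hex(s.esp), s.peek())     # %esp moved down by 8 bytes; top value is 2
print(s.pop(), hex(s.esp))      # popping returns 2 and moves %esp back up
```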

When we talk about function calls, what we really care about is the topmost stack frame because that's the memory region that is associated with our current function calls. CSAPP (2nd edition) has a nice picture about what the whole stack looks like:

If some of the text (e.g. "Saved %ebp") or the layout (e.g. the order of arguments) doesn't make sense to you, don't worry. I'll talk about them shortly.

Calling a function

Before executing a function, a program pushes all of the parameters for the function onto the stack in the reverse order that they are documented. Then the program issues a call instruction indicating which function it wishes to start. The call instruction does two things:

First it pushes the address of the next instruction, which is the return address, onto the stack.
Then, it modifies the instruction pointer $\mathtt{\%eip}$ to point to the start of the function.

Note

When you call a function, you should assume that everything currently in your registers will be wiped out. The only register that is guaranteed to be left with the value it started with is $\mathtt{\%ebp}$ (why? see "Writing a function" section below). Thus, if there are registers you want to save before calling a function, you need to save them by pushing them on the stack before pushing the function's parameters.

So, at the time the function starts, the stack looks like this:

Argument N
...
Argument 2
Argument 1
Return address <--- (%esp)

As noted previously, the stack pointer holds the address of the memory location whose value is the return address.

Writing a function

Writing a function in x86 assembly essentially consists of three parts: setup, using the stack frame to perform the task, and cleanup. The setup and cleanup steps are the same across all function calls. All three steps will be explained in detail.

Setup

During the setup, the following two instructions are carried out immediately:

pushl %ebp
movl %esp, %ebp

The first instruction saves the current base pointer register (aka frame pointer), $\mathtt{\%ebp}$. The base pointer is a special register used for accessing function parameters and local variables. The stack frame is delimited by two pointers: $\mathtt{\%ebp}$ points to the bottom of the stack frame and $\mathtt{\%esp}$ points to the top of the stack frame. As pointed out earlier, each function call has its own stack frame. Once the current function (i.e. callee) is done, we need to resume the execution of the caller function. This means that we need to restore the caller's base pointer register $\mathtt{\%ebp}$ when we are done with the callee function. Thus, we need to save the current base pointer register, which is the caller's, for the future restoration of the caller's stack frame.

Once we are done with saving the caller's $\mathtt{\%ebp}$, we can now set up the current stack frame's $\mathtt{\%ebp}$ by doing movl %esp, %ebp. The reason for this is that we are now able to access the function parameters that were pushed earlier onto the stack by the caller function at fixed offsets from the base pointer. We cannot use the stack pointer directly for accessing parameters because the stack pointer can move while the function is executing.

At this point, the stack looks like this:

Argument N     <--- N*4+4(%ebp)
... 
Argument 2     <--- 12(%ebp)
Argument 1     <--- 8(%ebp)
Return address <--- 4(%ebp)
Old %ebp       <--- (%esp) and (%ebp)

Using the stack frame

Once we have performed the fixed setup, we can now use the stack frame to:

save registers

We push the callee-save registers onto the stack. By convention, registers $\mathtt{\%eax}$, $\mathtt{\%edx}$, $\mathtt{\%ecx}$ are classified as caller-save registers, and $\mathtt{\%ebx}$, $\mathtt{\%esi}$, and $\mathtt{\%edi}$ are classified as callee-save registers. Caller-save means that the caller function is responsible for saving these register values because the callee is free to overwrite them. Callee-save, on the other hand, means that the callee function must save those register values by pushing them onto the stack before overwriting them, and restore them before returning, because the caller may need these values for its future computations.

Note

The save registers step is not mandatory. If the function does not overwrite any callee-save registers, or if the caller function (and higher-level functions) doesn't rely on them, you are free to skip this step.

local variables

The function reserves space on the stack for any local variables it needs. Space for data with no specified initial value can be allocated on the stack by simply decrementing the stack pointer by an appropriate amount. Similarly, space can be deallocated by incrementing the stack pointer. Suppose we are going to need two words of memory to run a function. We can simply move the stack pointer down two words to reserve the space:

subl $8, %esp  # Allocate 8 bytes of space on the stack

While it is possible to make space on the stack as needed in a function body, it is generally more efficient to allocate this space all at once at the beginning of the function. This way, we are free from worrying about clobbering the local variables with pushes that we may make for the next function calls (i.e. pushing arguments and the return address for the function calls contained inside the current function, which all happens in the "Argument build area" in the above picture).

Suppose we save $\mathtt{\%ebx}$ (i.e. callee-save register), and with our two words for local storage, our stack now looks like this:

Argument N      <--- N*4+4(%ebp)
... 
Argument 2      <--- 12(%ebp)
Argument 1      <--- 8(%ebp)
Return address  <--- 4(%ebp)
Old %ebp        <--- (%ebp)
%ebx            <--- -4(%ebp)
Local variable1 <--- -8(%ebp)
Local variable2 <--- -12(%ebp) and (%esp)

As you can see, we can now access all of the data we need for this function by using base pointer addressing using different offsets from $\mathtt{\%ebp}$. $\mathtt{\%ebp}$ was made specifically for this purpose, which is why it is called the base pointer.
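The offsets in the layout above can be double-checked with a small Python model of the call sequence. Everything here is a sketch: the function name build_frame, the starting address, and the concrete values for the arguments, return address, old %ebp and %ebx are hypothetical and chosen only to make the slots distinguishable.

```python
def build_frame(args, return_address, old_ebp, saved_ebx, local_bytes,
                esp=0x1000):
    """Model the pushes up to the point where the function body starts."""
    mem = {}

    def push(value):
        nonlocal esp
        esp -= 4
        mem[esp] = value

    for arg in reversed(args):   # caller pushes arguments in reverse order
        push(arg)
    push(return_address)         # call pushes the return address
    push(old_ebp)                # pushl %ebp
    ebp = esp                    # movl %esp, %ebp
    push(saved_ebx)              # pushl %ebx (callee-save register)
    esp -= local_bytes           # subl $8, %esp (room for local variables)
    return mem, ebp, esp


# Hypothetical values, for illustration only.
mem, ebp, esp = build_frame(args=[10, 20], return_address=0x8048000,
                            old_ebp=0xBFFF0000, saved_ebx=7, local_bytes=8)

print("argument 1 at 8(%ebp):  ", mem[ebp + 8])
print("argument 2 at 12(%ebp): ", mem[ebp + 12])
print("return address 4(%ebp): ", hex(mem[ebp + 4]))
print("old %ebp at (%ebp):     ", hex(mem[ebp]))
print("saved %ebx at -4(%ebp): ", mem[ebp - 4])
print("%esp is at offset", esp - ebp, "from %ebp")
```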

Cleanup

When a function is done executing, it does the following five things:

It stores its return value in $\mathtt{\%eax}$
It frees the stack space it allocated by adding the same amount to the stack pointer (i.e., addl $8, %esp)
It pops off the registers it saved earlier (i.e., popl %ebx)
It resets the stack to what it was when it was called (it gets rid of the current stack frame and puts the stack frame of the caller back into effect)
It returns control back to wherever it was called from. This is done using the ret instruction, which pops whatever value is at the top of the stack, and sets the instruction pointer $\mathtt{\%eip}$ to that value.

The reason we have to restore the caller's base pointer register before calling ret is due to the structure of our current stack frame: in our current stack frame, the return address is not at the top of the stack. Therefore, in order to bring the return address to the top of the stack, we need to move the stack pointer $\mathtt{\%esp}$ to the current stack frame's base pointer $\mathtt{\%ebp}$ and restore the caller's frame pointer first. Then, the return address will be at the top of the stack. Since we need to restore the caller's frame pointer anyway, everything just works out.

Thus, to return from the function you have to do the following:

movl %ebp, %esp # Set stack pointer back to the beginning of the frame
popl %ebp       # Restore the caller's base pointer and now stack pointer pointing to Return address
ret             # Since stack pointer pointing to return address, we can now call ret

Note

Steps 2 and 3 are unnecessary if we don't save any registers at all. The reason we do step 2 is that we need to move the stack pointer so that it points to the saved registers. If there are no saved registers, step 4 achieves the same effect as step 2: once you move the stack pointer back, future stack pushes will likely overwrite everything you put there.
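The setup and cleanup sequence can be traced end to end with a small Python model, to confirm that ret really finds the return address on top of the stack and that the stack pointer ends up where it started. The function name and all concrete values (addresses, register contents) are hypothetical and exist only for this sketch.

```python
def call_and_return(esp=0x1000, caller_ebp=0x2000):
    """Trace one call: setup, save %ebx, reserve locals, then cleanup."""
    mem = {}
    ebp = caller_ebp
    start_esp = esp

    def push(value):
        nonlocal esp
        esp -= 4
        mem[esp] = value

    def pop():
        nonlocal esp
        value = mem[esp]
        esp += 4
        return value

    push(0x8048123)      # call: push the return address (made-up value)
    push(ebp)            # pushl %ebp
    ebp = esp            # movl %esp, %ebp
    push(0xB)            # pushl %ebx (made-up register content)
    esp -= 8             # subl $8, %esp

    esp += 8             # addl $8, %esp       (step 2)
    ebx = pop()          # popl %ebx           (step 3)
    esp = ebp            # movl %ebp, %esp     (step 4)
    ebp = pop()          # popl %ebp           (step 4, continued)
    eip = pop()          # ret                 (step 5)
    return ebx, ebp, eip, esp == start_esp


ebx, ebp, eip, balanced = call_and_return()
print(hex(ebx), hex(ebp), hex(eip), balanced)
```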

Two examples

Now, we can take a look at two examples: the first one calculates a power given two numbers, one as the base and the other as the exponent. The second example calculates the factorial of a given number, which demonstrates how a recursive function works.

        # PURPOSE: Program to illustrate how functions work
        #          This program will compute the value of  2^3 + 5^2
        #

        # Everything in the main program is stored in registers,
        # so the data section doesn't have anything.
        .section .data

        .section .text

        .globl _start

_start:
        pushl $3                  # push second argument
        pushl $2                  # push first argument
        call power                # call the function
        addl $8, %esp             # move the stack pointer back
        pushl %eax                # save the first answer before calling the next function

        pushl $2                  # push second argument
        pushl $5                  # push first argument
        call power                # call the function
        addl $8, %esp             # move the stack pointer back

        popl %ebx                 # The second answer is already in %eax. We saved the
                                  # first answer onto the stack, so now we can just pop
                                  # it out into %ebx
        addl %eax, %ebx           # add them together, the result is in %ebx
        movl $1, %eax             # exit (%ebx is returned)
        int $0x80

        # PURPOSE: This function is used to compute the value of a number raised to a power
        #
        # INPUT: First argument - the base number
        #        Second argument - the power to raise it to
        #
        # OUTPUT: Will give the result as a return value
        #
        # NOTES: The power must be 1 or greater
        #
        # VARIABLES:
        #        %ebx - holds the base number
        #        %ecx - holds the power
        #        -4(%ebp) - holds the current result
        #
        #        %eax is used for temporary storage
        .type power, @function
power:
        pushl %ebp                # save old base pointer
        movl %esp, %ebp           # make stack pointer the base pointer
        subl $4, %esp             # get room for our local storage

        movl 8(%ebp), %ebx        # put first argument in %ebx
        movl 12(%ebp), %ecx       # put second argument in %ecx

        movl %ebx, -4(%ebp)       # store current result

power_loop_start:
        cmpl $1, %ecx             # if the power is 1, we are done
        je end_power
        movl -4(%ebp), %eax       # move the current result into %eax
        imull %ebx, %eax          # multiply the current result by the base number
        movl %eax, -4(%ebp)       # store the current result

        decl %ecx                 # decrease the power
        jmp power_loop_start      # run for the next power

end_power:
        movl -4(%ebp), %eax       # return value goes in %eax
        movl %ebp, %esp           # restore the stack pointer
        pop %ebp                  # restore the base pointer
        ret

The key to understanding the function call is to trace through the state of the stack frame. One thing to highlight as a side note is:

.type power,@function

This tells the linker that the symbol power should be treated as a function. Since this program is only in one file, it would work just the same with this left out. However, it is good practice.
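For readers who prefer to see the control flow in a higher-level language, the loop inside power can be mirrored in Python. This is only a reading aid that follows the assembly line by line; the guard for exponents below 1 is an addition made for this sketch (the assembly merely documents that the power must be 1 or greater).

```python
def power(base, exponent):
    """Mirror of the assembly loop; requires exponent >= 1."""
    if exponent < 1:
        raise ValueError("the power must be 1 or greater")
    result = base                 # movl %ebx, -4(%ebp)
    while exponent != 1:          # cmpl $1, %ecx / je end_power
        result *= base            # imull %ebx, %eax
        exponent -= 1             # decl %ecx
    return result                 # movl -4(%ebp), %eax


exit_status = power(2, 3) + power(5, 2)   # what _start leaves in %ebx
print(exit_status)
```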

Note

To run the program on 64-bit platform, we need to simulate 32-bit environment by assembling and linking our program like this:

as --32 power.s -o power.o; ld -m elf_i386 -s power.o -o power

        # PURPOSE: - Given a number, this program computes the factorial. For example,
        #            the factorial of 3 is 3 * 2 * 1, or 6. The factorial of
        #            4 is 4 * 3 * 2 * 1, or 24, and so on.
        #
        # This program shows how to call a function recursively.

        .section .data

        # This program has no global data

        .section .text

        .global _start
        .global factorial # this is unneeded unless we want to share
                          # this function among other programs

_start:
        pushl $4          # The factorial takes one argument - the number we want
                          # a factorial of. So, it gets pushed.
        call factorial    # run the factorial function
        addl $4, %esp     # scrubs the parameter that was pushed on the stack
        movl %eax, %ebx   # factorial returns the answer in %eax, but we want it
                          # in %ebx to send it as our exit status
        movl $1, %eax     # call the kernel's exit function
        int $0x80

        # This is the actual function definition
        .type factorial, @function
factorial:
        pushl %ebp         # standard function stuff - we have to restore %ebp
                           # to its prior state before returning, so we have to push it
        movl  %esp, %ebp   # This is because we don't want to modify the stack pointer
                           # so we use %ebp
        movl 8(%ebp), %eax # This moves the first argument to %eax
                           # 4(%ebp) holds the return address, and 8(%ebp) holds the first parameter
        cmpl $1, %eax      # if the number is 1, this is our base case, and we simply
                           # return (1 is already in %eax as the return value)
        je end_factorial
        decl %eax          # otherwise, decrease the value
        pushl %eax         # push it for our call to factorial
        call factorial     # call factorial
        movl 8(%ebp), %ebx # %eax has the return value, so we reload our parameter
                           # into %ebx
        imull %ebx, %eax   # multiply that by the result of the last call to factorial
                           # (in %eax) the answer is stored in %eax, which is good since
                           # that's where return values go.
end_factorial:
        movl %ebp, %esp    # standard function return stuff - we have to restore
        popl %ebp          # %ebp and %esp to where they were before the function started
        ret                # return to the caller (this pops the return address off the stack)

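One good practice to note is that we should always clean up our stack parameters after a function call returns. In this program, we do addl $4, %esp immediately after the call to factorial in _start. Inside factorial itself, the argument pushed for the recursive call is discarded implicitly when movl %ebp, %esp restores the stack pointer.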
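For readers who find a higher-level language easier to follow, the recursion in the assembly program corresponds roughly to the C++ function below. This is only a sketch for comparison: the name factorial mirrors the assembly label, while the unsigned int type and the assert are my own assumptions (the assembly version has no base case for 0 either).

```cpp
#include <cassert>

// Rough C++ counterpart of the assembly routine.
// Like the assembly version, the base case is n == 1, so n must be >= 1.
unsigned int factorial(unsigned int n) {
    assert(n >= 1);               // the assembly version has no base case for 0
    if (n == 1) {
        return 1;                 // base case: 1 is returned as-is
    }
    return n * factorial(n - 1);  // multiply the parameter by the recursive result
}
```

Calling factorial(4) gives 24, which is the value the assembly program passes to the kernel as its exit status.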

Future works

For this post, I assume we work with a 32-bit x86 processor. It would be interesting to further investigate how things change in the 64-bit world.

Links to resources

Here are some of the resources I found helpful while preparing this article:

For scaled indexed addressing, it actually includes both base pointer addressing mode (e.g. movl 4(%eax), %ebx) and indexed addressing mode (e.g. movl string_start(, %ecx, 1), %eax).

Andrew Ng's ML Week 06, 11

2017-07-21T12:51:00+08:00

I actually finished the course on June 25th. In this post, I'll summarize the advice and tips given by Prof. Andrew Ng on how to build an effective machine learning system.

To be honest, I used to think this part of the material might not be worth a post, but as I dug deeper into the course I found that it is invaluable: it answers some commonly seen questions that come up when implementing a machine learning system, which can be a huge time-saver. So, I think I need a post to record that advice systematically.

Preface
Diagnostics
Overfitting vs. Underfitting
What to try next?
Neural Network and overfitting
Error analysis
Ceiling analysis
Other issues
- Error metrics for skewed classes: Precision & Recall
- Data for machine learning

Preface

One important question we may ask after implementing a machine learning algorithm is: how good is our learning algorithm? In addition, suppose we have implemented regularized linear regression to predict housing prices and, when we test our hypothesis on a new set of houses, we find that it makes unacceptably large errors in its predictions. What should we try next? This post aims to answer those questions.

In this post, I will first take a look at the diagnostics used to evaluate a learning algorithm. Then, I will define the concepts of overfitting (high variance) and underfitting (high bias) and the concrete techniques to identify which is which. Afterwards, I will talk about several ways to handle the problem and highlight some key points. Lastly, we will take a look at some special cases when the data is skewed or large.

Diagnostics

A diagnostic is a test that you can run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance as to how best to improve its performance. We use the test set error as our basic metric to evaluate our learning algorithm (hypothesis).

We first shuffle our whole data set to eliminate the potential impact of data record ordering. Then, we randomly choose $70\%$ of the data set as our training set and the remaining $30\%$ as our test set. Mathematically, we denote the training set as $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$ and the test set as $(x_{\text{test}}^{(1)}, y_{\text{test}}^{(1)}), (x_{\text{test}}^{(2)}, y_{\text{test}}^{(2)}), \dots, (x_{\text{test}}^{(m_\text{test})}, y_{\text{test}}^{(m_\text{test})})$ with $m_\text{test} = \text{no. of test examples}$.
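The shuffle-then-split step can be sketched in C++ as follows. This is a minimal illustration rather than anything from the course: the function name shuffleAndSplit, the fixed seed, and the rounding rule for the training set size are all my own assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Shuffle `data` and split it into a training part and a test part.
// `trainFraction` (0.7 here) and the fixed `seed` are illustrative choices.
template <typename Example>
std::pair<std::vector<Example>, std::vector<Example>>
shuffleAndSplit(std::vector<Example> data, double trainFraction = 0.7,
                unsigned seed = 42) {
    assert(trainFraction >= 0.0 && trainFraction <= 1.0);
    std::mt19937 rng(seed);
    std::shuffle(data.begin(), data.end(), rng);  // remove any ordering effect

    // Number of examples that go to the training set (rounded to nearest).
    const std::ptrdiff_t mTrain = static_cast<std::ptrdiff_t>(
        std::llround(trainFraction * static_cast<double>(data.size())));

    std::vector<Example> train(data.begin(), data.begin() + mTrain);
    std::vector<Example> test(data.begin() + mTrain, data.end());
    return std::make_pair(train, test);
}
```

With 10 examples and the default fraction, the first part holds 7 examples and the second holds 3; every example ends up in exactly one of the two parts.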

For linear regression, our test set error is calculated by the following steps:
1. Learn parameters $\theta$ from training data (i.e. minimizing training error $J(\theta)$)
2. compute the test set error as follows:
$$ J_\text{test}(\theta) = \frac{1}{2m_\text{test}}\sum_{i=1}^{m_\text{test}}(h_\theta(x_{\text{test}}^{(i)})-y_\text{test}^{(i)})^2 $$
For logistic regression, we can calculate the test set error in a similar way to linear regression, but there is an alternative that follows from the nature of the classification task. In this case, we also call the test set error the misclassification error (or 0/1 misclassification error):

$$ \text{err}(h_\theta(x),y)=\left\{ \begin{array}{ll} 1 \text{ if } h_\theta(x) \ge 0.5, y = 0 \text{ or if } h_\theta(x) < 0.5, y = 1 \\ 0 \text{ otherwise } \end{array} \right. $$

This definition gives us a binary $0$ or $1$ error result based on a misclassification. Then, we calculate test set error as

$$ \text{Test error} = \frac{1}{m_\text{test}}\sum_{i=1}^{m_\text{test}} \text{err}(h_\theta(x_\text{test}^{(i)}),y_\text{test}^{(i)}) $$

This gives us the proportion of the test data that was misclassified.
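To make the 0/1 misclassification error concrete, here is a small C++ sketch. It assumes the hypothesis outputs $h_\theta(x_\text{test}^{(i)})$ have already been computed and stored in a vector; the function names err and testError are mine, chosen to mirror the formulas.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 0/1 error for one example: 1 if the thresholded prediction disagrees with y.
int err(double h, int y) {
    const int predicted = (h >= 0.5) ? 1 : 0;
    return predicted != y ? 1 : 0;
}

// Fraction of test examples that are misclassified.
// h[i] is assumed to hold h_theta(x_test^(i)) and y[i] the label (0 or 1).
double testError(const std::vector<double>& h, const std::vector<int>& y) {
    assert(h.size() == y.size() && !h.empty());
    int wrong = 0;
    for (std::size_t i = 0; i < h.size(); ++i) {
        wrong += err(h[i], y[i]);
    }
    return static_cast<double>(wrong) / static_cast<double>(h.size());
}
```

For example, with outputs 0.9, 0.2, 0.6, 0.4 and labels 1, 0, 0, 1, the last two examples are misclassified, so the test error is 2/4 = 0.5.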

In addition to the test set error, we will define the cross validation set error as well. Instead of dividing the whole data set into a training set and a test set, we can divide it into three parts: training set, cross validation (cv) set, and test set, with $60\%$, $20\%$, and $20\%$ of the data respectively. Mathematically, similar to the notation of the test set, we have $(x_{\text{cv}}^{(1)}, y_{\text{cv}}^{(1)}), (x_{\text{cv}}^{(2)}, y_{\text{cv}}^{(2)}), \dots, (x_{\text{cv}}^{(m_\text{cv})}, y_{\text{cv}}^{(m_\text{cv})})$ with $m_\text{cv} = \text{no. of cv examples}$. The purpose of dividing the data set in this way will be clear in the next section.

Now, we summarize our metrics (training error, cross validation error, and test set error) as follows:

$$ \begin{eqnarray*} J_\text{train}(\theta) &=& \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 \\ J_\text{cv}(\theta) &=& \frac{1}{2m_\text{cv}}\sum_{i=1}^{m_\text{cv}}(h_\theta(x_{\text{cv}}^{(i)})-y_\text{cv}^{(i)})^2 \\ J_\text{test}(\theta) &=& \frac{1}{2m_\text{test}}\sum_{i=1}^{m_\text{test}}(h_\theta(x_{\text{test}}^{(i)})-y_\text{test}^{(i)})^2 \end{eqnarray*} $$

Overfitting vs. Underfitting

In the week 01-03 post, we mentioned the term overfitting when we talked about regularization. Now, we explain it in detail. Overfitting happens when we have too many features: the learned hypothesis may fit the training set very well (i.e. $J(\theta)=\frac{1}{2m} \sum_{i=1}^m(\theta^T x^{(i)}-y^{(i)})^2 \approx 0$) but fail to generalize to new examples (e.g. predicting prices on a new set of houses). Take a look at the following three graphs

The first graph (leftmost) shows the result of fitting our training set with the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$. You can see that the data doesn't really lie on a straight line, and so the fit is not very good. We call this scenario underfitting, which means the model doesn't capture the structure of the data well. Another term for this is high bias. One way to think of high bias is that the algorithm has a strong preconception of what the data should look like, in our case, linear. In summary, underfitting, or high bias, is when the form of our hypothesis function $h$ maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

At the other extreme, shown by the rightmost graph, is overfitting, or high variance. High variance means that the hypothesis can fit almost any function: $h$ is too general and we don't have enough data to constrain it. Overfitting, or high variance, is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

There are a couple of ways to tackle the overfitting problem:

Reduce number of features
- Manually select which feature to keep
- Model selection algorithm
Regularization, which can keep all the features, but reduce magnitude/values of parameters $\theta_j$. This way works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Model selection algorithm

Once the parameters $\theta_0, \theta_1, \dots$ have been fit to some set of data (the training set), the error of the parameters as measured on that data (i.e. $J_\text{train}(\theta)$) is likely to be lower than the actual generalization error. In other words, $J_\text{train}(\theta)$ is a bad metric for predicting how well our hypothesis will generalize to new examples. So, how do we measure how well our hypothesis will perform on new examples? In addition, how do we select which model to use? Ideally, we should pick the model that has the best performance on new examples. As you can tell, these two questions are closely related: both are centered around the metric we use for reporting our model's generalization error.

We can start with the following schemes to pick our model. We use $d$ to denote the degree of polynomial of our model. For example, $d = 1$ means $h_\theta(x) = \theta_0 + \theta_1 x$; $d=2$ means $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$. Then, we can do:

Optimize the parameters in $\Theta$ using the training set for each polynomial degree $d$.
Find the polynomial degree $d$ with the least error $J_\text{test}(\theta)$ using the test set. We pick the model with this $d$ and report our test set error $J_\text{test}(\theta)$ as the metric for estimate of generalization error.

However, there is a problem with this scheme: we use the extra parameter $d$ to fit the test set. In other words, we choose $d$ precisely because it minimizes $J_\text{test}(\theta)$. Our estimate is therefore likely to be optimistic, and our model is likely to do better on the test set than on new examples it hasn't seen before. This is similar to overfitting the training set.

In order to fix this problem, we introduce cross validation set. We modify above scheme as follows:

Optimize the parameters in $\Theta$ using the training set for each polynomial degree
Find the polynomial degree $d$ with the least error using the cross validation set
Estimate the generalization error using the test set with $J_\text{test}(\theta^{(d)})$ ($\theta^{(d)}$ is the parameter $\Theta$ from polynomial with the lowest error)

This way, our $d$ has not been trained using the test set.
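The selection step of the modified scheme can be sketched as below. I assume each candidate degree has already been trained on the training set and that its cross validation and test errors are stored in two vectors, where index k corresponds to degree k + 1; the struct Selection and the function selectDegree are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Selection {
    std::size_t degree;  // chosen polynomial degree d
    double testError;    // J_test for that degree, reported as the generalization estimate
};

// cvError[k] and testError[k] are the errors of the model with degree k + 1.
// The degree is chosen by cross validation error only; the test error is
// looked up afterwards and never influences the choice.
Selection selectDegree(const std::vector<double>& cvError,
                       const std::vector<double>& testError) {
    assert(!cvError.empty() && cvError.size() == testError.size());
    const std::size_t best = static_cast<std::size_t>(
        std::min_element(cvError.begin(), cvError.end()) - cvError.begin());
    return Selection{best + 1, testError[best]};
}
```

With made-up cross validation errors 5.0, 2.0, 1.5, 1.8 and test errors 5.2, 2.1, 1.7, 1.6, the sketch picks $d = 3$ and reports 1.7, even though $d = 4$ happens to have a lower test error. That is the point: picking $d = 4$ would mean fitting $d$ to the test set again.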

Diagnosing bias vs. variance: which is which?

Once we have the metrics and an understanding of the cross validation set, we can find out whether bias or variance is the problem contributing to bad predictions. We have the following picture to help us understand the relationship between $d$ and the underfitting (high bias) or overfitting (high variance) of our hypothesis

The training error will tend to decrease as we increase the degree $d$ of the polynomial, because our hypothesis fits the training data better and better. On the other hand, the cross validation error will tend to decrease as we increase $d$ up to a point (because our model generalizes better), and then it will increase as $d$ increases further (because we now overfit the training data and cannot generalize well to the cross validation set), forming a convex curve. So now, based on the picture, we can answer the question: suppose the learning algorithm is performing less well than you were hoping ($J_\text{cv}(\theta)$ or $J_\text{test}(\theta)$ is high). Is it a bias problem or a variance problem?

High bias (underfitting): $J_\text{train}(\theta)$ will be high; $J_\text{cv}(\theta) \approx J_\text{train}(\theta)$
High variance (overfitting): $J_\text{train}(\theta)$ will be low; $J_\text{cv}(\theta) \gg J_\text{train}(\theta)$
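The two rules above can be turned into a rough check. The sketch below is my own simplification, not something from the course: it assumes a single hypothetical parameter acceptableError that separates "high" from "low" error, and it reports high bias first when the training error is already high.

```cpp
#include <cassert>

enum class Diagnosis { HighBias, HighVariance, LooksFine };

// Rough rule of thumb based on J_train and J_cv.
// acceptableError is a hypothetical, problem-specific cutoff.
Diagnosis diagnose(double jTrain, double jCv, double acceptableError) {
    assert(jTrain >= 0.0 && jCv >= 0.0 && acceptableError >= 0.0);
    if (jTrain > acceptableError) {
        return Diagnosis::HighBias;      // even the training data is fit poorly
    }
    if (jCv > acceptableError) {
        return Diagnosis::HighVariance;  // fits the training data, fails to generalize
    }
    return Diagnosis::LooksFine;
}
```

For instance, with a cutoff of 2.0, errors $(J_\text{train}, J_\text{cv}) = (10, 11)$ are reported as high bias, while $(0.5, 9)$ is reported as high variance.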

Regularization: how to choose $\lambda$?

In the overfitting section above, we saw that regularization is another way to handle overfitting. There is a problem with the regularization method: how do we set the $\lambda$ that appears in $J(\theta)$? In general, when $\lambda$ is large, we tend to underfit (i.e. high bias) the data and when $\lambda$ is small, we tend to overfit (i.e. high variance). In the course, the following method is proposed:

Create a list of $\lambda$ values (e.g. $\lambda = 0, 0.01, 0.02, 0.04, \dots, 10.24$, where each nonzero value is double the previous one)
Create a set of models with different degrees or any other variants
Iterate through $\lambda$s and for each $\lambda$, go through all the models to learn some $\theta$.
Compute the cross validation error $J_{cv}(\theta)$ without regularization term (i.e. $\lambda = 0$) using the learned $\theta$.
Select the best combo that produces the lowest error on the cross validation set
Using the best combo $\lambda$ and $\theta$, apply it on $J_{test}(\theta)$ to see if it has a good generalization.
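The candidate list in the first step is easy to generate programmatically. A minimal sketch, where the function name lambdaCandidates and its parameters smallest and doublings are my own choices:

```cpp
#include <cassert>
#include <vector>

// Build the list 0, smallest, 2*smallest, 4*smallest, ...
// With the defaults this yields 0, 0.01, 0.02, 0.04, ..., 10.24.
std::vector<double> lambdaCandidates(double smallest = 0.01, int doublings = 10) {
    assert(smallest > 0.0 && doublings >= 0);
    std::vector<double> lambdas;
    lambdas.push_back(0.0);  // lambda = 0 means no regularization
    double lambda = smallest;
    for (int k = 0; k <= doublings; ++k) {
        lambdas.push_back(lambda);
        lambda *= 2.0;       // each candidate is double the previous one
    }
    return lambdas;
}
```

With the defaults, the list has 12 entries: zero plus the eleven values from 0.01 up to $0.01 \times 2^{10} = 10.24$.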

We can also plot the bias/variance as a function of the regularization parameter like below:

Learning curves

A learning curve is a tool that helps us identify whether we are facing an underfitting (i.e. high bias) or overfitting (i.e. high variance) problem and, at the same time, gives us a way to answer the question: will getting more training data help us improve the performance of our learning algorithm?

The following picture shows what learning curves look like for a linear regression (i.e. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$):

The x-axis of the learning curves is the training set size $m$ and the y-axis is the error. When $m$ is small, any hypothesis can fit the training data perfectly, and thus our $J_{train}(\theta)$ is small. However, as $m$ increases, our hypothesis cannot fit all the data, and thus $J_{train}(\theta)$ increases. On the other hand, when $m$ is small, our hypothesis cannot generalize much and thus $J_{cv}(\theta)$ tends to be high. However, as $m$ increases, the more data we have, the better hypothesis we can get and thus our hypothesis can generalize well to new examples and $J_{cv}(\theta)$ decreases.

Experience underfitting

Now, let's see what learning curves look like when we face an underfitting (i.e. high bias) problem. For example, we try to fit our data with the hypothesis $h_\theta(x) = \theta_0 + \theta_1x$. When $m$ is small, $J_{train}(\theta)$ will be small, and it will increase as $m$ increases. After a certain point, $J_{train}(\theta)$ will flatten out because our hypothesis is a straight line and more data won't change it much. On the other hand, $J_{cv}(\theta)$ will be high when $m$ is small and will decrease as $m$ increases. Similar to $J_{train}(\theta)$, $J_{cv}(\theta)$ will quickly flatten out because the hypothesis has so few parameters that it cannot generalize any better as the data grows. In other words, when we have high bias, $J_{cv}(\theta)$ and $J_{train}(\theta)$ end up close to each other, and both are high. Thus, our learning curve looks something like this:

Note

$m$ is small: causes $J_{train}(\theta)$ to be low and $J_{cv}(\theta)$ to be high.
$m$ is large: causes both $J_{train}(\theta)$ and $J_{cv}(\theta)$ to be high with $J_{train}(\theta) \approx J_{cv}(\theta)$

From the graph above, we can see that if a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experience overfitting

For the overfitting (i.e. high variance) problem, let's consider the hypothesis $h_\theta(x) = \theta_0 + \theta_1x + \dots + \theta_{100}x^{100}$ (with a small $\lambda$). When $m$ is small, $J_{train}(\theta)$ is small because we fit a very small data set with a very high degree polynomial. As $m$ increases, $J_{train}(\theta)$ increases, but not by much: even though we cannot fit the data perfectly, a high degree polynomial still fits a large data set quite well. For $J_{cv}(\theta)$, we have an overfitting problem regardless of the size of $m$. Even though increasing $m$ helps our hypothesis generalize better, it can still do poorly on new examples. So, the learning curves for the high variance problem look like below:

Note

$m$ is small: $J_{train}(\theta)$ will be low and $J_{cv}(\theta)$ will be high.
$m$ is large: $J_{train}(\theta)$ increases with training set size and $J_{cv}(\theta)$ continues to decrease without levelling off. Also $J_{train}(\theta) < J_{cv}(\theta)$ but the difference between them remains significant.

From the graph above, if we keep getting more data, $J_{cv}(\theta)$ will keep going down. This indicates that if a learning algorithm is suffering from high variance, getting more training data is likely to help.

Note

In the learning curves, $J_{cv}(\theta)$ can be substituted with $J_{test}(\theta)$ and the shape still holds. All in all, we really care about the size of $J_{cv}(\theta)$ (or $J_{test}(\theta)$).

What to try next?

Now, we can answer the question that appeared in the preface section: suppose we have implemented regularized linear regression to predict housing prices. However, when we test our hypothesis on a new set of houses, we find that it makes unacceptably large errors in its predictions. What should we try next?

Method	When it works?
Get more training examples	high variance
Try smaller sets of features	high variance
Try getting additional features	high bias
Try adding polynomial features (i.e. $x_1^{2}$, $x_2^{2}$, $x_1x_2$)	high bias
Try decreasing $\lambda$	high bias
Try increasing $\lambda$	high variance

This leads to model complexity effects:

Lower-order polynomials (low model complexity) have high bias and low variance. The model fits poorly, but consistently so.
Higher-order polynomials (high model complexity) fit the training data extremely well but the test data poorly. They have low bias on the training data but high variance.

Ideally, we should choose a model somewhere in between.

Neural Network and overfitting

Underfitting and overfitting also exist in neural networks. When we use a "small" neural network (i.e. fewer hidden layers, fewer hidden units), we have fewer parameters and are more prone to underfitting. In contrast, when we use a "large" neural network, it is computationally more expensive and, with more parameters, more prone to overfitting. In this case, we use regularization ($\lambda$) to address the issue.

We also face a model selection problem similar to the one we had when working with linear regression. In the neural network setting, that means deciding the number of hidden layers to use in the network. Prof. Ng talks about the following method to solve the problem:

We create a list of number of hidden layers
For each number of hidden layers, we optimize the parameters in $\Theta$ using the training set
Find the number of hidden layers with the least error using the cross validation set

Usually, using a single hidden layer is a good starting default. We can then train networks with different numbers of hidden layers on the training set, evaluate each of them on the cross validation set, and choose the one that performs best.

Error analysis

Error analysis means manually examining the examples (e.g. in the cross validation set) that your algorithm made errors on, and seeing if you can spot any systematic trend in the type of examples it gets wrong. Then, we try some method to see if it helps. The key point during the whole analysis is that we need to come up with a numeric evaluation, which gives a single raw number, to determine how well the system works.

This leads to a recommended approach for handling machine learning problem:

Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation set.
Plot learning curves to decide if more data, more features, etc, are likely to help
Perform error analysis

Ceiling analysis

Ceiling analysis is helpful when we work on a machine learning system that contains many components. It tries to address the question: which part of the pipeline should we focus on improving next? This is done by estimating the errors due to each component. Suppose we have an image recognition system whose pipeline has three components after the input images:

$$ \text{images} \Rightarrow \text{Text detection} \Rightarrow \text{character segmentation} \Rightarrow \text{character recognition} $$

Then, we try to decide which part of the pipeline we should spend the most time trying to improve.

component	Accuracy of overall system
Overall system	72%
Text detection	89%
Character segmentation	90%
Character recognition	100%

Before we improve any component of the system, we have an overall system accuracy of $72\%$. Now, let's take text detection as an example. We manually label where the text is, which gives text detection $100\%$ accuracy. Then, we run the remaining modules and measure the overall system accuracy, which in our case is $89\%$. We perform similar steps for each component in turn, keeping the earlier components perfect. Then, we calculate the gain in system accuracy. For example, we will get a $17\%$ gain by working on text detection (i.e. $89\% - 72\%$), $1\%$ by working on character segmentation (i.e. $90\% - 89\%$), and $10\%$ by working on character recognition (i.e. $100\% - 90\%$). As you can see, text detection will give us the largest gain, and thus we should work on this component next.
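The gain calculation is just a running difference over the accuracy column of the table. A small sketch, assuming the accuracies are given as percentages in pipeline order (the function name ceilingGains is mine):

```cpp
#include <cassert>
#include <vector>

// accuracyWithPerfectStage[k] is the overall accuracy once stages 0..k have
// been replaced by ground truth. The result holds the gain attributable to
// each stage, in the same order.
std::vector<double> ceilingGains(double baselineAccuracy,
                                 const std::vector<double>& accuracyWithPerfectStage) {
    assert(baselineAccuracy >= 0.0);
    std::vector<double> gains;
    double previous = baselineAccuracy;
    for (double accuracy : accuracyWithPerfectStage) {
        gains.push_back(accuracy - previous);  // improvement over the previous row
        previous = accuracy;
    }
    return gains;
}
```

Calling ceilingGains(72.0, {89.0, 90.0, 100.0}) gives gains of 17, 1, and 10, matching the numbers in the table.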

Other issues

Machine learning is interesting because there probably doesn't exist a unified way to handle different kind of data. This means for some special data, we may need to have some special approach to handle them.

Error metrics for skewed classes: Precision & Recall

Skew classes

Consider the following example: we want to do cancer classification using logistic regression, with $y=1$ indicating cancer and $y=0$ otherwise. After training on the training data, we find that we get $1\%$ error on the test set (i.e. $99\%$ correct diagnoses). Can we say that our learning algorithm is performing well? The answer is: it depends. We further examine the data and find out that only $0.5\%$ of patients have cancer. This is a problem for our evaluation, because we could simply predict $y=0$ for every example and get only $0.5\%$ error, which is less than the $1\%$ in our previous case. However, such a classifier is useless. This is a typical scenario of skewed classes, where the number of positive examples $\ll$ the number of negative examples.

Precision & Recall

Rather than using classification error as a measurement to our learning algorithm performance, we use precision and recall instead when we deal with skewed class.

Let $y = 1$ in presence of rare class that we want to detect
$\text{Precision} = \frac{\text{True positives}}{\text{\# predicted positives}} = \frac{\text{True positives}}{\text{True positives}+\text{False positives}}$ (i.e. of all patients where we predicted $y = 1$, what fraction actually have cancer?)
$\text{Recall} = \frac{\text{True positives}}{\text{\# actual positives}} = \frac{\text{True positives}}{\text{True positives}+\text{False negatives}}$ (i.e. of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)

Note

If we apply this concept to the previous scenario when we set $y=0$ for each training data, then the recall will be $0$.
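Here is a short C++ sketch of both metrics computed from the confusion counts. The struct Counts is a hypothetical helper, and returning 0 when the denominator is zero is a convention I picked for this sketch (the definitions themselves leave that case undefined).

```cpp
#include <cassert>

struct Counts {
    int truePositives;
    int falsePositives;
    int falseNegatives;
};

// Of everything predicted positive, what fraction is truly positive?
double precision(const Counts& c) {
    assert(c.truePositives >= 0 && c.falsePositives >= 0);
    const int predictedPositives = c.truePositives + c.falsePositives;
    if (predictedPositives == 0) return 0.0;  // convention chosen for this sketch
    return static_cast<double>(c.truePositives) / predictedPositives;
}

// Of everything truly positive, what fraction did we detect?
double recall(const Counts& c) {
    assert(c.truePositives >= 0 && c.falseNegatives >= 0);
    const int actualPositives = c.truePositives + c.falseNegatives;
    if (actualPositives == 0) return 0.0;     // convention chosen for this sketch
    return static_cast<double>(c.truePositives) / actualPositives;
}
```

Take 1000 patients of whom 5 have cancer. A classifier that always predicts $y=0$ has 0 true positives, 0 false positives, and 5 false negatives, so its recall is 0 despite the $0.5\%$ error. A classifier with 8 true positives, 2 false positives, and 8 false negatives would have precision 0.8 and recall 0.5.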

There is a tradeoff between precision and recall. Suppose we have a logistic regression classifier with $0 \le h_\theta(x) \le 1$ and we want to predict $1$ if $h_\theta(x) \ge \text{threshold}$ and $0$ if $h_\theta(x) < \text{threshold}$. Then, how do we determine the threshold value?

Suppose we want to predict $y=1$ (i.e. cancer) only if very confident: Higher precision, lower recall (e.g. threshold = $0.7$)
Suppose we want to avoid missing too many cases of cancer (i.e. avoid false negatives): Higher recall, lower precision (e.g. threshold = $0.3$)

As you can see, raising the threshold trades recall for precision and lowering it does the opposite, so in general we cannot maximize both at the same time. This tradeoff is depicted in the picture below:

$F_1$ score

Now, we may ask if there is a way to choose our threshold value automatically. That's where the $F_1$ score comes in. If we use $P$ to denote precision and $R$ to denote recall, then the $F_1$ score is defined as follows

$$ F_1 \text{ score} = 2 \frac{PR}{P+R} $$

If $P = 0$ or $R = 0$, then $F_1 \text{ score} = 0$
If $P = 1$ and $R = 1$, then $F_1 \text{ score} = 1$

Then, we can pick the threshold value by measuring $P$ and $R$ on the cross validation set and choose the value of threshold which maximizes our $F_1 \text{ score}$.
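The threshold selection can be sketched as follows. I assume precision and recall have already been measured on the cross validation set for each candidate threshold; the struct Candidate and the function bestThreshold are hypothetical names, and defining the score as 0 when $P + R = 0$ is my own convention to avoid dividing by zero.

```cpp
#include <cassert>
#include <vector>

// F1 = 2PR / (P + R), defined as 0 here when P + R = 0.
double f1Score(double p, double r) {
    assert(p >= 0.0 && p <= 1.0 && r >= 0.0 && r <= 1.0);
    if (p + r == 0.0) return 0.0;
    return 2.0 * p * r / (p + r);
}

struct Candidate {
    double threshold;
    double precision;  // measured on the cross validation set
    double recall;     // measured on the cross validation set
};

// Return the threshold whose (precision, recall) pair maximizes F1.
double bestThreshold(const std::vector<Candidate>& candidates) {
    assert(!candidates.empty());
    double best = candidates.front().threshold;
    double bestScore = -1.0;
    for (const Candidate& c : candidates) {
        const double score = f1Score(c.precision, c.recall);
        if (score > bestScore) {
            bestScore = score;
            best = c.threshold;
        }
    }
    return best;
}
```

With made-up measurements $(P, R) = (0.4, 0.9)$ at threshold 0.3, $(0.6, 0.7)$ at 0.5, and $(0.9, 0.2)$ at 0.7, the $F_1$ scores are roughly 0.55, 0.65, and 0.33, so the sketch picks 0.5.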

Data for machine learning

When should we use a very large training set? Two conditions need to hold before we do that:

Assume feature $x \in R^{n+1}$ has sufficient information to predict $y$ accurately. This can be tested by answering the question: given the input $x$, can a human expert confidently predict $y$?
Use a learning algorithm with many parameters (i.e. logistic regression or linear regression with many features; neural network with many hidden units), which can give us a low bias algorithm.

Virtual methods and polymorphism in C++

2017-07-12T23:11:00+08:00

Virtual methods
Pure virtual function & abstract base class
Key terms summary

Surprisingly, even though I work on a product that is written mostly in C++, I rarely have to deal with the features that differentiate C++ from C. However, I'm now working on a defect that forces me to simplify C++ objects in order to get to the root cause of the problem. That's where I really need to know exactly how a C++ class is structured.

One question I asked myself several years ago was: "What is a virtual method in C++?" I believed, at that time, that I got the answer, but I was too lazy to record it somewhere. Now, I have to pay the price by spending the effort again to dig out the answer. So, I'd better save it someplace this time; the following is just a simple example that partially answers the question. I know C++ is a monster and I'll definitely need to rewrite this post some day when I know more about the language. However, this answer is good enough for me for now.

Virtual methods

Let's consider the following code snippet: we have a base class called Animal and its subclass called Dog

#include <iostream>
using namespace std;

class Animal{
public:
  void getFamily() { cout << "We are animals" << endl; }
  void getClass() { cout << "I'm an Animal" << endl;}
};

class Dog: public Animal{
public:
  void getClass() { cout << "I'm a Dog" << endl;}
};

int main()
{
  Animal *animal = new Animal;
  Dog *dog = new Dog;

  animal->getClass();
  dog->getClass();
}

Now, inside the main, we call animal->getClass() and dog->getClass(). We compile our code using g++ -std=c++11 a.cpp and run the program and get

I'm an Animal
I'm a Dog

As you can see, each object calls their getClass method respectively. Now, let's add a function called whatClassAreYou() to our code above

void whatClassAreYou(Animal *animal)
{
  animal->getClass();
}

and in our main function, we call our newly-added function with

whatClassAreYou(animal);
whatClassAreYou(dog);

and the output looks like below

I'm an Animal
I'm a Dog
I'm an Animal
I'm an Animal

As you can see, whatClassAreYou() only calls the getClass() method of our base class Animal, even when we pass in a Dog object. Ideally, we want whatClassAreYou() to call the right getClass() method depending on the class of the object we pass in. In other words, if our Dog class implements its own getClass() method, we want whatClassAreYou() to be aware of this fact and call it instead of the getClass() of our base class Animal. That's why we add the virtual keyword to getClass() in our base class. We are essentially telling the compiler that the base class getClass() method might be overridden by a subclass, and that calls made through a pointer or reference to the base class should take this into account.

Now our code looks like below

#include <iostream>
using namespace std;

// Virtual Methods and Polymorphism
// Polymorphism allows you to treat subclasses as their superclass and yet
// call the correct overridden methods in the subclass automatically

class Animal{
public:
  void getFamily() { cout << "We are animals" << endl; }

  // When we define a method as virtual we signal that Animal
  // will be a base class that may have this method overridden
  virtual void getClass() { cout << "I'm an Animal" << endl;}
};

class Dog: public Animal{
public:
  void getClass() { cout << "I'm a Dog" << endl;}
};

void whatClassAreYou(Animal *animal)
{
  animal->getClass(); // use "virtual", proper getClass() method will be called depending on
                      // the exact type of Animal* animal get passed in (i.e. base class Animal
                      // or subclass Dog)
}

int main()
{
  Animal *animal = new Animal;
  Dog *dog = new Dog;

  // If a method is marked virtual or not doesn't matter if we call the
  // method directly from the object
  animal->getClass();
  dog->getClass();

  whatClassAreYou(animal);
  whatClassAreYou(dog);
}

and the output is

I'm an Animal
I'm a Dog
I'm an Animal
I'm a Dog

The mechanism behind this behavior is called polymorphism, which means "many forms" in Greek. Here is how this concept is explained in C++ Primer:

We speak of types related by inheritance as polymorphic types, because we can use the “many forms” of these types while ignoring the differences among them. The fact that the static and dynamic types of references and pointers can differ is the cornerstone of how C++ supports polymorphism.

When we call a function defined in a base class through a reference or pointer to the base class, we do not know the type of the object on which that member is executed. The object can be a base-class object or an object of a derived class. If the function is virtual, then the decision as to which function to run is delayed until run time. The version of the virtual function that is run is the one defined by the type of the object to which the reference is bound or to which the pointer points. On the other hand, calls to nonvirtual functions are bound at compile time. Similarly, calls to any function (virtual or not) on an object are also bound at compile time. The type of an object is fixed and unvarying—there is nothing we can do to make the dynamic type of an object differ from its static type. Therefore, calls made on an object are bound at compile time to the version defined by the type of the object.

Note

Virtuals are resolved at run time only if the call is made through a reference or pointer. Only in these cases is it possible for an object’s dynamic type to differ from its static type.

To see the final note of the above quote in action, let's take a look at an example

#include <iostream>
using namespace std;

class Animal{
public:
  void getFamily() { cout << "We are animals" << endl; }
  virtual void getClass() { cout << "I'm an Animal" << endl;}
};

class Dog: public Animal{
public:
  void getClass() { cout << "I'm a Dog" << endl;}
};

void whatClassAreYou(Animal *animal)
{
  animal->getClass(); 
}

void whatClassAreYou2(Animal animal)
{
  animal.getClass();
}

int main()
{
  Animal *animal = new Animal;
  Dog *dog = new Dog;

  animal->getClass();
  dog->getClass();

  whatClassAreYou(animal);
  whatClassAreYou(dog);

  Animal animal2;
  Dog dog2;

  whatClassAreYou2(animal2);
  whatClassAreYou2(dog2);
}

In this example, we define another function, whatClassAreYou2, that takes its argument as an object (by value) instead of as a pointer. Now, we apply this function to our newly created objects animal2 and dog2 and we get

I'm an Animal
I'm a Dog
I'm an Animal
I'm a Dog
I'm an Animal
I'm an Animal

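As you can see, even though we have a virtual function in our base class, whatClassAreYou2() invokes only the base class's getClass() method and ignores the subclass override. This is because the parameter is passed by value: the Dog argument is copied into a plain Animal object (often called object slicing), so inside the function both the static and the dynamic type are Animal.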
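The quote mentions references as well as pointers, so here is a compact sketch that puts the three ways of passing an object side by side. To keep the result easy to check, I made the hypothetical method className() return a string instead of printing, and I used the C++11 override keyword; neither appears in the examples above.

```cpp
#include <cassert>
#include <string>

class Animal {
public:
    virtual ~Animal() {}
    virtual std::string className() const { return "Animal"; }
};

class Dog : public Animal {
public:
    std::string className() const override { return "Dog"; }
};

// By value: the argument is copied into a plain Animal (the Dog part is sliced off).
std::string byValue(Animal animal) { return animal.className(); }

// By reference: no copy is made, so the dynamic type is still Dog.
std::string byReference(const Animal& animal) { return animal.className(); }

// By pointer: same situation as whatClassAreYou().
std::string byPointer(const Animal* animal) {
    assert(animal != nullptr);
    return animal->className();
}
```

Given a Dog object d, byValue(d) returns "Animal", while byReference(d) and byPointer(&d) both return "Dog".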

As an aside, you may notice that we use two ways to create our objects. The first way is through Animal *animal = new Animal; and the second way is through Animal animal2;. Initialization is quite complex in C++. These two ways are essentially the same in how the object is initialized: both use the default constructor ¹. The difference is that the first way allocates the object dynamically and gives us a pointer to it, while the second way gives us the object directly. To obtain a pointer to the second object, we can do Animal *ptrAnimal = &animal2. Another way to initialize an object is through value initialization ². For example:

string *ps1 = new string;   // default initialized to the empty string
string *ps = new string();  // value initialized to the empty string
int *pi1 = new int;         // default initialized; *pi1 is undefined
int *pi2 = new int();       // value initialized to 0; *pi2 is 0

Pure virtual function & abstract base class

Another important concept related to virtual is called pure virtual. The difference between the two, as stated in Wikipedia, is:

A virtual function or virtual method is a function or method whose behavior can be overriden within an inheriting class by a function with the same signature. A pure virtual function or pure virtual method is a virtual function that is required to be implemented by a derived class that is not abstract.

In short, virtual function can be overriden; pure virtual function must be implemented. Let's take a look a example

#include <iostream>
using namespace std;

// An abstract base class acts as the base for other classes.
// It stands out because at least one of its methods is declared with "= 0".
// A pure virtual method must be overridden by every concrete subclass.

class Car
{
public:
  virtual int getNumWheels() = 0;
  virtual int getNumDoors() = 0;
  virtual ~Car() {} // virtual destructor: deleting through a Car* is safe
};

class StationWagon : public Car
{
public:
  int getNumWheels() { cout << "Station wagon has 4 wheels" << endl; return 4; }
  int getNumDoors() { cout << "Station wagon has 4 doors" << endl; return 4; }
  StationWagon() {}
  ~StationWagon() {}
};

int main()
{
  // Create a StationWagon and refer to it through the abstract base class Car
  Car *stationWagon = new StationWagon();

  stationWagon->getNumWheels();

  delete stationWagon;
  return 0;
}

Here, we have a base class called Car and a class derived from it called StationWagon. Car is similar to our Animal class in that the methods of both classes are declared with the keyword virtual. However, the Car class's methods also have =0. This is exactly how we identify a pure virtual function: we specify that a virtual function is pure by writing =0 in place of a function body (i.e., just before the semicolon that ends the declaration).

Note

Unlike ordinary virtuals, a pure virtual function does not have to be defined.
=0 may appear only on the declaration of a virtual function in the class body.

A class like Car that contains (or inherits without overriding) a pure virtual function is an abstract base class. An abstract base class defines an interface for subsequent classes to override. We cannot (directly) create objects of a type that is an abstract base class.
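The "inherits without overriding" case can be checked at compile time. Below is a minimal sketch, assuming a const variant of the Car interface and a hypothetical class HalfBuiltCar that overrides only one of the two pure virtuals; std::is_abstract from the standard header type_traits (C++11) reports whether a class is abstract. The helper countWheelsAndDoors is also hypothetical.

```cpp
#include <cassert>
#include <type_traits>

class Car {
public:
  virtual ~Car() {}
  virtual int getNumWheels() const = 0;
  virtual int getNumDoors() const = 0;
};

// Hypothetical class: overrides only one of the two pure virtuals, so it
// inherits getNumDoors() as pure virtual and is still abstract.
class HalfBuiltCar : public Car {
public:
  int getNumWheels() const { return 4; }
};

// Overrides both pure virtuals, so objects of this type can be created.
class StationWagon : public Car {
public:
  int getNumWheels() const { return 4; }
  int getNumDoors() const { return 4; }
};

static_assert(std::is_abstract<Car>::value, "Car is abstract");
static_assert(std::is_abstract<HalfBuiltCar>::value, "HalfBuiltCar is still abstract");
static_assert(!std::is_abstract<StationWagon>::value, "StationWagon is concrete");

// Works with any concrete Car through the base-class interface.
int countWheelsAndDoors(const Car &car) {
  int total = car.getNumWheels() + car.getNumDoors();
  assert(total >= 0);  // sanity check for this sketch
  return total;
}
```

Writing HalfBuiltCar h; would be rejected by the compiler, while StationWagon objects can be created and passed to countWheelsAndDoors.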

Key terms summary

Here, I summarize the terms that appeared in this post as a quick index for future reference:

virtual: Member function that defines type-specific behavior. Calls to a virtual function made through a reference or pointer are resolved at run time, based on the type of the object to which the reference or pointer is bound.
pure virtual: Virtual function declared in the class header using =0 just before the semicolon. A pure virtual function need not be (but may be) defined. Classes with pure virtuals are abstract classes. If a derived class does not define its own version of an inherited pure virtual, then the derived class is abstract as well.
polymorphism: As used in object-oriented programming, refers to the ability to obtain type-specific behavior based on the dynamic type of a reference or pointer.
static type: Type with which a variable is defined or that an expression yields. Static type is known at compile time.
dynamic type: Type of an object at runtime. The dynamic type of an object to which a reference refers or to which a pointer points may differ from the static type of the reference or pointer. A pointer or reference to a base-class type can refer to an object of derived type. In such cases the static type is reference (or pointer) to base, but the dynamic type is reference (or pointer) to derived.
abstract base class: Class that has one or more pure virtual functions. We cannot create objects of an abstract base-class type.

¹ See Section 7.1.4 Constructors (p.262) of C++ Primer (5th edition) for details.
² See Section 12.1.2 Managing Memory Directly (p.459) of C++ Primer (5th edition) for details.

Merge sort

2017-07-01T23:33:00+08:00

We continue our journey in sorting. Specifically, we'll study the mergesort in this post.

Concept

The fundamental idea in mergesort is merging two sorted lists into one. Because the lists are sorted, this can be done in one pass through the input, if the output is put in a third list. The basic merging algorithm takes two input arrays $A$ and $B$, an output array $C$, and three counters, Aptr, Bptr, and Cptr, which are initially set to the beginning of their respective arrays. The smaller of A[Aptr] and B[Bptr] is copied to the next entry in $C$, and the appropriate counters are advanced. When either input list is exhausted, the remainder of the other list is copied to $C$.

The running time for merging is $O(N)$, because at most $N-1$ comparisons are made, where $N$ is the total number of elements. To see this, note that every comparison adds an element to $C$, except the last comparison, which adds at least two.
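The merging step and its comparison bound can be sketched as follows. This is a minimal illustration using std::vector rather than the raw arrays used later in this post; the function name mergeSorted and the comparisons out-parameter are hypothetical and exist only to make the bound observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merges two sorted vectors a and b into a new vector c.
// comparisons receives the number of element comparisons performed.
std::vector<int> mergeSorted(const std::vector<int> &a,
                             const std::vector<int> &b,
                             std::size_t &comparisons) {
  std::vector<int> c;
  c.reserve(a.size() + b.size());
  std::size_t aptr = 0, bptr = 0;
  comparisons = 0;

  // Main loop: exactly one comparison per element copied here.
  while (aptr < a.size() && bptr < b.size()) {
    ++comparisons;
    if (a[aptr] <= b[bptr])
      c.push_back(a[aptr++]);
    else
      c.push_back(b[bptr++]);
  }
  // One input is exhausted: copy the rest of the other without comparing.
  while (aptr < a.size()) c.push_back(a[aptr++]);
  while (bptr < b.size()) c.push_back(b[bptr++]);

  // At most N - 1 comparisons when both inputs are non-empty.
  assert(a.empty() || b.empty() || comparisons <= c.size() - 1);
  return c;
}
```

Merging 1, 3 with 2, 4 uses three comparisons, which is the worst case $N - 1$ for $N = 4$; merging 1, 13, 24, 26 with 2, 15, 27, 38 needs only six, because the last two elements are copied without any comparison.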

Once we have this idea in mind, we can now describe our mergesort algorithm:

If $N=1$, there is only one element to sort, and we are done.
Otherwise, we recursively mergesort the first half and the second half. This gives two sorted halves, which can then be merged together using the merging algorithm.

As you can see, mergesort is a classic example of the divide-and-conquer strategy. The problem is divided into smaller problems and solved recursively. The conquering phase consists of patching together the answers.

The mergesort algorithm can be implemented as follows:

#include <assert.h>
#include <stdlib.h>

void
merge(int A[],
      int tmpArray[],
      int lpos, // start of left half
      int rpos, // start of right half
      int rightEnd)
{
  int i, leftEnd, numElements, tmpPos;

  leftEnd = rpos - 1;
  tmpPos = lpos;
  numElements = rightEnd - lpos + 1;

  // main loop
  while(lpos <= leftEnd && rpos <= rightEnd)
    if(A[lpos] <= A[rpos])
      tmpArray[tmpPos++] = A[lpos++];
    else
      tmpArray[tmpPos++] = A[rpos++];

  while(lpos <= leftEnd) // Copy rest of first half
    tmpArray[tmpPos++] = A[lpos++];
  while(rpos <= rightEnd) // Copy rest of second half
    tmpArray[tmpPos++] = A[rpos++];

  for(i = 0; i < numElements; i++, rightEnd--) // copy tmpArray back
    A[rightEnd] = tmpArray[rightEnd];
}

void
msort(int A[],
      int tmpArray[],
      int left,
      int right)
{
  int center;
  if(left < right)
  {
    center = (left + right)/2;
    msort(A, tmpArray, left, center);
    msort(A, tmpArray, center+1, right);
    merge(A, tmpArray, left, center+1, right);
  }
}

void
mergeSort(int A[], int N)
{
  int *tmpArray;
  tmpArray = malloc(N*sizeof(int));
  assert(tmpArray);
  msort(A, tmpArray, 0, N-1);
  free(tmpArray);
}

Note that tmpArray plays the role of array $C$ in our merging algorithm: it holds the merge result of the two sorted input ranges. A naive implementation would declare a temporary array locally each time merge is called. This can be problematic because there could be $\log N$ temporary arrays active at any point. That could be fatal on a machine with little memory and, at the same time, we would spend quite a lot of time calling malloc.

The trick in our implementation is that we allocate a single temporary array tmpArray of size $N$ once, in mergeSort, and pass it down to every call. Then, we use lpos, rpos, and rightEnd to control which portion of tmpArray is used by each merge step. This is a common implementation trick, which we will see again immediately.

Like many other recursive algorithms, mergesort can also be implemented non-recursively as follows:

void
mergeSortNonRecursive(int A[], int N)
{
  int *tmpArray;
  int subListSize, part1Start, part2Start, part2End;

  tmpArray = malloc(sizeof(int) * N);
  assert(tmpArray);
  for(subListSize = 1; subListSize < N; subListSize *= 2)
  {
    part1Start = 0;
    // merge only while the second sublist is non-empty (part2Start <= N - 1)
    while(part1Start + subListSize < N)
    {
      part2Start = part1Start + subListSize;
      part2End = part2Start + subListSize - 1;
      if(part2End > N - 1) // the last sublist may be shorter than subListSize
        part2End = N - 1;
      merge(A, tmpArray, part1Start, part2Start, part2End);
      part1Start = part2End + 1;
    }
  }
  free(tmpArray);
}

Let's take a look at an example to better understand the implementation above. Suppose we want to sort the following list using mergesort: $31, 41, 59, 26, 53, 58, 97$. We start from the most basic case: merging two sorted lists of one element each into a sorted list of two elements. For example, part1Start = 0, part2Start = 1, part2End = 1 in the first iteration of the while loop when subListSize = 1. Then, we call the merge function and use the portion of tmpArray from 0 to 1 to hold the merge result. The loop condition part1Start + subListSize < N holds exactly when the second sublist is non-empty; for instance, when subListSize = 2 it still merges 53, 58 with the single leftover element 97. We can print out the values of part1Start, part2Start, and part2End to help us better understand the flow of the program:

Before mergeSort: 31, 41, 59, 26, 53, 58, 97,
part1Start: 0
part2Start: 1
part2End: 1
part1Start: 2
part2Start: 3
part2End: 3
part1Start: 4
part2Start: 5
part2End: 5
part1Start: 0
part2Start: 2
part2End: 3
part1Start: 4
part2Start: 6
part2End: 6
part1Start: 0
part2Start: 4
part2End: 6
After mergeSort: 26, 31, 41, 53, 58, 59, 97,

Analysis

The running time of mergesort is $O(N \log N)$, which can be obtained by solving the recurrence relation:

$$ \begin{eqnarray*} T(1) &=& 1 \\ T(N) &=& 2T(N/2) + N \end{eqnarray*} $$

Note that we assume $N = 2^k$ when solving the above recurrence relation. The answer is almost identical even if $N$ is not a power of $2$.
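One way to solve the recurrence, under the same assumption $N = 2^k$, is telescoping. Dividing both sides of $T(N) = 2T(N/2) + N$ by $N$ and applying the same step to each smaller power of two gives

$$ \begin{eqnarray*} \frac{T(N)}{N} &=& \frac{T(N/2)}{N/2} + 1 \\ \frac{T(N/2)}{N/2} &=& \frac{T(N/4)}{N/4} + 1 \\ &\vdots& \\ \frac{T(2)}{2} &=& \frac{T(1)}{1} + 1 \end{eqnarray*} $$

There are $\log N$ equations (logarithm base $2$). Adding them up, every term except $T(N)/N$ on the left and $T(1)/1$ on the right cancels, which leaves

$$ \begin{eqnarray*} \frac{T(N)}{N} &=& \frac{T(1)}{1} + \log N \\ T(N) &=& N \log N + N = O(N \log N) \end{eqnarray*} $$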

Final remarks

We hardly ever use mergesort for main-memory sorts. The main problem is that merging two sorted lists requires linear extra memory, and the additional work spent copying to the temporary array and back, throughout the algorithm, slows down the sort considerably. Thus, for serious internal sorting applications, we use quicksort instead. Nevertheless, the merging routine is the cornerstone of most external sorting algorithms.

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 7

join in SQL

2017-06-23T21:23:00+08:00

In this post, I'll provide a summary of the usage of the join statement in SQL and use Leetcode 175 as a concrete example to show several equivalent ways of writing a join.

To be honest, this is my third time visiting this material. The very first time happened when I took DB course in college, and the second time was when I joined the federation team at IBM and learned about DB2. Unfortunately, I didn't keep my study notes well in the first two tries and I don't write SQL a lot during my day to day work. Things, again, get rusty very quickly. This time I want to do a better job by, at least, saving my notes in a good place ¹.

Summary
Motivation
- Cartesian product
join
- SQL perspective
  - inner join
  - outer join
    - Motivation
    - outer join
- Relational algebra perspective
Examples
- Leetcode 175
- Example Two
Links to resources

Summary

natural join produces a relation from two relations by considering only those pairs of tuples with the same value on those attributes that appear in the schemas of both relations.
join ... using specifies a subset of common attributes to join.
join ... on specifies a predicate to use on join.
outer join is used when we want to preserve the tuples of either or both relations that have no matching tuple in the other relation; the attributes coming from the other relation are filled with null.

Motivation

Before we jump directly into the SQL, I want to briefly talk about the motivation for the join statement. Specifically, why do we need it? In short, join is used as a shorthand for a widely used type of query where we want to equate two columns in two tables in the where clause (e.g., T1.a = T2.a).

To retrieve data from multiple relations (i.e. more than one table), we can either use a cartesian product or join of columns of the same data type.

Cartesian product

The cartesian product applies to the relations listed in the from clause of a SQL query. The end result of the cartesian product is a relation that has all attributes from all the relations in the from clause. The following iterative process shows how the cartesian product of the relations in the from clause is generated

for each tuple t1 in relation r1
  for each tuple t2 in relation r2
    . . .
    for each tuple tm in relation rm
      Concatenate t1, t2, . . . , tm into a single tuple t
      Add t into the result relation

Another way to understand the cartesian product is through the relational algebra cross-product $R \times S$, which is defined as: returns a relation instance whose schema contains all the fields of $R$ (in the same order as they appear in $R$) followed by all the fields of $S$ (in the same order as they appear in $S$). The result of $R \times S$ contains all tuples $(r,s)$ (the concatenation of tuples $r$ and $s$) for each pair of tuples $r \in R$, $s \in S$.
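For two relations, the nested-loop process can be sketched in a few lines. This is only an illustration: the row types Instructor and Teaches are hypothetical, cut down to two attributes each, and crossProduct is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row types: only two attributes each to keep the sketch short.
struct Instructor { std::string id; std::string name; };
struct Teaches { std::string id; std::string courseId; };

// R x S: pair every tuple of r with every tuple of s.
std::vector<std::pair<Instructor, Teaches> >
crossProduct(const std::vector<Instructor> &r, const std::vector<Teaches> &s) {
  std::vector<std::pair<Instructor, Teaches> > result;
  result.reserve(r.size() * s.size());
  for (std::size_t i = 0; i < r.size(); ++i)      // for each tuple t1 in r1
    for (std::size_t j = 0; j < s.size(); ++j)    // for each tuple t2 in r2
      result.push_back(std::make_pair(r[i], s[j]));
  // The result always has |R| * |S| tuples.
  assert(result.size() == r.size() * s.size());
  return result;
}
```

With 2 instructors and 3 teaches rows, the result has 6 pairs, and each instructor appears next to every teaches row whether or not the IDs match.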

To see a concrete example, let's consider the following SQL

select * from instructor, teaches;

The teaches and instructor tables look like this

sqlite> select * from teaches;
ID|course_id|sec_id|semester|year
10101|CS-101|1|Fall|2009
10101|CS-315|1|Spring|2010
10101|CS-347|1|Fall|2009
12121|FIN-201|1|Spring|2010
15151|MU-199|1|Spring|2010
22222|PHY-101|1|Fall|2009
32343|HIS-351|1|Spring|2010
45565|CS-101|1|Spring|2010
45565|CS-319|1|Spring|2010
76766|BIO-101|1|Summer|2009
76766|BIO-301|1|Summer|2010
83821|CS-190|1|Spring|2009
83821|CS-190|2|Spring|2009
83821|CS-319|2|Spring|2010
98345|EE-181|1|Spring|2009

sqlite> select * from instructor;
ID|name|dept_name|salary
10101|Srinivasan|Comp. Sci.|65000
12121|Wu|Finance|90000
15151|Mozart|Music|40000
22222|Einstein|Physics|95000
32343|El Said|History|60000
33456|Gold|Physics|87000
45565|Katz|Comp. Sci.|75000
58583|Califieri|History|62000
76543|Singh|Finance|80000
76766|Crick|Biology|72000
83821|Brandt|Comp. Sci.|92000
98345|Kim|Elec. Eng.|80000

The teaches table has 15 rows and the instructor table has 12 rows. If we run the query above (limited to the first 20 rows here), the result looks like

sqlite> select * from instructor, teaches limit 20;
ID|name|dept_name|salary|ID|course_id|sec_id|semester|year
10101|Srinivasan|Comp. Sci.|65000|10101|CS-101|1|Fall|2009
10101|Srinivasan|Comp. Sci.|65000|10101|CS-315|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|10101|CS-347|1|Fall|2009
10101|Srinivasan|Comp. Sci.|65000|12121|FIN-201|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|15151|MU-199|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|22222|PHY-101|1|Fall|2009
10101|Srinivasan|Comp. Sci.|65000|32343|HIS-351|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|45565|CS-101|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|45565|CS-319|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|76766|BIO-101|1|Summer|2009
10101|Srinivasan|Comp. Sci.|65000|76766|BIO-301|1|Summer|2010
10101|Srinivasan|Comp. Sci.|65000|83821|CS-190|1|Spring|2009
10101|Srinivasan|Comp. Sci.|65000|83821|CS-190|2|Spring|2009
10101|Srinivasan|Comp. Sci.|65000|83821|CS-319|2|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|98345|EE-181|1|Spring|2009
12121|Wu|Finance|90000|10101|CS-101|1|Fall|2009
12121|Wu|Finance|90000|10101|CS-315|1|Spring|2010
12121|Wu|Finance|90000|10101|CS-347|1|Fall|2009
12121|Wu|Finance|90000|12121|FIN-201|1|Spring|2010
12121|Wu|Finance|90000|15151|MU-199|1|Spring|2010

The full result relation has 180 rows, which is exactly $12 \times 15$.

Quite often, we use where clause to restrict the combinations created by the cartesian product to those that are meaningful for the desired answer. For example:

select name, course_id
from instructor, teaches
where instructor.ID = teaches.ID;

In this query, we combine information from the instructor and teaches tables, and the matching condition requires instructor.ID to be equal to teaches.ID. In fact, these are the only attributes in the two relations that have the same name. In general, we often find ourselves writing SQL queries that require all attributes with matching names to be equated in the where clause. This case is so common that we use join to save us some effort.

join

In this section, I'll talk about join from SQL perspective, and then I'll also present how join is actually defined in relational algebra.

SQL perspective

There are two basic types of join: inner join and outer join. inner keyword is optional. In other words, if only join appears in the SQL statement, we usually assume it to be inner join. Under outer join, we can further specify whether it is left outer join, right outer join, or full outer join. Here is a graphic summary for the text above

- (inner) join
- outer join
   |- left outer join
   |- right outer join
   |- full outer join

In addition, there are join conditions that we can use in combination with the join form mentioned above. Any form of join (inner, left outer, right outer, or full outer) can be combined with any join condition (natural, using, or on). The table below provides a summary of join types and join conditions

Join types	Join conditions
inner join	natural
left outer join	on
right outer join	using ($A_1, A_2, \dots, A_n$)
full outer join

Then the SQL syntax is

[table1] <natural> [Join types] [table2] <on | using>

inner join

natural join

We start this section by considering natural join. The natural join works on two relations and produces a relation as the result. Unlike the cartesian product of two relations, which concatenates each tuple of the first relation with every tuple of the second, natural join considers only those pairs of tuples with the same value on those attributes that appear in the schemas of both relations.

select * from instructor natural join teaches;

For the query above, computing instructor natural join teaches considers only those pairs of tuples where both the tuple from instructor and the tuple from teaches have the same value on the common attribute, ID.

sqlite> select * from instructor natural join teaches;
ID|name|dept_name|salary|course_id|sec_id|semester|year
10101|Srinivasan|Comp. Sci.|65000|CS-101|1|Fall|2009
10101|Srinivasan|Comp. Sci.|65000|CS-315|1|Spring|2010
10101|Srinivasan|Comp. Sci.|65000|CS-347|1|Fall|2009
12121|Wu|Finance|90000|FIN-201|1|Spring|2010
15151|Mozart|Music|40000|MU-199|1|Spring|2010
22222|Einstein|Physics|95000|PHY-101|1|Fall|2009
32343|El Said|History|60000|HIS-351|1|Spring|2010
45565|Katz|Comp. Sci.|75000|CS-101|1|Spring|2010
45565|Katz|Comp. Sci.|75000|CS-319|1|Spring|2010
76766|Crick|Biology|72000|BIO-101|1|Summer|2009
76766|Crick|Biology|72000|BIO-301|1|Summer|2010
83821|Brandt|Comp. Sci.|92000|CS-190|1|Spring|2009
83821|Brandt|Comp. Sci.|92000|CS-190|2|Spring|2009
83821|Brandt|Comp. Sci.|92000|CS-319|2|Spring|2010
98345|Kim|Elec. Eng.|80000|EE-181|1|Spring|2009

Since join is really a shorthand for a type of SQL query written with a cartesian product, we can get the same rows using select * from instructor, teaches where instructor.ID = teaches.ID; (there, however, the ID column appears twice). From the query result above we can see that:

We do not repeat those attributes that appear in the schemas of both relations; rather they appear only once (e.g. only one ID column, not two).
The attributes are listed in this order: first the attributes common to the schemas of both relations, second those attributes unique to the schema of the first relation, and finally, those attributes unique to the schema of the second relation.
All the columns from the instructor table (4 columns) and the teaches table (5 columns) show up in the final result (8 columns in total, because the shared ID column shows up only once).
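The matching rule behind natural join can be sketched as follows. This is a rough illustration under simplifying assumptions: a row is modeled as a map from attribute name to value, the type names Row and Relation and both function names are hypothetical, and the attribute ordering described above is ignored because std::map keeps its keys sorted by name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Row;  // attribute name -> value
typedef std::vector<Row> Relation;

// True when the two rows agree on every attribute name they share.
bool agreeOnCommonAttributes(const Row &a, const Row &b) {
  for (Row::const_iterator it = a.begin(); it != a.end(); ++it) {
    Row::const_iterator match = b.find(it->first);
    if (match != b.end() && match->second != it->second)
      return false;
  }
  return true;
}

// Natural join: keep the pairs that agree on all common attributes and
// store each common attribute only once in the combined row.
Relation naturalJoin(const Relation &r, const Relation &s) {
  Relation result;
  for (std::size_t i = 0; i < r.size(); ++i) {
    for (std::size_t j = 0; j < s.size(); ++j) {
      if (!agreeOnCommonAttributes(r[i], s[j]))
        continue;
      Row combined = r[i];
      combined.insert(s[j].begin(), s[j].end());  // keys already present are kept
      assert(combined.size() <= r[i].size() + s[j].size());
      result.push_back(combined);
    }
  }
  return result;
}
```

Joining an instructor row (ID, name) with a teaches row (ID, course_id) that has the same ID yields one combined row with three attributes, not four, which mirrors the single ID column in the output above.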

In addition, natural join will consider ALL the attributes that appear in the schemas of both relations. Consider the following query

select *
from instructor natural join teaches natural join course;

course table looks like the following

sqlite> select * from course;
course_id|title|dept_name|credits
BIO-101|Intro. to Biology|Biology|4
BIO-301|Genetics|Biology|4
BIO-399|Computational Biology|Biology|3
CS-101|Intro. to Computer Science|Comp. Sci.|4
CS-190|Game Design|Comp. Sci.|4
CS-315|Robotics|Comp. Sci.|3
CS-319|Image Processing|Comp. Sci.|3
CS-347|Database System Concepts|Comp. Sci.|3
EE-181|Intro. to Digital Systems|Elec. Eng.|3
FIN-201|Investment Banking|Finance|3
HIS-351|World History|History|3
MU-199|Music Video Production|Music|3
PHY-101|Physical Principles|Physics|4

instructor has attributes (ID|name|dept_name|salary); teaches has attributes (ID|course_id|sec_id|semester|year); course has attributes (course_id|title|dept_name|credits). The first natural join will first do cartesian product of instructor and teaches and keep the tuples that have the same value on ID. Then, the resulting relation will do the second natural join with course and will keep the tuples that have the same value on course_id and dept_name.

sqlite> select * from instructor natural join teaches natural join course;
ID|name|dept_name|salary|course_id|sec_id|semester|year|title|credits
10101|Srinivasan|Comp. Sci.|65000|CS-101|1|Fall|2009|Intro. to Computer Science|4
10101|Srinivasan|Comp. Sci.|65000|CS-315|1|Spring|2010|Robotics|3
10101|Srinivasan|Comp. Sci.|65000|CS-347|1|Fall|2009|Database System Concepts|3
12121|Wu|Finance|90000|FIN-201|1|Spring|2010|Investment Banking|3
15151|Mozart|Music|40000|MU-199|1|Spring|2010|Music Video Production|3
22222|Einstein|Physics|95000|PHY-101|1|Fall|2009|Physical Principles|4
32343|El Said|History|60000|HIS-351|1|Spring|2010|World History|3
45565|Katz|Comp. Sci.|75000|CS-101|1|Spring|2010|Intro. to Computer Science|4
45565|Katz|Comp. Sci.|75000|CS-319|1|Spring|2010|Image Processing|3
76766|Crick|Biology|72000|BIO-101|1|Summer|2009|Intro. to Biology|4
76766|Crick|Biology|72000|BIO-301|1|Summer|2010|Genetics|4
83821|Brandt|Comp. Sci.|92000|CS-190|1|Spring|2009|Game Design|4
83821|Brandt|Comp. Sci.|92000|CS-190|2|Spring|2009|Game Design|4
83821|Brandt|Comp. Sci.|92000|CS-319|2|Spring|2010|Image Processing|3
98345|Kim|Elec. Eng.|80000|EE-181|1|Spring|2009|Intro. to Digital Systems|3

join ... using

Automatically equating all the attributes with the same name in both schemas may be too strong. Quite often, we want to do a natural join on a specific subset of the shared attributes. This leads to our join condition grammar: join ... using ($A_1, A_2, \dots, A_n$). The operation requires a list of attribute names to be specified. Both inputs must have attributes with the specified names. Consider the operation $r_1$ join $r_2$ using ( $A_1, A_2$ ). The operation is similar to $r_1$ natural join $r_2$, except that a pair of tuples $t_1$ from $r_1$ and $t_2$ from $r_2$ match if $t_1.A_1 = t_2.A_1$ and $t_1.A_2 = t_2.A_2$; even if $r_1$ and $r_2$ both have an attribute named $A_3$, it is not required that $t_1.A_3 = t_2.A_3$.

An example query looks like

select name, title
from (instructor natural join teaches) join course using (course_id);

join ... on

Another form of join condition is the on condition, which allows a general predicate over the relations being joined. The predicate is written like a where clause predicate except for the use of the keyword on rather than where.

For example, the two queries below are equivalent (i.e. they give the same result)

select * from student join takes on student.ID = takes.ID;

select * from student, takes where student.ID = takes.ID;

Note

The query is almost the same as using student natural join takes, except that the ID column appears twice in the result set.

One question we may ask is why we need the on condition if it seems to work exactly the same as the where clause. There are two reasons:

on conditions behave differently from where conditions when we work with outer join.
A SQL query is often more readable by humans if the join condition is specified in the on clause and the rest of the conditions appear in the where clause.

outer join

Motivation

Let's consider the student and takes tables, which look like this

sqlite> select * from student;
ID|name|dept_name|tot_cred
00128|Zhang|Comp. Sci.|102
12345|Shankar|Comp. Sci.|32
19991|Brandt|History|80
23121|Chavez|Finance|110
44553|Peltier|Physics|56
45678|Levy|Physics|46
54321|Williams|Comp. Sci.|54
55739|Sanchez|Music|38
70557|Snow|Physics|0
76543|Brown|Comp. Sci.|58
76653|Aoi|Elec. Eng.|60
98765|Bourikas|Elec. Eng.|98
98988|Tanaka|Biology|120

sqlite> select * from takes;
ID|course_id|sec_id|semester|year|grade
00128|CS-101|1|Fall|2009|A
00128|CS-347|1|Fall|2009|A-
12345|CS-101|1|Fall|2009|C
12345|CS-190|2|Spring|2009|A
12345|CS-315|1|Spring|2010|A
12345|CS-347|1|Fall|2009|A
19991|HIS-351|1|Spring|2010|B
23121|FIN-201|1|Spring|2010|C+
44553|PHY-101|1|Fall|2009|B-
45678|CS-101|1|Fall|2009|F
45678|CS-101|1|Spring|2010|B+
45678|CS-319|1|Spring|2010|B
54321|CS-101|1|Fall|2009|A-
54321|CS-190|2|Spring|2009|B+
55739|MU-199|1|Spring|2010|A-
76543|CS-101|1|Fall|2009|A
76543|CS-319|2|Spring|2010|A
76653|EE-181|1|Spring|2009|C
98765|CS-101|1|Fall|2009|C-
98765|CS-315|1|Spring|2010|B
98988|BIO-101|1|Summer|2009|A
98988|BIO-301|1|Summer|2010|

Suppose we wish to display a list of all students along with the courses they have taken. We come up with a query that looks like

select *
from student natural join takes;

This query is actually not right for our purpose because it will not show a student who has taken no course. Such a student's ID appears only in the student table, not in the takes table. When we do the natural join, no takes tuple has a matching ID, so the student does not show up in our final result set. An example in our case is student Snow with ID 70557, who has not taken any course.

sqlite> select * from student natural join takes;
ID|name|dept_name|tot_cred|course_id|sec_id|semester|year|grade
00128|Zhang|Comp. Sci.|102|CS-101|1|Fall|2009|A
00128|Zhang|Comp. Sci.|102|CS-347|1|Fall|2009|A-
12345|Shankar|Comp. Sci.|32|CS-101|1|Fall|2009|C
12345|Shankar|Comp. Sci.|32|CS-190|2|Spring|2009|A
12345|Shankar|Comp. Sci.|32|CS-315|1|Spring|2010|A
12345|Shankar|Comp. Sci.|32|CS-347|1|Fall|2009|A
19991|Brandt|History|80|HIS-351|1|Spring|2010|B
23121|Chavez|Finance|110|FIN-201|1|Spring|2010|C+
44553|Peltier|Physics|56|PHY-101|1|Fall|2009|B-
45678|Levy|Physics|46|CS-101|1|Fall|2009|F
45678|Levy|Physics|46|CS-101|1|Spring|2010|B+
45678|Levy|Physics|46|CS-319|1|Spring|2010|B
54321|Williams|Comp. Sci.|54|CS-101|1|Fall|2009|A-
54321|Williams|Comp. Sci.|54|CS-190|2|Spring|2009|B+
55739|Sanchez|Music|38|MU-199|1|Spring|2010|A-
76543|Brown|Comp. Sci.|58|CS-101|1|Fall|2009|A
76543|Brown|Comp. Sci.|58|CS-319|2|Spring|2010|A
76653|Aoi|Elec. Eng.|60|EE-181|1|Spring|2009|C
98765|Bourikas|Elec. Eng.|98|CS-101|1|Fall|2009|C-
98765|Bourikas|Elec. Eng.|98|CS-315|1|Spring|2010|B
98988|Tanaka|Biology|120|BIO-101|1|Summer|2009|A
98988|Tanaka|Biology|120|BIO-301|1|Summer|2010|

More generally, some tuples in either or both of the relations being joined may be "lost" in this way. The outer join operation works in a manner similar to the join operations we studied above, but preserves those tuples that would be lost in a join by creating tuples in the result containing null values.

outer join

There are three forms of outer join:

The left outer join preserves tuples only in the relation named before (to the left of) the left outer join operation.
The right outer join preserves tuples only in the relation named after (to the right of) the right outer join operation.
The full outer join preserves tuples in both relations.

For our example, the actual query should be

select * 
from student natural left outer join takes;

which returns a result that includes student Snow with nulls for the attributes that appear only in the schema of the takes relation

sqlite> select * from student natural left outer join takes;
ID|name|dept_name|tot_cred|course_id|sec_id|semester|year|grade
00128|Zhang|Comp. Sci.|102|CS-101|1|Fall|2009|A
00128|Zhang|Comp. Sci.|102|CS-347|1|Fall|2009|A-
12345|Shankar|Comp. Sci.|32|CS-101|1|Fall|2009|C
12345|Shankar|Comp. Sci.|32|CS-190|2|Spring|2009|A
12345|Shankar|Comp. Sci.|32|CS-315|1|Spring|2010|A
12345|Shankar|Comp. Sci.|32|CS-347|1|Fall|2009|A
19991|Brandt|History|80|HIS-351|1|Spring|2010|B
23121|Chavez|Finance|110|FIN-201|1|Spring|2010|C+
44553|Peltier|Physics|56|PHY-101|1|Fall|2009|B-
45678|Levy|Physics|46|CS-101|1|Fall|2009|F
45678|Levy|Physics|46|CS-101|1|Spring|2010|B+
45678|Levy|Physics|46|CS-319|1|Spring|2010|B
54321|Williams|Comp. Sci.|54|CS-101|1|Fall|2009|A-
54321|Williams|Comp. Sci.|54|CS-190|2|Spring|2009|B+
55739|Sanchez|Music|38|MU-199|1|Spring|2010|A-
70557|Snow|Physics|0|||||
76543|Brown|Comp. Sci.|58|CS-101|1|Fall|2009|A
76543|Brown|Comp. Sci.|58|CS-319|2|Spring|2010|A
76653|Aoi|Elec. Eng.|60|EE-181|1|Spring|2009|C
98765|Bourikas|Elec. Eng.|98|CS-101|1|Fall|2009|C-
98765|Bourikas|Elec. Eng.|98|CS-315|1|Spring|2010|B
98988|Tanaka|Biology|120|BIO-101|1|Summer|2009|A
98988|Tanaka|Biology|120|BIO-301|1|Summer|2010|

We mentioned earlier that on and where behave differently for outer join. Let's consider the following example

select *
from student left outer join takes on true
where student.ID = takes.ID

This gives the following result ²

sqlite> select * from student left outer join takes where student.ID = takes.ID;
ID|name|dept_name|tot_cred|ID|course_id|sec_id|semester|year|grade
00128|Zhang|Comp. Sci.|102|00128|CS-101|1|Fall|2009|A
00128|Zhang|Comp. Sci.|102|00128|CS-347|1|Fall|2009|A-
12345|Shankar|Comp. Sci.|32|12345|CS-101|1|Fall|2009|C
12345|Shankar|Comp. Sci.|32|12345|CS-190|2|Spring|2009|A
12345|Shankar|Comp. Sci.|32|12345|CS-315|1|Spring|2010|A
12345|Shankar|Comp. Sci.|32|12345|CS-347|1|Fall|2009|A
19991|Brandt|History|80|19991|HIS-351|1|Spring|2010|B
23121|Chavez|Finance|110|23121|FIN-201|1|Spring|2010|C+
44553|Peltier|Physics|56|44553|PHY-101|1|Fall|2009|B-
45678|Levy|Physics|46|45678|CS-101|1|Fall|2009|F
45678|Levy|Physics|46|45678|CS-101|1|Spring|2010|B+
45678|Levy|Physics|46|45678|CS-319|1|Spring|2010|B
54321|Williams|Comp. Sci.|54|54321|CS-101|1|Fall|2009|A-
54321|Williams|Comp. Sci.|54|54321|CS-190|2|Spring|2009|B+
55739|Sanchez|Music|38|55739|MU-199|1|Spring|2010|A-
76543|Brown|Comp. Sci.|58|76543|CS-101|1|Fall|2009|A
76543|Brown|Comp. Sci.|58|76543|CS-319|2|Spring|2010|A
76653|Aoi|Elec. Eng.|60|76653|EE-181|1|Spring|2009|C
98765|Bourikas|Elec. Eng.|98|98765|CS-101|1|Fall|2009|C-
98765|Bourikas|Elec. Eng.|98|98765|CS-315|1|Spring|2010|B
98988|Tanaka|Biology|120|98988|BIO-101|1|Summer|2009|A
98988|Tanaka|Biology|120|98988|BIO-301|1|Summer|2010|

Here, left outer join ... on true essentially returns the cartesian product of the two relations. Since there is no tuple in takes with ID = 70557, every time a tuple appears in the outer join with name = "Snow", the values for student.ID and takes.ID must be different, and such tuples are eliminated by the where clause predicate. Thus, student Snow never appears in the result of this query.
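The difference between putting the predicate in on and putting it in where can be sketched with a small in-memory left outer join. This is only an illustration under assumptions: the row types are cut down to two attributes, a null pointer stands in for the SQL nulls, and the names JoinedRow, leftOuterJoin, and whereIdsEqual are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Student { std::string id; std::string name; };
struct Takes { std::string id; std::string courseId; };

// One output row of a left outer join; takes == nullptr plays the role of
// the SQL nulls used to pad an unmatched student.
struct JoinedRow { const Student *student; const Takes *takes; };

// Left outer join with "on student.ID = takes.ID" when matchOnId is true,
// or with "on true" when it is false.
std::vector<JoinedRow> leftOuterJoin(const std::vector<Student> &students,
                                     const std::vector<Takes> &takes,
                                     bool matchOnId) {
  std::vector<JoinedRow> result;
  for (std::size_t i = 0; i < students.size(); ++i) {
    bool matched = false;
    for (std::size_t j = 0; j < takes.size(); ++j) {
      if (!matchOnId || students[i].id == takes[j].id) {
        JoinedRow row = { &students[i], &takes[j] };
        result.push_back(row);
        matched = true;
      }
    }
    if (!matched) {  // preserve the unmatched left tuple, padded with null
      JoinedRow row = { &students[i], nullptr };
      result.push_back(row);
    }
  }
  return result;
}

// "where student.ID = takes.ID", applied after the join. A comparison with
// null is never true, so null-padded rows are filtered out as well.
std::vector<JoinedRow> whereIdsEqual(const std::vector<JoinedRow> &rows) {
  std::vector<JoinedRow> result;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    assert(rows[i].student != nullptr);  // the left side is never null
    if (rows[i].takes != nullptr && rows[i].student->id == rows[i].takes->id)
      result.push_back(rows[i]);
  }
  return result;
}
```

With the predicate in the on clause, Snow survives as a null-padded row; with on true followed by whereIdsEqual, every Snow row pairs with some other student's takes row and is then filtered out.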

Relational algebra perspective

In the previous section, we spent quite some time understanding join from the SQL perspective. In this section, we try to understand join from the relational algebra perspective.

In relational algebra, an operator accepts (one or two) relation instances as arguments and returns a relation instance as the result: $f(R_1, R_2) \to R_3$. We have the following operators:

$\sigma$: select rows from a relation (e.g. $\sigma_{\text{grade} < B}(\text{takes})$)
$\pi$: extract columns from a relation (e.g. $\pi_{\text{ID, name}}(\text{student})$)

Then, for join we have following definitions:

condition join: $R \bowtie_c S = \sigma_c (R \times S)$ (e.g. $\text{student} \bowtie_{\text{student.id < takes.id}} \text{takes}$)

Does this look familiar to you? Yes, this corresponds exactly to the join ... on ... usage.

equijoin: $R \bowtie_c S$ where $c$ consists solely of equalities (e.g. $\text{student} \bowtie_{\text{student.id = takes.id}} \text{takes}$)

This corresponds to the join ... using (...) usage.

natural join: $R \bowtie S$ (an equijoin where equalities are specified on all fields having the same name in $R$ and $S$) (e.g. $\text{student} \bowtie \text{takes}$).

Well, this is exactly the natural join usage.

In fact, this perspective is actually how people explain join to others. There are two excellent pages that offer a graphical explanation of this concept. Their links are attached in the "Links to resources" section.

Examples

Leetcode 175

Now, let's take a look at Leetcode 175, Combine Two Tables, for practice.

The problem asks us to

Write a SQL query for a report that provides the following information (FirstName, LastName, City, State) for each person in the Person table, regardless if there is an address for each of those people.

"regardless if there is an address for each of those people" is a clear indicator that we should use an outer join, because we still want to keep every person even if they do not appear in the Address table. There are several queries we can write

# Solution 1
select FirstName, LastName, City, State from Person natural left outer join Address

# Solution 2
select FirstName, LastName, City, State from Person left outer join Address on Person.PersonId = Address.PersonId

# Solution 3
select FirstName, LastName, City, State from Person left outer join Address using (PersonId)

# Solution 4
select FirstName, LastName, City, State from Address right outer join Person using (PersonId)

# Solution 5
select FirstName, LastName, City, State from Address right outer join Person on Address.PersonId = Person.PersonId

# Solution 6
select FirstName, LastName, City, State from Address natural right outer join Person

We should have no trouble understanding these queries now.
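As a quick sanity check, the sketch below runs the three left outer join variants against an in-memory SQLite database through Python's sqlite3 module. The sample rows and the AddressId column are assumptions made for illustration. Only Solutions 1 to 3 are exercised, because older SQLite releases do not support right outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: person 2 has no address on record.
conn.executescript("""
    create table Person (PersonId int, FirstName text, LastName text);
    create table Address (AddressId int, PersonId int, City text, State text);
    insert into Person values (1, 'Allen', 'Wang');
    insert into Person values (2, 'Bob', 'Alice');
    insert into Address values (10, 1, 'New York City', 'New York');
""")

queries = [
    "select FirstName, LastName, City, State from Person natural left outer join Address",
    "select FirstName, LastName, City, State from Person left outer join Address "
    "on Person.PersonId = Address.PersonId",
    "select FirstName, LastName, City, State from Person left outer join Address using (PersonId)",
]
# Sort by first name so the three result sets can be compared directly.
results = [sorted(conn.execute(q).fetchall(), key=lambda row: row[0]) for q in queries]
```

All three queries should return the same rows, and the person without an address is kept with City and State set to null (None in Python).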

Example Two

Suppose we have two tables t1 and t2 that are created like:

create table t1 (a int, b char(1));
create table t2 (b char(1), c char(10));

Also, t1.b contains 7 'x' and t2.b contains 3 'x'. If we do select * from t1 inner join t2 on t1.b = t2.b:

What are the columns of the result?

The result contains 4 columns: a, b, b, c. Note that one difference between join ... on and natural join or join ... using is that join ... on keeps the same-named columns from both tables (e.g. b in this case), while the other two keep only one copy.

How many rows does the result set contain?

Remember that join is a filtered Cartesian product: we only keep the pairs of rows that agree on the shared attributes. In this example, since there are 7 'x' in t1 and 3 'x' in t2 (and assuming no other value of b appears in both tables), every 'x' row in t1 pairs with every 'x' row in t2, and we will have 7 × 3 = 21 rows in the end.
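Both answers can be checked with a small experiment. The sketch below uses Python's sqlite3 module with an in-memory database; the values stored in a and c are arbitrary placeholders, and the only thing that matters is that t1 has 7 rows with b = 'x' and t2 has 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (a int, b char(1))")
conn.execute("create table t2 (b char(1), c char(10))")
# 7 rows with b = 'x' in t1 and 3 rows with b = 'x' in t2 (other values are placeholders).
conn.executemany("insert into t1 values (?, 'x')", [(i,) for i in range(7)])
conn.executemany("insert into t2 values ('x', ?)", [(f"c{i}",) for i in range(3)])

on_cur = conn.execute("select * from t1 inner join t2 on t1.b = t2.b")
on_columns = [d[0] for d in on_cur.description]  # column names of the join ... on result
on_rows = on_cur.fetchall()

using_cur = conn.execute("select * from t1 inner join t2 using (b)")
using_columns = [d[0] for d in using_cur.description]  # column names of the join ... using result
```

Comparing len(on_columns) with len(using_columns) shows the extra copy of b that join ... on keeps, and len(on_rows) gives the row count.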

Links to resources

Here are some of the resources I found helpful while preparing this article:

Database System Concepts Chapter 3, 4
DB2 10.1 fundamentals certification exam 610 prep, Part 4
Database Management Systems Chapter 4
Code project page and SO post have nice graphic explanation to this concept.
What is natural join in SQLite?

Tables that appear in this post are from the supplementary resources of the Database System Concepts book. ↩
on true is equivalent to student left outer join takes. ↩

The tortoise and the hare

2017-06-18T20:20:00+08:00

Recently, I started working on LeetCode problems. My goal is to solve two problems per day (mission possible, right?). The problems I'm looking at are 142. Linked List Cycle II and 287. Find the Duplicate Number, both of which can be solved by Floyd's Tortoise and Hare algorithm.

This post will try to take a deeper look at the correctness of the algorithm and how to apply it to solve problems.

Introduction

Floyd's Tortoise and Hare algorithm serves three purposes in the context of a linked list:

Detect whether there is a cycle in the list
Find the starting point of the cycle (i.e. list 1->4->3->4, starting point is 4)
Decide the length of the cycle (i.e. 2 for above example)

The idea of the algorithm is as follows:

We use two pointers: tortoise and hare. Both start at the beginning of the list. Hare runs twice as fast as the tortoise.
If there is no cycle, then hare will reach the finish line before the tortoise.
If there is a cycle, then hare will always be ahead and eventually he will be so far ahead that he laps the tortoise. That's how we know we have a cycle in the list.
Once we detect the cycle, we send hare back to the beginning and advance both of them at the same speed until they meet again. The second meeting place, which we'll prove shortly, is the entry point of the cycle.
Then, one of them keeps moving to finish a victory lap, which gives the period of the cycle.

The key difference when the list has a cycle is that at some point on the track, the hare will be at the same spot as the tortoise ...
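The steps above can be sketched in Python for a sequence generated by repeatedly applying a successor function. This is a minimal sketch rather than a reference implementation: the function name `floyd` and the dictionary encoding of the list are my own choices, and the code assumes the sequence does contain a cycle (the no-cycle case, where hare falls off the end of the list, is left out).

```python
def floyd(f, x0):
    """Return (mu, lam): index of the cycle start and the cycle length for the
    sequence x0, f(x0), f(f(x0)), ... (assumed to eventually enter a cycle)."""
    # Phase 1: hare moves two steps per tortoise step until they meet.
    tortoise, hare = f(x0), f(f(x0))
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(f(hare))
    # Phase 2: send hare back to the start; both now move one step at a time.
    mu, hare = 0, x0
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(hare)
        mu += 1
    # Phase 3: hare walks one full lap around the cycle to measure the period.
    lam, hare = 1, f(tortoise)
    while tortoise != hare:
        hare = f(hare)
        lam += 1
    return mu, lam

# The list 1->4->3->4 from the introduction, encoded as a successor map.
successor = {1: 4, 4: 3, 3: 4}
mu, lam = floyd(successor.__getitem__, 1)
```

Here mu counts steps from the first node, so the cycle of the example list starts one step in (at node 4) and has length 2.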

Proof of correctness

Let $\mu$ be the index of the start of the cycle, and let $\lambda$ be the period of the cycle. Let $i$ be the distance (i.e. number of nodes) that tortoise travels and let $x_i$ denote the index of the node at which tortoise and hare meet. $x_0$ is the first node in the list.

Note

The notation here is similar to the concept in physics: distance vs. displacement. $i$ is the "distance" or the number of nodes that our character (tortoise or hare) has travelled since the beginning of the list and $x_i$ is the "displacement" between the first node and the current node that our characters are at.

The key observation for showing the correctness of the algorithm lies in the following fact:

$$ \begin{equation} x_{j+k\lambda} = x_j \text{ for all integers }j \ge \mu \text{ and } k \ge 0 \label{eqn:1} \end{equation} $$

This statement says that going around the loop any number of times takes you back to the same places as long as you start somewhere on the loop. Let's define the following set of notation here for future use:

$y$ be the displacement between $x_{\mu}$ and $x_i$
$m$ be the number of laps that tortoise has travelled before he meets with hare at $x_i$
$n$ be the number of laps that hare has travelled before he meets with tortoise at $x_i$

Since the hare runs twice as fast as the tortoise, then the distance hare travelled when he meets with tortoise is $2i$ ¹. Then, we have the following set of equations

$$ \begin{eqnarray} i = \mu + y + m \cdot \lambda \label{eqn:2} \\ 2i = \mu + y + n \cdot \lambda \label{eqn:3} \end{eqnarray} $$

Now we subtract \ref{eqn:2} from \ref{eqn:3} and we have

$$ \begin{equation} i = (n-m) \cdot \lambda \label {eqn:4} \end{equation} $$

Let's revisit our key observation \ref{eqn:1} and set $j = \mu$ and $k = (n-m)$, we have $x_{\mu + (n-m)\lambda} = x_{\mu}$. Then, by \ref{eqn:4}, we have $x_{\mu + i} = x_{\mu}$, which can be rewritten as $x_{i+\mu} = x_{\mu}$!!! This equation tells us that the node at which the cycle begins (i.e $x_{\mu}$) is exactly the same node as the node that is $\mu$ nodes away from the index at which tortoise and hare meet (i.e. $x_i$).

Note

The proof can be much shorter once we have $x_{2i} = x_i$. Since both indices lie on the cycle and refer to the same node, they must differ by a whole number of laps: $2i = i + k\lambda$ for some integer $k \ge 0$, which leads to $i = k\lambda$. Since $x_\mu$ also meets the condition of \ref{eqn:1}, we have $x_{\mu + k\lambda} = x_\mu$. Substitute $i = k\lambda$ and we get $x_{\mu+i} = x_\mu$. The conclusion follows.
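The two facts used in the proof, namely that the meeting index $i$ is a multiple of $\lambda$ and that $x_{\mu+i} = x_\mu$, can also be checked by brute force. The sketch below is only a numerical sanity check under an assumed node numbering (nodes $0, 1, \dots, \mu + \lambda - 1$ with the last node pointing back to node $\mu$); the helper name `check_identity` is hypothetical.

```python
def check_identity(mu, lam):
    """Build a sequence with tail length mu and cycle length lam, then test the proof's claims."""
    # x(j) is the index of the node visited after j steps:
    # 0, 1, ..., mu + lam - 1, then back to mu, and so on.
    def x(j):
        return j if j < mu else mu + (j - mu) % lam
    # First positive i at which tortoise (x_i) and hare (x_{2i}) coincide.
    i = 1
    while x(i) != x(2 * i):
        i += 1
    return i, i % lam == 0, x(mu + i) == x(mu)

# Try every tail length 0..5 combined with every cycle length 1..5.
outcomes = [check_identity(mu, lam) for mu in range(6) for lam in range(1, 6)]
```

Each entry of outcomes holds the meeting index followed by two booleans, one per claim.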

Two problems

142. Linked List Cycle II

The first problem is a straightforward application of the algorithm: Given a linked list, return the node where the cycle begins. If there is no cycle, return NULL. The code is as follows:

/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     struct ListNode *next;
 * };
 */
struct ListNode *detectCycle(struct ListNode *head) {
    if(head == NULL || head->next == NULL)
        return NULL;

    struct ListNode *tortoise;
    struct ListNode *hare;
    struct ListNode *curr;

    tortoise = hare = curr = head;

    while(hare != NULL && hare->next != NULL)
    {
        hare = hare->next->next;
        tortoise = tortoise->next;
        if (hare == tortoise)  // there is a cycle
        {
            while(curr != tortoise)
            {
                curr = curr->next;
                tortoise = tortoise->next;
            }
            return curr;  // find the entry location
        }
    }
    return NULL; // there is no cycle
}

287. Find the Duplicate Number

This problem is a lot trickier than the previous one: we need to recognize that it can also be solved by Floyd's Tortoise and Hare algorithm, which is not obvious at first glance.

Given an array nums containing $n + 1$ integers where each integer is between $1$ and $n$ (inclusive), prove that at least one duplicate number must exist. Assume that there is only one duplicate number, find the duplicate one. note: There is only one duplicate number in the array, but it could be repeated more than once.

The key point is to identify that the problem description is another way of describing a linked list, which requires somewhat deeper understanding of the algorithm itself.

The algorithm is, in fact, used to find a cycle in a sequence of iterated function values:

$$ x_0, x_1 = f(x_0), x_2 = f(x_1), \dots, x_i = f(x_{i-1}), \dots $$

For example, the sequence $1,3,4,2,1$ can be considered as a sequence of iterated function values with $x_0 = 1, x_1 = f(1) = 3, x_2 = f(3) = 4, x_3 = f(4) = 2, x_4 = f(2) = 1$. Let's try another representation:

index	0	1	2	3	4
value	1	3	4	2	1

Surprisingly, the function $f$ simply maps each index to the corresponding value. With this table, the above sequence can be converted into a linked list:

0 -> 1 -> 3 -> 2 -> 4
     ^              |
     |--------------|

This list is constructed by the definition of a sequence of iterated function values. The arrow (->) is the function $f$. Then, we can apply our algorithm to solve this problem:

int findDuplicate(int* nums, int numsSize) {
  int tortoise;
  int hare;
  tortoise = nums[0];
  hare = nums[nums[0]];
  while (tortoise != hare)
  {
    tortoise = nums[tortoise];
    hare = nums[nums[hare]];
  }
  hare = 0;
  while (tortoise != hare)
  {
    tortoise = nums[tortoise];
    hare = nums[hare];
  }
  return hare;
}

This implementation deviates slightly from the algorithm description above:

Tortoise and hare don't seem to start from the same place. In fact, both conceptually start at index 0, and the initialization already performs the first move: tortoise takes one step (nums[0]) and hare takes two (nums[nums[0]]). Hare has therefore still travelled exactly twice as far as tortoise when they meet, which is what the proof needs.
We use tortoise = nums[tortoise] instead of tortoise++ to advance tortoise, for example. This is where the "sequence of iterated function values" idea appears. In fact, this is also how we constructed our linked list.
hare = 0 rather than hare = nums[0] can be confusing. We can think about this from our linked list representation: our list starts from $0$ (required by $f$, which maps index to value), so sending hare back to the beginning means index 0. Starting from hare = nums[0] would skip the first node and violate our algorithm.
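For readers who want to experiment, here is a direct Python transcription of the C solution above, run on the example array from the table. The function name `find_duplicate` is my own; the logic is unchanged.

```python
def find_duplicate(nums):
    """Python transcription of the C solution; nums holds n + 1 integers in 1..n."""
    # First move: tortoise takes one step from index 0, hare takes two.
    tortoise, hare = nums[0], nums[nums[0]]
    while tortoise != hare:
        tortoise = nums[tortoise]
        hare = nums[nums[hare]]
    # Send hare back to the head of the list (index 0) and move both at the same speed.
    hare = 0
    while tortoise != hare:
        tortoise = nums[tortoise]
        hare = nums[hare]
    return hare

duplicate = find_duplicate([1, 3, 4, 2, 1])
```

On the example array, the walk is 0 -> 1 -> 3 -> 2 -> 4 -> 1, so the entry point of the cycle, and hence the duplicate, is 1.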

$x_{2i} = x_i$ immediately follows this statement. Then, if we let $l$ be the number of laps by which hare is ahead, then $2i = i + l \cdot \lambda$ and we have $i = l\lambda$. Then we set $k=l$ in \ref{eqn:1} and reach the same conclusion. This way we don't need to define $y$, $m$, $n$, which makes the proof a little simpler notation-wise. ↩

Python case study: leetcode scraper

2017-06-15T21:22:00+08:00

It has been many years since I last touched Python, and things have gotten very rusty. Recently, I have been practicing my algorithm skills on leetcode and I keep all my solutions in a github repo. I want my source files to have the consistent formatting shown below

/*
 * [Source]
 * 
 * https://leetcode.com/problems/same-tree/
 *
 * [Problem Description]
 *
 * Given two binary trees, write a function to check if they are equal or not.
 * 
 * Two binary trees are considered equal if they are structurally identical 
 * and the nodes have the same value. 
 *
 * [Companies]
 */

 // Source code begins here ...

The overhead of adding this header comment can be quite large. So, I asked myself if there is a way to automate the whole process as much as possible. Python and its famous beautifulsoup library ¹ immediately came to mind.

In this post, I'll highlight some Python usage that appears in the script and that cost me quite some time googling. Please leave a comment if you find any non-pythonic usage. The script is available here. I'll use the 92. Reverse Linked List II leetcode page as a working example to demonstrate the Python techniques.

#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-

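The very first line is the "shebang", which tells the shell which interpreter to run the script with. The second line is the source encoding declaration. It matters for our task because web pages often contain Unicode characters (i.e. mathematical symbols), and declaring UTF-8 helps us avoid the unicode & ascii madness. (Python 3 already assumes UTF-8 source files by default, so there the declaration mostly serves as documentation.)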

from bs4 import BeautifulSoup
import requests
import sys

We use a lot of libraries through import. If I use import module, I have to qualify every call with the module name (i.e. sys.exit()). By contrast, I can call the function directly if I do from module import. This raises the question of when to use which. Here, I want to quote the explanation from Dive Into Python

When should you use from module import?

If you will be accessing attributes and methods often and don't want to type the module name over and over, use "from module import".

If you want to selectively import some attributes and methods but not others, use "from module import".

If the module contains attributes or functions with the same name as ones in your module, you must use "import module" to avoid name conflicts.

The author makes extra remark: Use from module import * sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes debugging and refactoring more difficult.

script, url = sys.argv
print('url is {:s}'.format(url))

I used to really like Python 2.7 and was not a big fan of Python 3. However, with Python 2.7 reaching end of support, a change had to be made. The print() call with str.format above is how we do formatted printing in Python 3.

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")

Here, I use the requests library to fetch the content of the url and then feed it into BeautifulSoup with the lxml parser.

The next step is to actually scrape the data from the leetcode page. The first thing I do is to get the question title. The leetcode page has the following structure for the question title

   <div class="question-title clearfix">
      <div class="row">

        <div class="col-lg-4 col-md-5 col-sm-6 col-sm-push-6 col-md-push-7 col-lg-push-8" id="widgets">
          <div class="like-and-dislike">
            <div id="question-like"></div>
          </div>
          <div class="add-to-list">
            <div id="add-to-favorite"></div>
          </div>
        </div>
        <div class="col-lg-8 col-md-7 col-sm-6 col-sm-pull-6 col-md-pull-5 col-lg-pull-4">
          <h3>
            92. Reverse Linked List II
          </h3>
        </div>

      </div>
    </div>

As you can see, the question title ("92. Reverse Linked List II") is wrapped by the <div> tag with class name question-title.

title_corp = soup.find_all("div", class_="question-title")
title_raw = title_corp[0].h3.get_text()

So, we invoke the find_all method from beautiful soup to find all the <div></div> tags with class name question-title. Fortunately, the question-title class appears only once in the whole html page. That allows us to access it directly using title_corp[0]. In addition, as you can see from the html source code above, <h3></h3> appears only once and it wraps our problem title. So, we can directly access the content of the <h3></h3> tags by title_corp[0].h3.get_text().

Note

find_all returns a "ResultSet" object in beautifulSoup. This object contains the set of tags that match the criteria given as find_all arguments. In our case, the criterion is a <div></div> tag with class name question-title.

Now, once we have the title string, we want to process it into our desired form. Our scraper script will go into the leetcode directory of the shuati repo and create the question directory with the format "[question number]-[question title in mixed case with the first letter of each internal word capitalized]". For example, "92. Reverse Linked List II" will lead to a directory ./leetcode/92-ReverseLinkedListII. The source file name is similar to the directory name: reverseLinkedListII.c. That's what the following code chunk tries to achieve

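import operator
import os

# Drop blank lines, then turn "92. Reverse Linked List II" into "92-ReverseLinkedListII".
title_lines = title_raw.split('\n')
title_lines = list(filter(operator.methodcaller('strip'), title_lines))
title_rdy = title_lines[0].lstrip(' ').replace(".", "-").split(' ')
title = "".join(title_rdy)

# Create the per-question directory.
path = "./leetcode/" + title
os.mkdir(path)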

title_lines = title_raw.split('\n') splits the whole text into a list of strings, one per line of text. In our case, this gives ['', ' 92. Reverse Linked List II', ' '].

As you can see, our result contains an empty string, a string with multiple leading whitespaces, and a string with only whitespaces. We need to do some cleanup to keep only the question title. The first thing we do is to take out the empty string and the whitespace-only string. This is done by title_lines = list(filter(operator.methodcaller('strip'), title_lines)) ². filter keeps the elements for which a function (the 1st argument of filter) returns a true value; in Python 3 it returns an iterator, which is why we wrap it in list(). operator.methodcaller('strip') builds a function that calls strip() on its argument, so filter ends up calling strip() on each element of title_lines. The stripped string counts as true only when it still has some characters in it. This leads to [' 92. Reverse Linked List II'].

Note

Here is an example of methodcaller: After f = methodcaller('name', 'foo', bar=1), the call f(b) returns b.name('foo', bar=1). In our case, filter applies operator.methodcaller('strip') to each element of title_lines, which is basically line.strip() for every line in the list.

Now, we will work on our title string. title_rdy = title_lines[0].lstrip(' ').replace(".", "-").split(' ') removes the leading whitespace (lstrip(' ')), replaces . with -, and then splits our string into words: ['92-', 'Reverse', 'Linked', 'List', 'II']. We are ready to form our directory name by joining the words together (title = "".join(title_rdy)) and get 92-ReverseLinkedListII.
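The whole title pipeline can be tried in isolation without fetching any page. In the sketch below, title_raw is a hand-written stand-in for what get_text() returns; the exact amount of whitespace on the real page is an assumption.

```python
import operator

# Assumed shape of the text returned by get_text(); real whitespace may differ.
title_raw = "\n            92. Reverse Linked List II\n          "

title_lines = title_raw.split("\n")
# Keep only the lines that are non-empty after stripping whitespace.
non_blank = list(filter(operator.methodcaller("strip"), title_lines))
# Strip leading spaces, turn "92." into "92-", then split into words.
title_rdy = non_blank[0].lstrip(" ").replace(".", "-").split(" ")
title = "".join(title_rdy)
```

Printing the intermediate variables one by one is a convenient way to see what each method call contributes.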

Our file name should look like reverseLinkedListII.c. This involves the use of a regular expression to get rid of 92- and to convert the first character of the rest of the string into lower case. The code is below

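import re

extension = ".c"
# Match the leading question number and the dash, e.g. "92-".
pat = re.compile(r"^(\d+)-")
m = re.search(pat, title)
# Cut the matched prefix out, then lower the first character.
filename=title[:m.start()] + title[m.end():]
filename=filename[0].lower() + filename[1:]
target = open(path+"/"+filename+extension, "w")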

The regular expression is best illustrated by a snippet taken from the re library documentation

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

^ matches the beginning of the string, \d matches a numeric digit, and + means one or more occurrences (of \d). Just like the official doc snippet above, filename=title[:m.start()] + title[m.end():] removes, for instance, 92- and leaves us with ReverseLinkedListII ³. One thing to notice is that filename has type str, which is immutable. This means that we cannot edit the string in place. filename=filename[0].lower() + filename[1:] is a typical way to handle an immutable str object: in our case, it lowers the case of the first character and concatenates it with the rest of the string.
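The file name step can be tried on its own as well. The sketch below hard-codes the directory name from the working example instead of scraping it, and it skips the file creation; the variable name `stem` is introduced here only for readability.

```python
import re

title = "92-ReverseLinkedListII"  # directory name from the working example

m = re.search(r"^(\d+)-", title)
# Cut the matched prefix out of the string, as in the email example from the re docs.
stem = title[:m.start()] + title[m.end():]
# str is immutable, so build a new string with the first character lowered.
filename = stem[0].lower() + stem[1:] + ".c"
```

Note that re.search returns None when the title does not start with digits followed by a dash, so a more defensive script would check m before using it.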

The last point worth noticing is line = line.replace("\r", "").replace("\n", ""), which removes the carriage return character (^M) and the Linux newline character.

That's it for the leetcode scraper. This is actually the first scraper I have ever written, and it was not as hard as I imagined. I think that's mainly thanks to the powerful Python language and its libraries.

Here is a good tutorial on beautifulSoup. ↩
This line is found from this SO post. ↩
I do a quick summary of regular expression in python. ↩

Solving recurrence relations (part 2)

2017-06-12T17:20:00+08:00

Several months ago, I briefly summarized the ways to solve recurrence relations. At the end of that post, I indicated that different types of recurrence relations may require different treatments. Thus, this post is the first "Downloadable Content (DLC)", with the aim of solving the recurrence relation: $T(N) = 2T(N/2) + N$.

This recurrence relation comes from merge sort, and the algorithm itself represents a classic divide-and-conquer strategy: in order to sort $N$ elements, we sort each half of $N/2$ elements first (i.e., divide the problem into smaller problems and solve them recursively), and then we merge the two sorted halves back into one sorted array of $N$ elements (i.e., we patch together the answer in the conquer phase).

The exact recurrence relation we try to solve is the following, with the assumption that $N$ is a power of 2:

$$ \begin{eqnarray*} T(1) &=& 1 \\ T(N) &=& 2T(N/2) + N \end{eqnarray*} $$

There are two ways to solve this recurrence relation:

Method 1: Construct a telescoping sum

The goal of this method is to construct a telescoping sum (i.e. see telescoping series to get a sense of telescoping) with the aim of finding a relation between $T(N)$ and $T(1)$ (or the base cases, in general).

Let's work through our example above to demonstrate this method. We divide the recurrence relation through by $N$ and repeatedly do so for every possible $N$ (i.e. $N, N/2, N/4, \dots, 2$) and see what we get:

$$ \begin{eqnarray*} \frac{T(N)}{N} &=& \frac{T(N/2)}{N/2} + 1 \\ \frac{T(N/2)}{N/2} &=& \frac{T(N/4)}{N/4} + 1 \\ \frac{T(N/4)}{N/4} &=& \frac{T(N/8)}{N/8} + 1 \\ \vdots \\ \frac{T(2)}{2} &=& \frac{T(1)}{1} + 1 \\ \end{eqnarray*} $$

We add up all the equations: we add all of the terms on the left-hand side and set the result equal to the sum of all of the terms on the right-hand side. This leads to a telescoping sum: all the terms that appear on both sides get cancelled. For example, the term $T(N/2)/(N/2)$ appears on both sides and thus cancels. After everything is added, the final result is:

$$ \frac{T(N)}{N} = \frac{T(1)}{1} + \log N \cdot 1 $$

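because all of the other terms cancel and there are $\log N$ equations (with $\log$ taken base 2), and so all the $1$s at the end of these equations add up to $\log N$. Multiplying through by $N$ and using $T(1) = 1$ gives $T(N) = N \log N + N$.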

Note

For this recurrence relation, it is necessary to divide through by $N$ in order to get a telescoping sum. However, how to construct the telescoping sum varies case by case. For instance, for a recurrence relation $NT(N) = (N+1)T(N-1) + 2cN$, we need to divide by $N(N+1)$. For a recurrence relation $T(N) = T(N-1) + cN$ ¹, we don't need to do any division. We just need to apply the recurrence relation repeatedly for different $N$ to construct the telescoping sum (i.e. $T(N-1) = T(N-2) + c(N-1)$, $T(N-2) = T(N-3) + c(N-2)$, and so on.)

Method 2: Iteratively substitute

For this method, we continually substitute the recurrence relation into the right-hand side, hoping to find a pattern that reveals the general solution.

We have

$$ \begin{eqnarray*} T(N) &=& 2T(N/2) + N \\ T(N/2) &=& 2T(N/4) + N/2 \end{eqnarray*} $$

Then, we substitute the second equation back into the first equation's right-hand side and we get:

$$ \begin{eqnarray} T(N) &=& 2(2T(N/4)+N/2) + N \nonumber \\ &=& 4T(N/4) + 2N \label{eqn:1} \end{eqnarray} $$

Now, we apply the recurrence to $T(N/4)$ (i.e. $T(N/4) = 2T(N/8) + N/4$), substitute it into the equation above, and we see that

$$ \begin{eqnarray} T(N) &=& 4(2T(N/8)+N/4) + 2N \nonumber \\ &=& 8T(N/8) + 3N \label{eqn:2} \end{eqnarray} $$

We can continue this substitution, and if we compare \ref{eqn:1} and \ref{eqn:2} we obtain the following pattern:

$$ T(N) = 2^kT(N/2^k) + k \cdot N $$

Using $k = \log N$ (so that $N/2^k = 1$), we obtain

$$ T(N) = NT(1) + N \log N = N\log N + N $$
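As a quick numerical check of the closed form, the sketch below evaluates the recurrence directly and compares it with $N \log N + N$ for the first few powers of 2. The function names are my own, and $\log$ is taken base 2, as the derivation assumes.

```python
def T(n):
    """T(1) = 1 and T(N) = 2T(N/2) + N, for N a power of 2."""
    return 1 if n == 1 else 2 * T(n // 2) + n

def closed_form(n):
    """N * log2(N) + N, using the exact integer log of a power of 2."""
    k = n.bit_length() - 1  # log2(n) when n is a power of 2
    return n * k + n

# Each entry is (N, value from the recurrence, value from the closed form).
checks = [(n, T(n), closed_form(n)) for n in (2 ** k for k in range(11))]
```

For every entry of checks, the second and third components should agree.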

This recurrence relation is actually a linear nonhomogeneous recurrence relation with constant coefficients. However, it cannot be solved by the method I wrote up in the last post. I have no clue why. This recurrence relation is taken from MAW p243. ↩

Draw a Neural Network through Graphviz

2017-05-25T22:20:00+08:00

Preface

Graphviz is a language (called DOT) and a set of tools to automatically generate graphs. It is widely used by researchers to do visualizations in papers. Essentially, you just need to provide a textual description of the graph's topological structure (i.e. what the nodes are, how they are connected, etc) and Graphviz will figure out the layout of the image by itself. Usually, the generated layout works out well, but quite often, as this post mentioned, it can be a "finicky beast". So, I decided to share some tips I learned about Graphviz.

Specifically, in this post, I'll demonstrate how we can draw the Neural Network shown in the last post and use this as an example to show some tricks in Graphviz to tweak the layout ¹. Let's get started!

Draw a neural network

If you do a quick search for "graphviz neural network example", you'll very likely see the picture below:

This is probably the simplest Graphviz demonstration on Neural Network. The code for this picture can be obtained here.

However, when I was preparing my last post, I was not quite satisfied with the example above. I wanted to clearly label all the nodes in all layers and distinguish among feature inputs, bias terms, hidden units, and output units. So, I decided to draw one on my own.

Here is the code that generates the picture below ². Let me briefly highlight some key points in the code:

rankdir = LR;
splines=false;
edge[style=invis];

rankdir=LR makes the directed graphs drawn from left to right.
splines=false controls how the edges are represented and in this case, edges are drawn as line segments.
edge[style=invis] forces edges to become invisible. This is a common trick to tweak graphviz layout.

{
  node [shape=circle, color=yellow, style=filled, fillcolor=yellow];
  x0 [label=<x<sub>0</sub>>]; 
  a02 [label=<a<sub>0</sub><sup>(2)</sup>>]; 
  a03 [label=<a<sub>0</sub><sup>(3)</sup>>];
}

node[...] sets the default node property: specify the node shape, node color. This node property will apply to three nodes: x0, a02, a03.
x0 [label=<x<sub>0</sub>>] specify the text label for node x0. The text for label is specified in HTML-like and this is how we write subscript and superscript in Graphviz.
{...} specifies the scope of the node property. This code chunk as a whole shows how we can specify several nodes at once with the same node property ³.

{
  rank=same;
  x0->x1->x2->x3;
}

rank=same is another trick I'll talk about later. This specifies what "layer" (or "rank" in the official term) a set of nodes belongs to. You can read the official doc for the details.
x0->...->x3 specifies the relative position of the four nodes. Since the graph is arranged from left to right (indicated by rankdir = LR), each "layer" is vertical. Then by x0->...->x3, the first node will be x0, followed by x1, and so on. Also, we have edge[style=invis], which hides the edges among these four nodes. This is how we draw the NN layers.

a02->a03;

This line is used to prevent tilting of the graph. As you can see, we specify how the nodes should be arranged within a layer, but we put little constraint on how the layers should be positioned except rankdir=LR, which says layers should be ordered from left to right. a02->a03 says the layer with a02 should be lined up with the layer with a03.

l0 [shape=plaintext, label="layer 1 (input layer)"];
l0->x0;
{rank=same; l0;x0};

This code chunk is how we add label text to each layer. As you can see, we use another node l0 with shape plaintext, which says l0 is just a text message. Then we connect it with x0, the first node of layer 1, which attaches the text to layer 1.

edge[style=solid, tailport=e, headport=w];

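We specify the edge style again. This only affects the edges declared after this line, not the ones before. One small trick here is tailport=e, headport=w: every edge leaves from the east side of its tail node and enters at the west side of its head node, so all the edges going into a node point to the same position.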

{x0; x1; x2; x3} -> {a12;a22;a32;a42;a52};
{a02;a12;a22;a32;a42;a52} -> {a13;a23;a33;a43;a53};
{a03;a13;a23;a33;a43;a53} -> {O1,O2,O3,O4};

This code chunk is how we actually draw the edges. The simple example above explicitly draws each edge between two nodes, which is quite a pain to do. The code chunk above provides a simpler way to achieve the same purpose.
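To see how much typing the brace shorthand saves, the sketch below spells out in Python the explicit edge statements that {x0; x1; x2; x3} -> {a12;a22;a32;a42;a52} stands for. It only generates DOT text, and the helper name `expand` is made up for illustration.

```python
from itertools import product

def expand(tails, heads):
    """Return the explicit 'tail -> head;' statements that '{tails} -> {heads}' stands for."""
    return [f"{t} -> {h};" for t, h in product(tails, heads)]

input_layer = ["x0", "x1", "x2", "x3"]
hidden_layer = ["a12", "a22", "a32", "a42", "a52"]
# One edge per (input node, hidden node) pair.
edges = expand(input_layer, hidden_layer)
```

Every node on the left is connected to every node on the right, so this single DOT line replaces len(input_layer) * len(hidden_layer) hand-written edges.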

Graphviz tricks

From our NN drawing example, there are two recurring tricks when we tweak Graphviz picture layout:

Invisible nodes/edges
Rank constraints

Invisible nodes/edges

In the above example, we use invisible edges to specify the ordering of nodes within each NN layer. In addition, we use node with plaintext shape to specify the text label in the layer.

Usually, we use invisible edges to specify what nodes should line up and sometimes we use invisible nodes to take up space to keep the graph in a specific structure. This SO post demonstrates how we can use invisible nodes and edges in combination to create a fancy picture. This SO post is another example to show how to use "invisible edges" (it uses another trick called group attribute).

Rank constraints

If you check official doc, here is what rank does:

Rank constraints on the nodes in a subgraph. If rank="same", all nodes are placed on the same rank. If rank="min", all nodes are placed on the minimum rank. If rank="source", all nodes are placed on the minimum rank, and the only nodes on the minimum rank belong to some subgraph whose rank attribute is "source" or "min". Analogous criteria hold for rank="max" and rank="sink". (Note: the minimum rank is topmost or leftmost, and the maximum rank is bottommost or rightmost.)

Let's demonstrate this description with a simple example ⁴:

digraph G
{
  {rank=source; a->b;}
  {rank=same; c->d;}
}

This example gives a graph with two rows. a->b is above c->d. However, if I change {rank=source; a->b;} to {rank=min; a->b;}, we'll end up with one row: a->b will be to the left of c->d. This is due to the difference between min and source: min allows other subgraphs in the minimum rank. However, source only allows other subgraphs of min or source to be on the minimum rank (we have same in this case).

sink and max work similarly. For instance, the example below gives a picture with c->d at the top and a->b at the bottom:

digraph G{
  {rank=sink; a->b;}
  {rank=same; c->d;}
}

Of course, Graphviz is not the only tool that can produce beautiful pictures. TikZ is another popular tool. You can check out its NN example for comparison. ↩
Technically, the code used to generate the blog NN picture is this one but the code I explained above is much more concise. ↩
Check out this SO post for more examples on grouping nodes with the same attributes. ↩
The example is adapted from this SO post. ↩

Andrew Ng's ML Week 04 - 05

2017-05-23T22:20:00+08:00

Weeks 4 and 5 mainly talk about one important learning technique called "Neural Networks". It is especially helpful when there are many features and hence many feature combinations for linear or logistic regression. Interestingly, I studied neural networks previously when I was a student at college. It may feel different when we revisit an old friend.

Model
- Representations
- Train a neural network
Implementation details

Model

Representations

The picture below shows a typical neural network (I'll use NN as a shorthand).

$L =$ total number of layers in network (i.e. $L = 4$ for the above NN)
$S_l =$ number of units (not counting bias unit) in layer $l$ (i.e., $S_1 = 3$, $S_2 = S_3 = 5$, $S_4 = S_L = 4$)
$a_i^{(l)} =$ "activation" of unit $i$ in layer $l$. In fact, input features $x_0, x_1, x_2, x_3$ can also be represented as $a_0^{(1)}, a_1^{(1)}, a_2^{(1)}, a_3^{(1)}$ respectively.
$\Theta^{(l)} =$ matrix of weights controlling function mapping from layer $l$ to layer $l+1$. For example,

$$ \Theta^{(1)} = \begin{bmatrix} \theta_{10}^{(1)} & \theta_{11}^{(1)} & \theta_{12}^{(1)} & \theta_{13}^{(1)} \\ \theta_{20}^{(1)} & \theta_{21}^{(1)} & \theta_{22}^{(1)} & \theta_{23}^{(1)} \\ \vdots & \vdots & \vdots & \vdots \\ \theta_{50}^{(1)} & \theta_{51}^{(1)} & \theta_{52}^{(1)} & \theta_{53}^{(1)} \end{bmatrix} $$

Note

Notation here may look confusing. One example to help understand is that $\theta_{10}^{(1)}$ means the weight from $x_0$ in layer $1$ to $a_1$ in layer $2$. In other words, $\theta_{ji}^{(l)}$ means the weight from $a_i^{(l)}$ to $a_j^{(l+1)}$. Then the rows in the matrix can be thought of as the weights from neurons in layer $l$ to the corresponding $a_j$ in layer $l+1$ (i.e., the 1st row in the above example means the weights from layer $1$ to $a_1$ in layer $2$). Explicitly, the number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows is equal to the number of nodes in the next layer (excluding the bias unit).

$K =$ number of neurons in the output layer (i.e. $S_L = K$). In other words, $K$ represents the number of classes in multi-class classification. This indicates that $h_\theta(x) \in \mathbb{R}^K$.

Note

Usually, in our training set {$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$}, we are given the actual labels (e.g. $y^{(9)} = 10$ for handwritten digit recognition). However, we need to transform those labels into vectors in $\mathbb{R}^K$. For instance, we represent $y^{(9)} = 10$ as a vector in $\mathbb{R}^{10}$ whose last entry is $1$ and whose other entries are $0$.
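A minimal sketch of this label transformation, written in C rather than the course's Octave. The function name `one_hot` and the convention that labels run from 1 to K are assumptions made here for illustration:

```c
#include <assert.h>

/* Turn a class label in 1..K into a K-dimensional indicator vector:
 * out[label-1] = 1 and every other entry = 0.
 * The name one_hot and the 1-based labels are illustrative assumptions. */
void
one_hot(int label, int K, double out[])
{
  int k;
  assert(label >= 1 && label <= K);
  for (k = 0; k < K; k++)
    out[k] = (k == label - 1) ? 1.0 : 0.0;
}
```

For the digit example, `one_hot(10, 10, v)` sets only the last entry `v[9]` to 1.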

With the above notations, we have the following property:

If NN has $S_l$ units in layer $l$, $S_{l+1}$ units in layer $l+1$, then $\Theta^{(l)}$ will be dimension $S_{l+1} \times (S_l + 1)$. $+1$ comes from the bias unit (shown in yellow in above NN picture).
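As a quick check, applying this property to the NN above (with $S_1 = 3$, $S_2 = S_3 = 5$, $S_4 = 4$) gives

$$ \Theta^{(1)} \in \mathbb{R}^{5 \times 4}, \quad \Theta^{(2)} \in \mathbb{R}^{5 \times 6}, \quad \Theta^{(3)} \in \mathbb{R}^{4 \times 6} $$

so the network has $20 + 30 + 24 = 74$ weights in total.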

Train a neural network

1. Pick a network architecture

The first step is to pick a network architecture, that is, the connectivity pattern between neurons. Prof. Ng says a reasonable default is to either have $1$ hidden layer or, if there is $>1$ hidden layer, have the same number of hidden units in every layer. Usually, the more hidden units the better.

2. Randomly initialize weights

Zero initialization (i.e. $\theta_{ij}^{(l)} = 0$ for all $i,j,l$) is considered bad for NN because the activation outputs and gradients of all units in a layer will be identical, so the network essentially computes only one feature. That's why we need to randomly initialize the weights for symmetry breaking.

One effective strategy is to randomly select values for $\theta_{ij}^{(l)}$ uniformly in the range [$-\epsilon_\text{init}$, $\epsilon_\text{init}$]. We can choose $\epsilon_\text{init}$ based upon the number of units in the network. A good choice of $\epsilon_\text{init}$ is $\epsilon_\text{init} = \frac{\sqrt{6}}{\sqrt{L_\text{in} + L_\text{out}}}$, where $L_\text{in} = S_l$ and $L_\text{out} = S_{l+1}$ are the numbers of units in the layers adjacent to $\Theta^{(l)}$. Taking the above NN as an example, our $\epsilon_\text{init}$ will be about $0.87$, which is calculated from $\frac{\sqrt{6}}{\sqrt{3+5}}$. ¹
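Here is a rough sketch of the uniform initialization in C, assuming $\epsilon_\text{init}$ has already been computed and is passed in as `eps`. The function name `rand_init_weights` and the row-by-row storage of the matrix are my own choices for illustration, not the assignment's code:

```c
#include <assert.h>
#include <stdlib.h>

/* Fill the rows x cols weight matrix theta (stored row by row) with values
 * drawn uniformly from [-eps, eps]. rand() is used only to keep the sketch
 * short; it is not a high-quality generator. */
void
rand_init_weights(double theta[], int rows, int cols, double eps)
{
  int i;
  assert(eps > 0.0);
  for (i = 0; i < rows * cols; i++)
  {
    double u = (double) rand() / RAND_MAX; /* u is in [0, 1] */
    theta[i] = u * 2.0 * eps - eps;
  }
}
```

For $\Theta^{(1)}$ of the NN above, the call would be `rand_init_weights(theta, 5, 4, 0.87)`.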

3. Forward propagation

The next step we need to do is to use forward propagation to get $h_\theta(x^{(i)})$ for any $x^{(i)}$. Let's use above NN as an example to demonstrate how forward propagation is done. There are $4$ output units in the output layer and thus, our $h_\theta(x^{(i)})$ looks like

$$ h_\theta(x^{(i)}) = \begin{bmatrix} a_1^{(4)} \\ a_2^{(4)} \\ a_3^{(4)} \\ a_4^{(4)} \\ \end{bmatrix} $$

The general idea of forward propagation is that we take the input from the previous layer, multiply it with our weights, and then apply the sigmoid function to get the activation values for the current layer. We start with the input layer and do this iteratively until we get to the output layer, whose activation values are our $h_\theta(x^{(i)})$.

Concretely, let's first represent our input layer (with bias term) as $x$ and define a new variable $z^{(j)}$ as following:

$$ \begin{align*} & x = \begin{bmatrix} x_0 \\ x_1 \\ \dots \\ x_n \end{bmatrix} && z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \dots \\ z_n^{(j)} \end{bmatrix} \end{align*} $$

Then, we can calculate the activation value $a^{(j)}$ for layer $j$ as follows (treating $x = a^{(1)}$):

Add bias term $a_0^{(j-1)} = 1$ to $a^{(j-1)}$ and our new $a^{(j-1)}$ looks like

$$ a^{(j-1)} = \begin{bmatrix} a_0^{(j-1)} \\ a_1^{(j-1)} \\ \dots \\ a_n^{(j-1)} \end{bmatrix} $$
Calculate $z^{(j)}$ as follows:

$$ z^{(j)} = \Theta^{(j-1)}a^{(j-1)} $$

Here, $\Theta^{(j-1)}$ has dimension $S_j \times (S_{j-1} + 1)$ and $a^{(j-1)}$ has dimension $(S_{j-1} + 1) \times 1$. Then, our vector $z^{(j)}$ has height $S_j$.
We get a vector of our activation nodes for layer $j$ as follows:

$$ a^{(j)} = g(z^{(j)}) $$

We repeat these three steps and get $h_\theta(x^{(i)})$, which in our NN is the activation value $a^{(4)}$ for $i$-th training example.

One key intuition for forward propagation is that the whole process is just like logistic regression except that rather than using the original features $x_1, x_2, \dots, x_n$, it uses new features $a^{(L-1)}$, which are learned by the NN itself.

4. Cost function $J(\theta)$

Now we need to compute the cost function $J(\theta)$ of the NN in order to minimize the classification error on the given data. Since NN shares a lot of similarity with logistic regression, it's not hard to imagine that the NN's cost function $J(\theta)$ has a similar form to the logistic regression's cost function:

$$ J(\theta) = - \frac{1}{m} [ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log h_\theta(x^{(i)})_k + (1 - y_k^{(i)}) \log(1-h_\theta(x^{(i)})_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}(\theta_{ji}^{(l)})^2 $$

Here, $h_\theta(x^{(i)})_k$ means the $k$th output in the output layer. The second part of the equation sums over all the weights $\theta_{ji}^{(l)}$ except those of the bias units (i.e. those with $i=0$).

5. Backpropagation

Once we have the cost function, our next step is to find the derivative terms $\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}}$ for every $i,j,l$ so that we can use one of Octave's built-in methods (e.g. fminunc) to minimize $J(\theta)$ as a function of $\theta$. We use backpropagation to do this.

The intuition for backpropagation is the following: given a training example $(x^{(i)}, y^{(i)})$, we first run forward propagation to compute all the activations throughout the network, including the output units. Then, for each node $j$ in layer $l$, we would like to compute an "error term" $\delta_j^{(l)}$ that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta_j^{(L)}$. For the hidden units, we can compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l+1)$.

Here is the algorithm in detail:

Given training set {$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$}
Set $\Delta_{ij}^{(l)} = 0$ (for all $i,l,j$)
For i=1:m,
1. perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
2. using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
3. compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}).\ast a^{(l)}.\ast (1-a^{(l)})$
4. $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ (Vectorized form is $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)
$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\theta_{ij}^{(l)} \text{ if } j \ne 0$ and $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} \text{ if } j = 0$.
$\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = D_{ij}^{(l)}$

Intuitively, the backpropagation algorithm is a lot like forward propagation running backward. We can then use gradient descent or an advanced optimization method to try to minimize $J(\theta)$ as a function of parameters $\theta$ ².

Note

Notice that we don't compute $\delta^{(1)}$ because $\delta^{(1)}$ is associated with the input layer, which holds the features we observed from the training examples. So, there is no "error" involved. In addition, $.\ast$ means element-wise multiplication in Octave.
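To make step 3 of the algorithm concrete, here is a hedged C sketch of the error-term computation for one hidden layer. The function name `hidden_delta`, the row-by-row storage of $\Theta^{(l)}$, and the choice to skip the bias column inside the loop (instead of computing the bias entry and discarding it) are assumptions for illustration:

```c
#include <assert.h>

/* Compute the error terms of layer l from those of layer l+1:
 *   delta[j] = (sum_i theta[i][j] * delta_next[i]) * a[j] * (1 - a[j])
 * theta is the S_next x (S_cur + 1) matrix Theta^(l), stored row by row;
 * column 0 holds the bias weights and is skipped, so no error term is
 * produced for the bias unit. a[] holds the S_cur activations of layer l
 * (without the bias unit). */
void
hidden_delta(const double theta[], int S_next, int S_cur,
             const double delta_next[], const double a[], double delta[])
{
  int i, j;
  assert(S_next > 0 && S_cur > 0);
  for (j = 0; j < S_cur; j++)
  {
    double back = 0.0; /* entry j+1 of Theta^T * delta_next */
    for (i = 0; i < S_next; i++)
      back += theta[i * (S_cur + 1) + (j + 1)] * delta_next[i];
    delta[j] = back * a[j] * (1.0 - a[j]); /* the two element-wise products */
  }
}
```

Note that only activations appear in the formula: the sigmoid derivative is $a^{(l)} .\ast (1-a^{(l)})$, so no extra call to the sigmoid is needed during the backward pass.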

Implementation details

Week 5's programming assignment on NN learning is the most challenging one I have met so far in this course. Initially, I planned to go through lots of implementation details in this section. However, after finishing the model section above and taking another look at the assignment code, I realized that the algorithms described above reflect fairly accurately how the code should be written.

However, there is one point I want to emphasize: $a^{(1)}$ is a column vector with dimension $n \times 1$. This is important if you want to apply the algorithms exactly. When I first coded the program, my $a^{(1)}$ was a row vector with dimension $1 \times n$, which caused me much trouble in the rest of the implementation.

$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$ also looked confusing to me the first time. My question was how many $\Delta^{(l)}$ there are. My trick is to take a look at the last term of the equation. $\delta^{(l+1)}(a^{(l)})^T$ indicates that there is one $\Delta^{(l)}$ for each layer from the second-to-last layer back to the input layer (inclusive). So, in our NN above, there are three $\Delta^{(l)}$ we should update.
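The update itself is just an outer product added into an accumulator. A small C sketch, assuming the matrix is stored row by row and that `a[]` already includes the bias unit (the function name `accumulate_delta` is made up for this post):

```c
#include <assert.h>

/* Accumulate Delta := Delta + delta_next * a^T for one training example.
 * Delta is rows x cols and stored row by row, with rows = S_{l+1} and
 * cols = S_l + 1; a[] holds the activations of layer l including the
 * bias unit a[0] = 1. */
void
accumulate_delta(double Delta[], int rows, int cols,
                 const double delta_next[], const double a[])
{
  int i, j;
  assert(rows > 0 && cols > 0);
  for (i = 0; i < rows; i++)
    for (j = 0; j < cols; j++)
      Delta[i * cols + j] += delta_next[i] * a[j];
}
```

Each $\Delta^{(l)}$ has the same shape as $\Theta^{(l)}$, so for the NN above this would be called three times per training example, once per weight matrix.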

Here, it is unclear to me which two layers we should choose to calculate $\epsilon_\text{init}$. In programming assignment 4, the value is calculated from layer 1 (input layer) and layer 2 (1st hidden layer). ↩
You can use gradient checking to verify if the backpropagation is implemented correctly. ↩

Andrew Ng's ML Week 01 - 03

2017-05-05T16:18:00+08:00

ML overview
- What is ML?
- Types of ML problems
Notation
Linear regression
- In theory
- In practice
Linear regression with regularization
- In theory
- In practice
Logistic regression
- In theory
- In practice
Logistic regression with regularization
- In theory
- In practice

In my introductory post, I mentioned that I decided to write a summary post weekly for the course. However, in practice, I find it very hard to do. This is mainly because I want to keep up the progress in MAW reading while meeting the coursework deadlines. So, I decided to write the summary posts based upon the modules of the material itself.

In addition, like MAW reading posts, I will focus on the reflection and the post itself may not be self-contained. However, this may happen rarely.

Coursera has really well-designed programming assignments that really help to understand both the concepts and their actual implementation. All the code snippets listed in this and upcoming posts are available in my code-for-blog repo.

ML overview

What is ML?

The biggest take-away for me is that ML is to solve the problems that cannot be easily solved by explicit programming. As mentioned by Prof. Ng, we know how to program the shortest path from A to B but we may have a hard time programming a solution for image tagging, email spam checking, and so on. The way we solve those problems is by teaching computers to do things like us through learning algorithms.

There are a lot of examples about ML mentioned in the video:

Database mining: large datasets from growth of automation/web (i.e. web click data, medical records, biology, engineering)
Applications we can't program by hand (e.g. autonomous helicopter, handwriting recognition, most of NLP, CV)
Self-customizing programs (i.e. Amazon, Netflix product recommendations)
Understanding human learning (brain, real AI)

There are two definitions for ML:

Arthur Samuel: the field of study that gives computers the ability to learn without being explicitly programmed. (older, informal definition)
Tom Mitchell: A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

Take playing checkers as an example. $E = \text{the experience of playing many games of checkers}$; $T = \text{the task of playing checkers}$; $P = \text{the probability that the program will win the next game}$.

Types of ML problems

There are two general types: Supervised learning and Unsupervised learning.

Supervised learning: 'right' answer given
- Regression: predict continuous valued output
  - EX1: given data about the size of houses on the real estate market, try to predict their price.
  - EX2: given a picture of a person, we predict their age on the basis of the given picture.
- Classification: predict results in a discrete output (categories)
  - EX1: predict whether the house sells for more or less than the asking price.
  - EX2: given a patient with a tumor, we predict whether the tumor is malignant or benign.
Unsupervised learning: little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
- Clustering: take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables (i.e. lifespan, location, roles)
- Non-clustering: the "cocktail party algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party)
- Other application fields: organize computing clusters, social network analysis, market segmentation, astronomical data analysis

Notation

A few notations used throughout the course:

$n = \text{number of features}$
$m = \text{number of training examples}$
$x^{(i)} = \text{input (features) of }i\text{th training example}$
$x_j^{(i)} = \text{value of feature }j \text{ in }i\text{th training example}$

Linear regression

In theory

For linear regression, our hypothesis is

$$ h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x $$

where

$$ \begin{align} & x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R} ^{n+1} \label{eq:1} & & \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R} ^{n+1} \end{align} $$

and our cost function is

$$ \begin{eqnarray} J(\theta) &=& \frac{1}{2m} \sum_{i=1}^m(\theta^T x^{(i)}-y^{(i)})^2 \label{eq:2} \\ &=& \frac{1}{m} \sum_{i=1}^m \underbrace{\frac{1}{2}(\theta^T x^{(i)}-y^{(i)})^2}_{\text{cost}(h_\theta(x),y)} \label{eq:7} \end{eqnarray} $$

Note

$2$ in the above equation is a convenience for the computation of the gradient descent as the derivative term of the square function will cancel out the $\frac{1}{2}$ term.

In order to find $\theta$ that minimizes our cost function $J(\theta)$, two methods are available to us:

Gradient Descent

$$ \begin{align} \text{Repeat\{ } && \nonumber\\ && \theta_j := \theta_j - \alpha \times \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} && \text{(simultaneously update $\theta_j$ for $j = 0, 1, \dots, n$)} \label{eq:3}\\ \text{\}} \nonumber \end{align} $$

$\alpha$ is called the learning rate, which determines "the step we take downhill", and the part afterwards decides which direction we want to go (derived by taking the partial derivative with respect to $\theta_j$)¹.

Note

$J(\theta)$ should decrease after every iteration of batch gradient descent. If it does not, we want to try a smaller $\alpha$: if $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may not converge. However, if $\alpha$ is too small, gradient descent can be slow to converge (we can declare convergence when $J(\theta)$ decreases by less than some small $\epsilon$, e.g. $10^{-3}$, in one iteration). Thus, to choose $\alpha$, we can try a range of values, say $\dots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \dots$.

Normal Equation

We directly calculate the partial derivative for every $\theta_j$ and set it equal to zero (i.e. $\frac{\partial}{\partial \theta_j}J(\theta) = 0$ for every $j$), and we get:

$$ \begin{equation} \theta = (X^TX)^{-1}X^Ty \label{eq:4} \end{equation} $$

where $X$ is called design matrix, and it has form

$$ \begin{align*} & X = \left[\begin{array}{ccc} - & (x^{(1)})^T & - \\ - & (x^{(2)})^T & - \\ & \vdots & \\ - & (x^{(m)})^T & -\end{array} \right] & x^{(i)} = \begin{bmatrix}x_0^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R} ^{n+1} \end{align*} $$

In practice

One tricky thing I find out when I work through quiz and programming problems is the gap between the mathematical representation and the actual implementation.

For the cost function \ref{eq:2}, we implement it in Octave as follows:

J = 1/(2*m) * (X*theta-y)' * (X*theta - y);

where

$$ X = \begin{bmatrix} x_0^{(1)} && x_1^{(1)} && \dots && x_n^{(1)} \\ x_0^{(2)} && x_1^{(2)} && \dots && x_n^{(2)} \\ \vdots \\ x_0^{(m)} && x_1^{(m)} && \dots && x_n^{(m)} \end{bmatrix} $$

Note that $X$ here is different from $x$ in \ref{eq:1}: $X$ stacks all the training examples as rows, which facilitates the vectorized cost function calculation in the program (i.e. Octave) and is a natural fit with how the data is actually loaded into the program.

Also, if you take a look at our Octave calculation above, we explicitly avoid doing the summation in \ref{eq:2}. We can put the vectorized form used in Octave and the mathematical definition side by side to see the pattern:

$$ J(\theta) = \frac{1}{2m}(X\theta-y)^T(X\theta-y) = \frac{1}{2m} \sum_{i=1}^m(\theta^T x^{(i)}-y^{(i)})^2 $$

Multiplying a vector's transpose by the vector itself is a commonly-seen technique to avoid an explicit summation.
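To see what the vectorized expression saves us, here is a sketch of the same cost with the summation written out as loops. It is in C rather than Octave, and the function name `linear_cost` and the row-by-row storage of $X$ are assumptions for illustration:

```c
#include <assert.h>

/* Squared-error cost J(theta) = 1/(2m) * sum_i (theta^T x^(i) - y^(i))^2,
 * written with explicit loops. X is the m x (n+1) design matrix stored
 * row by row (row i is x^(i), with x_0 = 1). */
double
linear_cost(const double X[], const double y[], const double theta[],
            int m, int n)
{
  int i, j;
  double J = 0.0;
  assert(m > 0 && n >= 0);
  for (i = 0; i < m; i++)
  {
    double h = 0.0; /* h = theta^T x^(i) */
    for (j = 0; j <= n; j++)
      h += theta[j] * X[i * (n + 1) + j];
    J += (h - y[i]) * (h - y[i]);
  }
  return J / (2.0 * m);
}
```

With the three points $(1,1), (2,2), (3,3)$, the line $\theta = (0, 1)$ fits exactly and gives $J = 0$, while $\theta = (0, 0)$ gives $J = \frac{1+4+9}{2 \cdot 3} \approx 2.33$.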

For gradient descent, we can calculate like the following in octave:

theta = theta - alpha * 1/m * (X'*(X*theta - y));

Let me use an example to illustrate why we can calculate \ref{eq:3} like above. Suppose $m = 4$ with $h_\theta(x) = \theta_0x_0+\theta_1x_1$. Then, we have

$$ \begin{align*} X = \begin{bmatrix} x_0^{(1)} && x_1^{(1)} \\ x_0^{(2)} && x_1^{(2)} \\ x_0^{(3)} && x_1^{(3)} \\ x_0^{(4)} && x_1^{(4)} \\ \end{bmatrix} && \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} && h_\theta(x^{(i)}) - y^{(i)} = \begin{bmatrix} \theta_0 + \theta_1x_1^{(1)} - y^{(1)} \\ \vdots \\ \theta_0 + \theta_1x_1^{(4)} - y^{(4)} \end{bmatrix} \end{align*} $$

so now we can show why:

$$ \begin{eqnarray*} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \text{ for all $j$ } &=& (\theta_0 + \theta_1x_1^{(1)} - y^{(1)}) \begin{bmatrix} x_0^{(1)} \\ x_1^{(1)} \end{bmatrix} + \dots + (\theta_0 + \theta_1x_1^{(4)} - y^{(4)}) \begin{bmatrix} x_0^{(4)} \\ x_1^{(4)} \end{bmatrix} \\ &=& \begin{bmatrix} x_0^{(1)} && x_0^{(2)} && x_0^{(3)} && x_0^{(4)} \\ x_1^{(1)} && x_1^{(2)} && x_1^{(3)} && x_1^{(4)} \end{bmatrix} \begin{bmatrix} \theta_0 + \theta_1x_1^{(1)} - y^{(1)} \\ \vdots \\ \theta_0 + \theta_1x_1^{(4)} - y^{(4)} \end{bmatrix} \end{eqnarray*} $$
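The same update can be spelled out with loops, which makes the "simultaneous update" requirement visible. This is only a sketch in C: the function name `gradient_step` and the fixed bound `MAX_PARAMS` are made up here, and $X$ is assumed to be stored row by row.

```c
#include <assert.h>

#define MAX_PARAMS 16 /* illustrative upper bound on n + 1 */

/* One batch gradient descent update:
 *   theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
 * All gradients are computed from the old theta before any theta_j is
 * changed, which is the "simultaneous update" requirement. */
void
gradient_step(const double X[], const double y[], double theta[],
              int m, int n, double alpha)
{
  int i, j;
  double grad[MAX_PARAMS] = {0.0};
  assert(m > 0 && n >= 0 && n + 1 <= MAX_PARAMS);
  for (i = 0; i < m; i++)
  {
    double err = -y[i]; /* becomes h(x^(i)) - y^(i) */
    for (j = 0; j <= n; j++)
      err += theta[j] * X[i * (n + 1) + j];
    for (j = 0; j <= n; j++)
      grad[j] += err * X[i * (n + 1) + j]; /* example i's share of X'*(X*theta - y) */
  }
  for (j = 0; j <= n; j++)
    theta[j] -= alpha * grad[j] / m;
}
```

Starting from $\theta = (0, 0)$ with the points $(1,1), (2,2), (3,3)$ and $\alpha = 0.1$, the summed errors are $-6$ and $-14$, so one step moves $\theta$ to roughly $(0.2, 0.467)$.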

For normal equation, we can calculate like the following in octave:

theta = pinv(X'*X)*X'*y;

This is no different than \ref{eq:4} we mentioned above.

Linear regression with regularization

Quite often, we may face an overfitting issue, which can be fixed by either reducing the number of features or applying regularization.

Regularization is to keep all the features, but reduce magnitude (values) of parameters $\theta_j$. By doing so, we can make our hypothesis simpler and less prone to overfitting.

In theory

With regularization, our new cost function becomes

$$ J(\theta) = \frac{1}{2m}\Big[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \underbrace{\lambda \sum_{j=1}^n \theta_j^2}_{\text{regularization term}}\Big] $$

The regularization parameter $\lambda$ controls the tradeoff between "fit the data well" and "keep parameters small to avoid overfitting". If $\lambda$ is set to an extremely large value, then we may face "underfit" issue (i.e. all $\theta_j$ for $j = 1, \dots, n$ close to 0)².

Since our cost function has changed, both gradient descent and normal equation have to adjust accordingly:

Gradient Descent

$$ \begin{align} \text{Repeat\{ } && \nonumber \\ \theta_0 := \theta_0 - \alpha \times \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} && \label{eq:5} \\ \theta_j := \theta_j - \alpha \times \lbrack \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\rbrack && (j = 1,2,3, \dots, n) \label{eq:6} \\ \text{\}} \nonumber \end{align} $$

Here, it might be a good time to write out the gradient explicitly (rather than embedding them in the gradient descent algorithm). Gradient descent is only one of many algorithms that optimizes a given function. We will use other algorithms later in the course and the only thing they require is the gradients.

$$ \begin{align*} \frac{\partial J(\theta)}{\partial \theta_0} &= \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} && \text{ for } j = 0 \\ \frac{\partial J(\theta)}{\partial \theta_j} &= \Big(\frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\Big) + \frac{\lambda}{m}\theta_j && \text{ for } j \ge 1 \end{align*} $$

Normal Equation

$$ \theta = (X^TX + \lambda \begin{bmatrix} 0 && && && \\ && 1 && && \\ && && \ddots && \\ && && && 1 \end{bmatrix} )^{-1}X^Ty $$

We want $\lambda > 0$ so that the matrix being inverted is guaranteed to be invertible.

In practice

The implementation of regularized linear regression doesn't differ from the unregularized one in terms of matrix technique. The following code chunk demonstrates a way to calculate the cost function $J(\theta)$ and the gradients (not gradient descent):

J = 1/(2*m)*(sum((X*theta-y).^2)) + lambda/(2*m)*(sum(theta(2:end).^2));         
G = (lambda/m) .* theta;
G(1) = 0; 
grad = ((1/m) * X' * (X*theta - y)) + G;
grad = grad(:);

Logistic regression

Logistic regression is a classification algorithm. It suits classification better than linear regression because 1) the linear regression classification result is highly impacted by outliers and 2) the linear regression hypothesis $h_\theta (x)$ can output values $>1$ or $<0$, which doesn't fit the nature of a classification task.

In contrast, as we will see, logistic regression outputs $0 \le h_\theta (x) \le 1$, which can be interpreted from a probability perspective.

In theory

Logistic regression hypothesis is

$$ h_\theta(x) = g(\theta^Tx) \text{ where $g(z) = \frac{1}{1+e^{-z}}$} $$

This hypothesis can be interpreted as the probability that $y = 1$ given $x$ and $\theta$ (i.e. $h_\theta(x) = P(y = 1 | x;\theta)$).

In linear regression, we have the cost function \ref{eq:2}. However, its squared-error $\text{cost}(h_\theta(x),y)$ does not work for logistic regression because the resulting $J(\theta)$ is not convex. In order to make $J(\theta)$ convex, we use the following $\text{cost}(h_\theta(x),y)$ for logistic regression

$$ \text{cost}(h_\theta(x),y)=\left\{ \begin{array}{ll} -\log (h_\theta(x)) \text{ if } y = 1 \\ -\log (1 - h_\theta(x)) \text{ if } y = 0 \end{array} \right. $$

We can rewrite the above equation as $\text{cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1-y) \log (1-h_\theta(x))$ and then the cost function $J(\theta)$ is

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^m \lbrack (y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})))\rbrack $$

To minimize the cost function $J(\theta)$ we can of course use gradient descent. Surprisingly, the gradient descent update for logistic regression has exactly the same form as the one for linear regression \ref{eq:3}; the only difference is that $h_\theta(x)$ is now $g(\theta^Tx)$ instead of $\theta^Tx$.

However, in the course, we directly use fminunc from Octave to do the optimization. Internally, the function uses an advanced optimization technique that avoids manually picking $\alpha$ and can find the optimal $\theta$ faster than gradient descent.

Note

For a multiclass classification problem, we have $h_\theta^{(i)}(x) = P(y=i | x; \theta)$ where $i = 1,2,3,\dots$. That is, we train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$. On a new input $x$, to make a prediction, we pick the class $i$ that gives the highest $h_\theta^{(i)}(x)$ and predict $y$ with that value $i$.
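The prediction step is just an argmax over the per-class outputs. A minimal C sketch, assuming the $K$ classifier outputs for one input have already been computed and stored in `h[]` (the function name `predict_class` is hypothetical):

```c
#include <assert.h>

/* Given h[i-1] = h_theta^(i)(x) for the classes i = 1..K, return the class
 * with the largest value. Ties go to the smaller class index. */
int
predict_class(const double h[], int K)
{
  int i, best = 0;
  assert(K > 0);
  for (i = 1; i < K; i++)
    if (h[i] > h[best])
      best = i;
  return best + 1; /* classes are numbered from 1 */
}
```

For example, with outputs $(0.1, 0.7, 0.2)$ the prediction is class $2$.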

In practice

The implementation of the cost function and the gradient for logistic regression should not be hard for us now:

% cost function for logistic regression
J = 1/m * sum( ...
                 -y'*log(sigmoid(X*theta))- ...       
                 (1-y)'*log(1-sigmoid(X*theta)) ...    
             );

% gradient for logistic regression
grad = 1/m * X'*(sigmoid(X*theta) - y);

Logistic regression with regularization

In theory

The cost function for regularized logistic regression is the following:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^m \lbrack (y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})))\rbrack + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 $$

and the gradient descent has exactly the same form as the one for regularized linear regression, \ref{eq:5} and \ref{eq:6}.

In practice

The following code chunk shows the cost function and the gradient for regularized logistic regression:

m = length(y);       % number of training examples
t = length(theta);   % number of theta parameters (a scalar, so 2:t is a valid range)

J = 0;
grad = zeros(size(theta));

J = 1/m * sum( ...
               -y'*log(sigmoid(X*theta)) ...
               -(1-y)'*log(1-sigmoid(X*theta)) ...
             ) ...
          + lambda / 2 / m * theta(2: t)'*theta(2: t);

grad(1) = (1/m * X'*(sigmoid(X*theta) - y))(1);
grad(2:t) =  (1/m * X'*(sigmoid(X*theta) - y) + lambda/m * theta)(2: t);

We may need to do feature scaling when we work with gradient descent. ↩
We don't penalize $\theta_0$. ↩

Shell sort

2017-05-01T21:33:00+08:00

Per the final paragraph of the last post, the algorithm needs to avoid swapping only adjacent elements (in other words, it needs to compare elements that are distant) so that each swap has the opportunity to remove more than one inversion, which lets us break the $O(N^2)$ barrier. This is exactly what shellsort tries to achieve.

Concept

Shellsort is also referred to as diminishing increment sort: it works by swapping non-adjacent elements; the distance between comparisons decreases as the algorithm runs until the last phase, in which adjacent elements are compared.

Concretely, shellsort uses an increment sequence $h_1, h_2, \dots, h_t$ ¹:

We start with $k=t$
Sort all subsequences of elements that are $h_k$ apart so that $A[i] \le A[i+h_k]$ for all i. In other words, all elements spaced $h_k$ apart are sorted. ($h_k$-sort)
Go to the next smaller increment $h_{k-1}$ and repeat until $k = 1$

A popular but poor choice for the increment sequence is the one proposed by Shell: $h_t = \lfloor{N/2}\rfloor$ and $h_k = \lfloor{h_{k+1}/2}\rfloor$.

Here is the shellsort using Shell's increments ²:

void
shellSort(int A[], int N)
{
  int i, j, increment, tmp;
  for (increment = N/2; increment > 0; increment /= 2)
    for(i = increment; i < N; i++)
    {
      tmp = A[i];
      for(j = i; j >= increment; j -= increment)
        if (tmp < A[j-increment])
          A[j] = A[j-increment];
        else
          break;
      A[j] = tmp;
    }
}

Here is an example of the algorithm in action (using Shell's increment sequence):

| index        | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|--------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| original     | 81 | 94 | 11 | 96 | 12 | 35 | 17 | 95 | 28 | 58 | 41 | 75 | 15 |
| After 6-sort | 15 | 94 | 11 | 58 | 12 | 35 | 17 | 95 | 28 | 96 | 41 | 75 | 81 |
| After 3-sort | 15 | 12 | 11 | 17 | 41 | 28 | 58 | 94 | 35 | 81 | 95 | 75 | 96 |
| After 1-sort | 11 | 12 | 15 | 17 | 28 | 35 | 41 | 58 | 75 | 81 | 94 | 95 | 96 |

Analysis

The running time of shellsort depends on how we pick the increment sequence. MAW gives running time for two commonly-seen increment sequences:

The worst-case running time of Shellsort, using Shell's increments, is $\Theta(N^2)$.
The worst-case running time of Shellsort, using Hibbard's increments ($1,3,7, \dots, 2^k-1$) ³, is $\Theta(N^{3/2})$.

The average-case time is $O(N^{3/2})$ when using Hibbard's increments. With Shell's increments, the worst case occurs when the smallest elements are in the odd positions and the largest in the even positions (e.g. 2,11,4,12,6,13,8,14): only the last pass (i.e. $h_1 = 1$) does any work, and it is an insertion sort taking $O(N^2)$. The best case happens when the increment sequence consists of the single increment 1 and the array is already sorted; in this case, we have $O(N)$.
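For comparison, here is a sketch of shellsort driven by Hibbard's increments. The inner loops are the same as in the Shell's-increment version above; only the way the increments are generated changes. The function name `shellSortHibbard` and the loop that finds the starting increment are my own and not taken from MAW:

```c
#include <assert.h>

/* Shellsort using Hibbard's increments 1, 3, 7, ..., 2^k - 1. */
void
shellSortHibbard(int A[], int N)
{
  int i, j, increment, tmp;
  assert(N >= 0);
  /* for N >= 2: largest increment of the form 2^k - 1 that is smaller than N */
  for (increment = 1; 2 * increment + 1 < N; increment = 2 * increment + 1)
    ;
  /* ((2^k - 1) - 1) / 2 = 2^(k-1) - 1, so this walks ..., 7, 3, 1 and stops at 0 */
  for (; increment > 0; increment = (increment - 1) / 2)
    for (i = increment; i < N; i++)
    {
      tmp = A[i];
      for (j = i; j >= increment && tmp < A[j - increment]; j -= increment)
        A[j] = A[j - increment];
      A[j] = tmp;
    }
}
```

For the 13-element example above, the passes would use increments 7, 3, and 1 instead of 6, 3, and 1.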

Shellsort is good for up to $N \approx 10000$ and its simplicity makes it a favorite.

Properties

an $h_k$-sorted array that is then $h_{k-1}$ sorted remains $h_k$ sorted (why algorithm works).
the action of an $h_k$-sort is to perform an insertion sort on $h_k$ independent subarrays with about $N/h_k$ elements each (e.g. if $h_k = 6$ and $N = 13$, there are 6 subarrays (by index): {0,6,12}, {1,7}, {2,8}, {3,9}, {4,10}, {5,11}).
a larger increment swaps more distant pairs (natural derivation of the above property).
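The first property can be checked directly on the example table. Below is a small helper, written as a sketch (the name `isHSorted` is hypothetical), that tests whether an array is $h$-sorted:

```c
#include <assert.h>

/* Return 1 if A is h-sorted, i.e. A[i] <= A[i+h] for every valid i,
 * and 0 otherwise. */
int
isHSorted(const int A[], int N, int h)
{
  int i;
  assert(h > 0);
  for (i = 0; i + h < N; i++)
    if (A[i] > A[i + h])
      return 0;
  return 1;
}
```

Running it on the "After 3-sort" row of the table reports that the row is both 3-sorted and still 6-sorted, but not yet 1-sorted (15 comes before 12).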

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 7
Sorting cheat sheet from Duke U.
Lecture 15 and lecture 16 from U.Washington
Notes from MIT
Lecture from U.Rochester

Any increment sequence will do as long as the last increment is 1 (i.e. $h_1 = 1$). However, choosing the increment is a practice of art: some choices dominate others. ↩
As suggested by the algorithm above, the general strategy to $h_k$-sort is for each position, $i$, in $h_k, h_k+1, \dots, N-1,$ place the element in the correct spot among $i, i-h_k, i-2h_k$, etc. ↩
The key difference between Hibbard's increments and Shell's increments is that adjacent Hibbard increments have no common factors. The problem with Shell's increments is that we keep comparing the same elements over and over again. We need increments such that different elements are compared in different passes. ↩

Simple sorting algorithms

2017-04-24T21:33:00+08:00

This post summarizes three typical simple sorting algorithms: bubble sort, selection sort, and insertion sort. In chapter 7, MAW mainly talks about insertion sort but for the sake of completeness, I will include the other two as well ¹.

Bubble sort
- Concept
- Analysis
Selection sort
- Concept
- Analysis
Insertion sort
- Concept
- Analysis
A Lower Bound for Simple Sorting Algorithms
Links to resources

Bubble sort

Concept

The idea for the bubble sort is to "bubble" larger elements to the end of the array by comparing $A[i]$ and $A[i+1]$, and swapping them if $A[i] > A[i+1]$. We repeat this from the first element to the end of the unsorted part of the array.

The following code demonstrates the actual algorithm

#define SWAP(a,b)  {int t; t = a; a = b; b = t;}
void bubbleSort(int A[], int n)
{
  int i, j;
  for(i = 0; i < n; i++) // n passes thru the array
    for(j = 1; j < (n-i); j++) // from start to the end of unsorted part
      if(A[j-1] > A[j]) SWAP(A[j-1], A[j]); 
}

The key for the algorithm is that we only do the "bubble up" operation for the unsorted part. The following gives an example of the algorithm in action:

| index    | 0  | 1  | 2  | 3  | 4  | 5  |
|----------|----|----|----|----|----|----|
| original | 34 | 8  | 64 | 51 | 32 | 21 |
| pass 0   | 8  | 34 | 51 | 32 | 21 | 64 |
| pass 1   | 8  | 34 | 32 | 21 | 51 | 64 |
| pass 2   | 8  | 32 | 21 | 34 | 51 | 64 |
| pass 3   | 8  | 21 | 32 | 34 | 51 | 64 |
| pass 4   | 8  | 21 | 32 | 34 | 51 | 64 |
| pass 5   | 8  | 21 | 32 | 34 | 51 | 64 |

Analysis

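Bubble sort is stable and in place. The running time is $O(N^2)$, which is true for both worst case and average case. $O(N)$ can be achieved in the best case, where the array is sorted or mostly sorted (possibly with a few elements a place or two away from their correct spots), but only if we stop as soon as a pass makes no swap; the code above always runs all $n$ passes and so takes $O(N^2)$ even on sorted input.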
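One common way to get the $O(N)$ best case is to track whether a pass swapped anything. The variant below is a sketch of that idea, not the code from the book; the function name `bubbleSortEarlyExit` and the returned pass count are additions for illustration:

```c
#include <assert.h>

/* Bubble sort that stops as soon as a full pass makes no swap.
 * Returns the number of passes performed, so a sorted input costs a
 * single pass of n-1 comparisons. */
int
bubbleSortEarlyExit(int A[], int n)
{
  int i, j, t, swapped, passes = 0;
  assert(n >= 0);
  for (i = 0; i < n; i++)
  {
    swapped = 0;
    passes++;
    for (j = 1; j < (n - i); j++)
      if (A[j-1] > A[j])
      {
        t = A[j-1]; A[j-1] = A[j]; A[j] = t;
        swapped = 1;
      }
    if (!swapped)
      break; /* no adjacent pair is out of order: the array is sorted */
  }
  return passes;
}
```

On an already sorted array the function returns after one pass, while the six-element example above needs five passes (the fifth one finds nothing to swap).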

Selection sort

Concept

The idea for selection sort is to scan the array, select the smallest key, and swap it with the first element of the array (i.e. $A[0]$); then scan the remaining keys, select the smallest, and swap it with the second element (i.e. $A[1]$); and repeat the whole process until the last element is reached. In other words, after the $i$th pass, the first $i$ elements are sorted and in their proper positions.

Like the bubble sort, we divide the whole array into sorted part and unsorted part: we start with unsorted array and keep the sorted array at the beginning. Each time we scan the unsorted part of the array and decide which element should go next into the sorted part. However, unlike bubble sort, we build the sorted part from the beginning of the array (in bubble sort, we start with moving the largest element to the end of array).

The following code demonstrates the actual algorithm

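void
selectionSort(int A[], int N)
{
  int i, j, min, tmp;
  for(i = 0; i < N-1; i++)
  {
    min = i; // restart the search at the first unsorted position
    for(j = i+1; j < N; j++)
      if(A[j] < A[min])
        min = j;
    // swap the smallest unsorted element into position i
    tmp = A[min];
    A[min] = A[i];
    A[i] = tmp;
  }
}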

Here is an example of the algorithm in action:

| index    | 0  | 1  | 2  | 3  | 4  | 5  |
|----------|----|----|----|----|----|----|
| original | 34 | 8  | 64 | 51 | 32 | 21 |
| pass 0   | 8  | 34 | 64 | 51 | 32 | 21 |
| pass 1   | 8  | 21 | 64 | 51 | 32 | 34 |
| pass 2   | 8  | 21 | 32 | 51 | 64 | 34 |
| pass 3   | 8  | 21 | 32 | 34 | 64 | 51 |
| pass 4   | 8  | 21 | 32 | 34 | 51 | 64 |

Analysis

The selection sort is NOT STABLE but in place. Selection sort is not sensitive to the input and thus running time should be the same in best, average, and worst cases: We go through $N-1$ passes with $N-1, \dots, 1$ comparisons, which is $O(N^2)$.

Since selection sort is insensitive to the data, it's good if we want to have our sort routine always take the same time.
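To convince myself of the "insensitive to the input" claim, here is a small sketch that counts key comparisons. The function name `selection_sort_count` is mine, not from MAW, and the swap is inlined so the snippet stands on its own.

```c
#include <assert.h>

/* Selection sort that also counts key comparisons.
   Returns the number of comparisons performed. */
long selection_sort_count(int a[], int n)
{
    long comparisons = 0;
    for (int i = 0; i < n - 1; i++) {
        int min = i;                    /* smallest key seen so far in a[i..n-1] */
        for (int j = i + 1; j < n; j++) {
            comparisons++;
            if (a[j] < a[min])
                min = j;
        }
        int tmp = a[i];                 /* move the smallest key into position i */
        a[i] = a[min];
        a[min] = tmp;
    }
    return comparisons;
}
```

For $N = 6$ the count should be $5 + 4 + 3 + 2 + 1 = 15$ whether the input is already sorted or in reverse order.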

Insertion sort

Concept

The idea for insertion sort is that we insert an as-yet-unprocessed record into a sorted list of the records processed so far. In detail, insertion sort consists of $N-1$ passes. For pass $P = 1$ through $N-1$, insertion sort ensures that the elements in positions $0$ through $P$ are in sorted order. In pass $P$, we move the element in position $P$ left until its correct place is found among the first $P+1$ elements.

The following code demonstrates the actual algorithm

void
insertionSort(int A[], int N)
{
  int j, P;
  int tmp;
  for(P = 1; P < N; P++)
  {
    tmp = A[P];
    for(j = P; j > 0 && tmp < A[j-1]; j--)
      A[j] = A[j-1];
    A[j] = tmp;
  }
}

Here is an example of the algorithm in action:

| index    | 0  | 1  | 2  | 3  | 4  | 5  |
| original | 34 | 8  | 64 | 51 | 32 | 21 |
|----------|----|----|----|----|----|----|
| pass 1   | 8  | 34 | 64 | 51 | 32 | 21 |
| pass 2   | 8  | 34 | 64 | 51 | 32 | 21 |
| pass 3   | 8  | 34 | 51 | 64 | 32 | 21 |
| pass 4   | 8  | 32 | 34 | 51 | 64 | 21 |
| pass 5   | 8  | 21 | 32 | 34 | 51 | 64 |

Analysis

Due to the nested loops, the running time is $O(N^2)$, which can be achieved when the input array is in reverse sorted order. In the best case, where the input array is already sorted, the running time is $O(N)$. For the average case, the running time is $O(N^2)$. In fact, the bound is tight for both average case and worst case: $\Theta (N^2)$.

In addition, insertion sort is stable and in place. Insertion sort is most effective on small input arrays (roughly $N < 20$) and on almost sorted arrays.

A Lower Bound for Simple Sorting Algorithms

An inversion is a pair of elements in wrong order (i.e. $i < j$ but $A[i] > A[j]$).
The simple sorting algorithms presented in this post swap adjacent elements (explicitly or implicitly), removing one inversion per swap. This makes the running time proportional to the number of inversions in the array.
The average number of inversions in an array of $N$ distinct numbers is $N(N-1)/4$.
Any algorithm that sorts by exchanging adjacent elements requires $\Omega (N^2)$ time on average. This is due to the fact that each adjacent swap removes only one inversion.
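A quick sketch to see the inversion argument in action: count the inversions by brute force, then count how many one-position shifts insertion sort performs. The two numbers should match. The helper names `count_inversions` and `insertion_sort_shifts` are my own.

```c
#include <assert.h>

/* Count pairs (i, j) with i < j and a[i] > a[j] by brute force. */
long count_inversions(const int a[], int n)
{
    long inv = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] > a[j])
                inv++;
    return inv;
}

/* Insertion sort that returns how many one-position shifts it made.
   Each shift moves tmp past one larger neighbor, i.e. fixes one inversion. */
long insertion_sort_shifts(int a[], int n)
{
    long shifts = 0;
    for (int p = 1; p < n; p++) {
        int tmp = a[p];
        int j;
        for (j = p; j > 0 && tmp < a[j - 1]; j--) {
            a[j] = a[j - 1];
            shifts++;
        }
        a[j] = tmp;
    }
    return shifts;
}
```

On the array 34, 8, 64, 51, 32, 21 used in this post, both functions report 9.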

As you can tell, to break the $O(N^2)$ barrier, we must remove more than one inversion per swap. Swapping adjacent elements will certainly not help us achieve this goal. The idea is to swap elements that are far apart and hopefully remove more than one inversion per swap. Shell sort is the first algorithm to break the $O(N^2)$ running time. I'll talk about it in my next post.

Links to resources

Here are some of the resources I found helpful while preparing this article:

bubble sort video, selection sort video, insertion sort video and this animation can help you understand the concept. ↩

Introducing the "Andrew Ng's ML course study notes"

2017-04-21T23:48:00+08:00

I finally enrolled in Andrew Ng's machine learning course on Coursera. Here is my expectation for this course:

Get a fun intro to machine learning field.

I studied mathematics, statistics, and took AI course when I was an undergraduate but I never had an intro to ML formally. All the work I have done so far has a very strong relationship with ML but they don't really target on ML specifically. So, I think Andrew Ng's ML course is probably a great intro to this field.

In preparation for my graduate studies.

I'm thinking of pursuing a research career in NLP and robotics but I need to do some groundwork to see if they are actually as fun as I'm picturing in my mind. In addition, before actually taking related graduate-level courses, there might be some gaps I need to fill. So I think Andrew Ng's course may be a great bridge course to warm me up for serious graduate-level ML studies.

Take a break from Algo studies and keep myself motivated.

I'm currently working on a reading project to finish MAW by the end of this September. The progress so far is on track and I'm having a lot of fun with the book. However, sometimes, I want to experience some different flavor of dishes and take a break. In addition, I'm thinking of the next reading project I'm going to do. It's highly likely going to be a book on linear algebra. I have taken linear algebra before in college but I found the subject can become quite boring very soon if you don't have specific problems or needs you want to address. Hopefully, Andrew Ng's course will help me find some motivation to study linear algebra well.

Start to make résumé ML-ish

I'm working in DB field but I always want to do ML by nature judged by my performance in the ML-related courses. Taking ML course on coursera and having a nice badge on my LinkedIn may greatly help me to market my ML expertise in the future?

My current plan is to finish the Coursera version first, and then move on to the CS229 version if time permits. I want to take an agile approach to this material by building up iteratively.

Sorting prelim

2017-04-18T10:03:00+08:00

Chapter 7: Sorting will have some rigorous analysis of the sorting algorithms (no wonder as suggested by the title of the book). Some meta-concepts related to sorting appeared at the very beginning of the chapter. I usually push them to the end-chapter summary post but this time I decided to do a writeup beforehand because I find it is really hard to talk about various sorting schemes without setting up some ground concepts first.

Definitions

Sorting problem: Given an array $A$, output $A$ such that: For any $i$ and $j$, if $i < j$, then $A[i] \le A[j]$.¹
A sorting algorithm that uses comparison operators (i.e. $<, >, =$) is known as comparison-based sorting. Another major type is called counting sort (e.g. radix sort).
If the entire sort can be done in main memory (i.e. the number of elements is relatively small, usually less than a million), we call it internal sorting. By contrast, if the data is on disk, we call it external sorting.
An algorithm that requires $O(1)$ extra space is known as an in place sorting algorithm.²
A sorting algorithm is stable ³ if elements with equal keys are left in the same order as they occur in the input. In other words, we can ask ourselves the question: Does it rearrange the order of input data records which have the same key value (duplicates)? If the answer is No, then the sorting algorithm is stable. For example, suppose a phone book is sorted by name. Now let's sort the book by country - is the list still sorted by name within each country? As you can tell, it is an extremely important property for databases.
There will be three kinds of running time mentioned in the sorting analysis:
- average case time: given an arbitrary input, what do we expect the running time to be.
- worst case time: for a particular degenerate case, how bad will the algorithm perform.
- best case time: for a particularly benevolent input case, what is the best case performance.
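The phone book example for stability can be sketched in a few lines. This is only an illustration under my own assumptions: the `Entry` struct, the names, and the country codes are made up, and I use insertion sort because a strict comparison never moves equal keys past each other.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical phone book record used only for this illustration. */
struct Entry {
    const char *name;
    const char *country;
};

/* Insertion sort by country. The strict "< 0" test means a record
   never moves past another record with the same country, so the
   order that existed before the sort is preserved within each country. */
void sort_by_country(struct Entry a[], int n)
{
    for (int p = 1; p < n; p++) {
        struct Entry tmp = a[p];
        int j;
        for (j = p; j > 0 && strcmp(tmp.country, a[j - 1].country) < 0; j--)
            a[j] = a[j - 1];
        a[j] = tmp;
    }
}
```

If the input is already sorted by name (Alice/US, Bob/CN, Carol/US, Dave/CN), the output is Bob, Dave, Alice, Carol: grouped by country and still in name order inside each group.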

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 7
Sorting cheat sheet from Duke U.
Lecture material from U.Washington and MIT

Here, for the input, we are given an array $A$ of data records, each with a key (which can be an integer, character, string, etc) as long as the following condition can be met: 1. There is an ordering on the set of possible keys 2. We can compare any two keys using $<, >, =$ ↩
Under the context of the sorting, we may ask: Does the sorting algorithm require extra memory to sort the collection of items? Do we need to copy and temporarily store some subset of the keys/data records? ↩
When we evaluate the performance of a sorting algorithm, we usually evaluate it from three perspectives: running time, memory requirements (aka space), and stability. ↩

MAW Chapter 7: Sorting writing questions

2017-04-15T23:54:00+08:00

Solutions

including: MAW 7.1, 7.2, 7.3, 7.4, 7.5.a, 7.9, 7.10, 7.11, 7.12, 7.13, 7.15

MAW 7.1

Sort the sequence 3,1,4,1,5,9,2,6,5 using insertion sort.

| index    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| original | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | 5 |
|----------|---|---|---|---|---|---|---|---|---|
| pass 1   | 1 | 3 | 4 | 1 | 5 | 9 | 2 | 6 | 5 |
| pass 2   | 1 | 3 | 4 | 1 | 5 | 9 | 2 | 6 | 5 |
| pass 3   | 1 | 1 | 3 | 4 | 5 | 9 | 2 | 6 | 5 |
| pass 4   | 1 | 1 | 3 | 4 | 5 | 9 | 2 | 6 | 5 |
| pass 5   | 1 | 1 | 3 | 4 | 5 | 9 | 2 | 6 | 5 |
| pass 6   | 1 | 1 | 2 | 3 | 4 | 5 | 9 | 6 | 5 |
| pass 7   | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 9 | 5 |
| pass 8   | 1 | 1 | 2 | 3 | 4 | 5 | 5 | 6 | 9 |

MAW 7.2

What is the running time of insertion sort if all keys are equal?

If you take a look at the code on p. 220, you can see that inner for loop checks A[j-1] > tmp and it will fail immediately. Thus, the running time is $O(N)$.

MAW 7.3

Suppose we exchange elements $A[i]$ and $A[i+k]$, which were originally out of order. Prove that at least 1 and at most $2k-1$ inversions are removed.

The inversion that existed between $A[i]$ and $A[i+k]$ is removed. This shows at least one inversion is removed. Now let's consider $A[i], A[i+1], \dots, A[i+k-1], A[i+k]$. Suppose $A[i]$ is greater than $A[i+1], \dots, A[i+k]$ and $A[i+k]$ is smaller than $A[i], \dots, A[i+k-1]$. In this case, by swapping $A[i]$ and $A[i+k]$, we fix $2k-1$ inversions (the $-1$ is because "$A[i]$ greater than $A[i+k]$" and "$A[i+k]$ smaller than $A[i]$" point to the same inversion).

Another way to think about $2k-1$ is that for each of the $k-1$ elements $A[i+1], A[i+2], \dots, A[i+k-1]$, at most two inversions can be removed by the exchange. For instance, for $A[i+1]$, the two inversions are $A[i]$ and $A[i+1]$, and $A[i+1]$ and $A[i+k]$ (e.g. for the sequence 10,4,3, by swapping 10 and 3, we remove the inversions {10,4} and {4,3}). Adding the inversion between $A[i]$ and $A[i+k]$ itself gives a maximum of $2(k-1)+1 = 2k-1$.
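Both bounds are easy to check numerically. The sketch below (helper names are mine) counts inversions before and after exchanging $A[i]$ and $A[i+k]$ and returns the difference.

```c
#include <assert.h>

/* Count pairs (i, j) with i < j and a[i] > a[j] by brute force. */
long count_inversions(const int a[], int n)
{
    long inv = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] > a[j])
                inv++;
    return inv;
}

/* Exchange a[i] and a[i+k] and report how many inversions disappeared.
   The caller is expected to pass a pair that is out of order. */
long inversions_removed_by_swap(int a[], int n, int i, int k)
{
    long before = count_inversions(a, n);
    int tmp = a[i];
    a[i] = a[i + k];
    a[i + k] = tmp;
    return before - count_inversions(a, n);
}
```

With $k = 2$, the sequence 10,4,3 hits the upper bound $2k-1 = 3$, while 5,7,3 hits the lower bound of 1 (the swap fixes {5,3} and {7,3} but creates {7,5}).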

MAW 7.4

Show the result of running Shellsort on the input 9,8,7,6,5,4,3,2,1 using the increments {1,3,7}

| index        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| original     | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|--------------|---|---|---|---|---|---|---|---|---|
| after 7-sort | 2 | 1 | 7 | 6 | 5 | 4 | 3 | 9 | 8 |
| after 3-sort | 2 | 1 | 4 | 3 | 5 | 7 | 6 | 9 | 8 |
| after 1-sort | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
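The table can be reproduced with a single helper that performs one $h$-sort, i.e. an insertion sort over elements that are $h$ apart. This is a sketch with my own function name `h_sort`; calling it with 7, then 3, then 1 corresponds to the increments in the question.

```c
#include <assert.h>

/* One pass of Shellsort: insertion sort on the elements that are h apart. */
void h_sort(int a[], int n, int h)
{
    for (int i = h; i < n; i++) {
        int tmp = a[i];
        int j;
        for (j = i; j >= h && tmp < a[j - h]; j -= h)
            a[j] = a[j - h];   /* shift the larger element h slots to the right */
        a[j] = tmp;
    }
}
```

Running `h_sort(a, 9, 7)`, `h_sort(a, 9, 3)`, and `h_sort(a, 9, 1)` in that order on 9,8,7,6,5,4,3,2,1 gives the three rows above.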

MAW 7.5.a

What is the running time of Shellsort using the two-increment sequence {1,2}?

The answer is $\Theta(N^2)$. Let's first show the lower bound. By the conclusion of 7.3, we know that the 2-sort removes at most three (i.e. $k=2$) inversions at a time. In addition, a pass with increment $h_k$ consists of $h_k$ insertion sorts of about $N/h_k$ elements. Then, by theorem 7.2, we know that the algorithm is $\Omega(N^2)$. For the upper bound, the 2-sort is two insertion sorts of size $N/2$, so the cost of that pass is $O(N^2)$. The 1-sort is also $O(N^2)$, so the upper bound for the algorithm is $O(N^2)$.

MAW 7.9

Determine the running time (i.e. number of swaps) of Shellsort for

a. sorted input

$O(N \log N)$. No exchanges are actually done in each pass but we will still need to go through the second for loop, which indicates that each pass takes $O(N)$. There are $O(\log N)$ passes in total and the answer follows.

b. reverse-ordered input

$O(N \log N)$. It is easy to show that after an $h_k$ sort, no element is farther than $h_k$ from its rightful position. Thus, if the increments satisfy $h_{k+1} \le ch_k$ for a constant $c$, which implies $O(\log N)$ increments, then the bound follows.

However, one cannot talk about shellsort without specifying the increment sequence. If we assume the shell sequence (i.e. $N/2, N/4, \dots, 2, 1$), then the running time is $O(N^2)$ as suggested by this answer, which I'll copy below for future reference.

Shellsort is just a bunch of insertion sorts. For a given increment $I$, there will be $I$ subarrays to sort by insertion, each of length $N/I$. We know that insertion sort requires time $O(m^2)$ to sort a reverse-sorted array of length $m$. Here, $m$ will be ($N/I$) for each subarray. Thus one subarray will cost $(N/I)^2$ to sort. There are $I$ subarrays, so the total cost will be $I * (N/I)^2 = N^2/I$. But that is the cost just for a single increment. The total time for all of the iterations must be $N^2/(N/2) + N^2/(N/4) + N^2/(N/8) + \dots + N^2/2 + N^2/1 = 2N + 4N + \dots + N^2/2 + N^2/1$ . If we factor out an $N$, we get $N(2+4+\dots+N/2+N)$ . In parentheses is the sum of powers of 2 from 2 to $N$, which is approximately equal to $2N$. Therefore, the total cost is $N(2N) = 2N^2 = O(N^2)$.

MAW 7.10

Do either of the following modifications to the Shellsort routine coded in Fig. 7.4 affect the worst case running time?

a. Before line 2, subtract one from Increment if it is even.

The key improvement in terms of the worst case running time lies in the increment sequence. As suggested on p.224,225, we improve the worst case running time from $O(N^2)$ to $O(N^{3/2})$ by changing the increment sequence into one in which consecutive increments have no common factors.

If we follow the modification indicated by this question, it is still possible for consecutive increments to share a common factor. For instance, if we sort an array of size $N = 45$, then with the modification, the increment sequence will be $21 (22-1), 9, 3, 1$, where 21, 9, and 3 all share the factor 3. So the modification does not improve the worst case.

b. Before line 2, add one to Increment if it is even.

In this case, consecutive increments are relatively prime and by the argument in the proof of theorem 7.4, we can have the worst case running time $O(N^{3/2})$.

MAW 7.11

Show how heapsort processes the input 142, 543, 123, 65, 453, 879, 572, 434, 111, 242, 811, 102.

The input is read in as it appears in the question. We first build the (max) heap, which results in

879, 811, 572, 434, 543, 123, 142, 65, 111, 242, 453, 102

$879$ is removed from the heap and placed at the end. We'll put | to separate the elements that are sorted and not part of the heap. $102$ is placed in the hole and bubbled down, obtaining

811, 543, 572, 434, 453, 123, 142, 65, 111, 242, 102, | 879

continuing the process, we obtain

572, 543, 142, 434, 453, 123, 102, 65, 111, 242, | 811, 879
543, 453, 142, 434, 242, 123, 102, 65, 111, | 572, 811, 879
453, 434, 142, 111, 242, 123, 102, 65, | 543, 572, 811, 879
434, 242, 142, 111, 65, 123, 102, | 453, 543, 572, 811, 879
242, 111, 142, 102, 65, 123, | 434, 453, 543, 572, 811, 879
142, 111, 123, 102, 65, | 242, 434, 453, 543, 572, 811, 879
123, 111, 65, 102, | 142, 242, 434, 453, 543, 572, 811, 879
111, 102, 65, | 123, 142, 242, 434, 453, 543, 572, 811, 879
102, 65, | 111, 123, 142, 242, 434, 453, 543, 572, 811, 879
65, | 102, 111, 123, 142, 242, 434, 453, 543, 572, 811, 879
| 65, 102, 111, 123, 142, 242, 434, 453, 543, 572, 811, 879
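Here is a minimal heapsort sketch that follows the same steps, assuming a 0-based array and a max heap (children of $i$ at $2i+1$ and $2i+2$). The names `perc_down`, `build_max_heap`, and `heap_sort` are my own choices rather than the book's exact code.

```c
#include <assert.h>

/* Restore max-heap order below index i in a[0..n-1]. */
void perc_down(int a[], int i, int n)
{
    int tmp = a[i];
    int child;
    for (; 2 * i + 1 < n; i = child) {
        child = 2 * i + 1;
        if (child + 1 < n && a[child + 1] > a[child])
            child++;               /* pick the larger of the two children */
        if (tmp < a[child])
            a[i] = a[child];       /* move the child up into the hole */
        else
            break;
    }
    a[i] = tmp;
}

void build_max_heap(int a[], int n)
{
    for (int i = n / 2; i >= 0; i--)
        perc_down(a, i, n);
}

void heap_sort(int a[], int n)
{
    build_max_heap(a, n);
    for (int i = n - 1; i > 0; i--) {
        int tmp = a[0];            /* move the maximum behind the heap */
        a[0] = a[i];
        a[i] = tmp;
        perc_down(a, 0, i);        /* the heap now occupies a[0..i-1] */
    }
}
```

Calling `build_max_heap` on the input from the question gives 879, 811, 572, 434, 543, 123, 142, 65, 111, 242, 453, 102, which matches the first line of the trace.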

MAW 7.12

a. What is the running time of heapsort for presorted input?

Still $O(N\log N)$. Heapsort uses at least (roughly) $N\log N$ comparisons on any input, so there are no particularly good inputs.

MAW 7.13

Sort 3,1,4,1,5,9,2,6 using mergesort

First the sequence {3,1,4,1} is sorted. To do this, the sequence {3,1} is sorted. This involves sorting {3} and {1}, which are base cases, and merging the result to obtain {1,3}. The sequence {4,1} is likewise sorted into {1,4}. Then these two sequences are merged to obtain {1,1,3,4}. The second half is sorted similarly, eventually obtaining {2,5,6,9}. The merge result is then easily computed as {1,1,2,3,4,5,6,9}.
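The recursion described above can be sketched as follows. This is a plain top-down mergesort with one scratch array; the names `merge_runs`, `msort`, and `merge_sort` are mine, and the error handling (returning -1 when `malloc` fails) is just one reasonable choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merge the sorted runs a[lo..mid] and a[mid+1..hi] using tmp as scratch. */
static void merge_runs(int a[], int tmp[], int lo, int mid, int hi)
{
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];  /* <= keeps the sort stable */
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo + 1) * sizeof(int));
}

static void msort(int a[], int tmp[], int lo, int hi)
{
    if (lo >= hi)
        return;                       /* base case: zero or one element */
    int mid = (lo + hi) / 2;
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid + 1, hi);
    merge_runs(a, tmp, lo, mid, hi);
}

/* Returns 0 on success, -1 if the scratch array cannot be allocated. */
int merge_sort(int a[], int n)
{
    if (n < 2)
        return 0;
    int *tmp = malloc((size_t)n * sizeof(int));
    if (tmp == NULL)
        return -1;
    msort(a, tmp, 0, n - 1);
    free(tmp);
    return 0;
}
```

For the eight-element input, the first split is exactly {3,1,4,1} and {5,9,2,6}, as in the walkthrough.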

MAW 7.15

Determine the running time of mergesort for a. sorted input b. reverse-ordered input c. random input

The running time for mergesort is $O(N \log N)$ regardless of the input pattern.

MAW: Chapter 6 Reflection

2017-04-09T10:45:00+08:00

Reflection

In chapter 6, we learn about the priority queue ADT. We assume each item has a "priority" and the ADT allows us to insert the element into the queue and get the element with "highest" priority from the queue. The model looks like

and the big picture we have studied so far becomes:

In order to support the two operations required by the model, we propose four new implementations. Binary heap is the most commonly-seen one. Insert can be done in $O(\log N)$ in the worst case and constant on average. DeleteMin can be done in $O(\log N)$. However, it has a natural drawback in supporting the merge operation, which is needed when we want to merge two heaps into one. This leads to the leftist heap. Leftist heap supports all three operations (insert, DeleteMin, merge) efficiently but we need to maintain extra information in the node and we need to do an extra test during merge in order to maintain the leftist heap property. To better solve these two little disadvantages, we propose skew heap, which has no restriction on the tree structure at all while still enjoying the efficiency of operations in amortized time. However, there is still room for improvement because both leftist heap and skew heap cannot support constant on average insertion like binary heap does. This is why we propose a new structure called binomial queue.

There are many applications of priority queues:

Operating system task scheduler
Forward network packets in order of urgency
Select most frequent symbols for data compression
Sorting
Implementation for greedy algorithms

Left Out

Some material I left out when I work through this chapter:

6.9.c, 6.10, 6.12, 6.20, 6.21, 6.30, 6.32, 6.33, 6.34, 6.35, 6.36

Reference

MAW Chapter 6
https://courses.cs.washington.edu/courses/cse332/10sp/lectures/lecture4.pdf

Binomial queue

2017-04-08T23:33:00+08:00

This is the summary of binomial queue part in MAW Chapter 6.

Motivation

We want to have a data structure that supports merging, insertion, and deleteMin in $O(\log N)$ time per operation, and at the same time, like binary heap, we want insertion to take constant time on average. The latter part is not possible with skew heap or leftist heap.

The data structure we have is called binomial queue.

Concept

A binomial queue is a collection of heap-ordered trees. Each of the heap-ordered trees is called a binomial tree with the following constraints:

There is at most one binomial tree of every height.
A binomial tree of height 0 is a one-node tree; a binomial tree, $B_k$, of height $k$ is formed by attaching a binomial tree, $B_{k-1}$, to the root of another binomial tree, $B_{k-1}$.

The picture below shows a binomial queue consisting of six elements with two binomial trees $B_1$ and $B_2$:

Properties

A binomial tree $B_k$, consists of a root with children $B_0, B_1, \dots, B_{k-1}$.
Binomial trees of height $k$ have exactly $2^k$ nodes
The number of nodes at depth $d$ is the binomial coefficient ${k \choose d}$.
A priority queue of any size can be represented by a collection of binomial trees. For instance, a priority queue of size 13 could be represented by $B_3, B_2, B_0$ ( $13 = 2^3 + 2^2 + 2^0$ ). Thus, we can write this representation as $1101$, which not only represents $13$ in binary but also represents the fact that $B_3, B_2, B_0$ are present and $B_1$ is not.
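The binary-representation view translates directly into code: bit $k$ of the size tells us whether $B_k$ is present. A small sketch, with a function name and signature of my own:

```c
#include <assert.h>

/* Store in heights[] the heights k of the binomial trees B_k that make up
   a binomial queue holding n elements (at most max entries are written).
   Returns how many heights were stored. */
int binomial_trees_for_size(unsigned n, int heights[], int max)
{
    int count = 0;
    for (int k = 0; n != 0 && count < max; k++, n >>= 1)
        if (n & 1u)                 /* bit k set: B_k is present */
            heights[count++] = k;
    return count;
}
```

For $n = 13$ this yields heights 0, 2, 3, and for the six-element queue mentioned earlier it yields 1, 2.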

Operations

Merge

The merge is performed by essentially adding the two queues together. Let's illustrate through merging two binomial queues $H_1$ and $H_2$ shown below:

If you will, $H_1$ can be represented as $0110_{2}$ and $H_2$ can be represented as $0111_{2}$. Thus, merge is just adding two binary number together, and we have $1101_2$. This implies that our final result contains $B_0, B_2, B_3$. The actual merge step is implied by the binomial tree constraint mentioned above:

A binomial tree of height 0 is a one-node tree; a binomial tree, $B_k$, of height $k$ is formed by attaching a binomial tree, $B_{k-1}$, to the root of another binomial tree, $B_{k-1}$.

Thus merge of the two $B_1$ trees in $H_1$ and $H_2$ looks like:

and the final result of merging looks like:

Insertion

Insertion is just a special case of merging, since we merely create a one-node tree and perform a merge.

DeleteMin

Find the binomial tree with the smallest root. Let this tree be $B_k$, and let the original priority queue be $H$.
Remove the binomial tree $B_k$ from the forest of trees in $H$, forming the new binomial queue $H'$.
Remove the root of $B_k$, creating binomial trees $B_0, B_1, \dots, B_{k-1}$, which collectively form priority queue $H''$.
Merge $H'$ and $H''$.

Suppose we perform a DeleteMin on $H_3$ from above. The minimum root is 12, and we have $H'$ and $H''$ below:

and our final result is ¹:

Runtime analysis

Merge

Since merging two binomial trees takes constant time with almost any reasonable implementation, and there are $O(\log N)$ binomial trees (think of representing the size of priority queue in terms of binary, and we need to do $O(\log N)$ division), the merge takes $O(\log N)$ time.

Insertion

The worst-case time of this operation is $O(\log N)$. However, this actually can be constant on average. For details, see MAW p.205.

DeleteMin

We take $O(\log N)$ time to find the tree containing the minimum element. We take constant time to create the queues $H'$ and $H''$. Merging these two queues takes $O(\log N)$ time and thus, the operation overall takes $O(\log N)$.

Reference

MAW Chapter 6

For actual implementation details, please see MAW p. 208 - 211. ↩

Skew heap

2017-04-05T23:33:00+08:00

This is the summary of skew heap part in MAW Chapter 6.

Motivation

Like the relation between splay trees and AVL trees, we want to have $O(\log N)$ amortized cost per operation. In addition, we don't want to have any auxiliary information stored at the nodes. In other words, we want to trade strict $O(\log N)$ operation for less space we need to use for the data structure. In this case, like splay trees to AVL trees, we have Skew heaps to leftist heaps.

Concept

Skew heaps are binary trees with heap order, but there is no structural constraint on these trees. This means that we don't need the binary tree to be complete (i.e. binary heap) or left heavy (i.e. leftist heap).

In addition, we don't store $Npl$ information in the node.

Properties

A perfectly balanced tree forms if the keys $1$ to $2^k-1$ are inserted in order into an initially empty skew heap.

Operations

Skew heap is extremely similar to leftist heap in terms of the merge operation. There is only one difference: for leftist heap, we check whether the left and right children satisfy the leftist heap property and swap them if they do not. However, for skew heaps, the swap is unconditional. In other words, we always swap the left and right subtrees at each step of merge.
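A minimal recursive sketch of this merge is below. The `SkewNode` struct and the `skew_merge` name are my own assumptions, not the book's code; the point is only that there is no $Npl$ field and no test before the swap.

```c
#include <assert.h>
#include <stddef.h>

struct SkewNode {
    int element;
    struct SkewNode *left, *right;
};

/* Merge two skew heaps (min at the root) and return the new root. */
struct SkewNode *skew_merge(struct SkewNode *h1, struct SkewNode *h2)
{
    if (h1 == NULL) return h2;
    if (h2 == NULL) return h1;
    if (h2->element < h1->element) {   /* make h1 the heap with the smaller root */
        struct SkewNode *t = h1;
        h1 = h2;
        h2 = t;
    }
    /* Merge h2 into h1's right subheap, then swap the children of h1
       unconditionally: the merged result becomes the left child. */
    struct SkewNode *merged = skew_merge(h1->right, h2);
    h1->right = h1->left;
    h1->left = merged;
    return h1;
}
```

Inserting a key is just `skew_merge(heap, single_node)`. Inserting 1 through 7 in order this way produces a perfectly balanced tree, which agrees with the property listed above.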

In the below example, we want to merge two skew heaps $H_1$ and $H_2$:

Then, we get the following result of merging $H_2$ with $H_1$'s right subheap:

and this is the final merge result:

Note

The end result is actually a leftist heap but there is no guarantee that this is always the case. If you take a look, $H_1$ is not a leftist heap.

Runtime analysis

merge, deleteMin, and insert are all running in $O(\log N)$ amortized time.

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 6
CMU lecture slides

Leftist heap

2017-04-04T10:30:00+08:00

This is the summary of leftist heaps part in MAW Chapter 6.

Motivation

Merging two priority queues into one can be a very hard operation to do. For binary heap, this can be done in $O(N)$. However, we want to do better. Leftist heap is a priority queue that supports the merge operation in $O(\log N)$.

Concept

The idea for leftist heap is that we want to make the tree structure as imbalanced as possible to make merge fast. This is achieved by the leftist heap property.

null path length $Npl(X)$ of any node $X$ is the length of the shortest path from $X$ to a node without two children. Thus, the $Npl$ of a node with zero or one child is 0 and $Npl(NULL) = -1$. In addition, the $Npl$ of any node is 1 more than the minimum of the $Npl$ of its children.
leftist heap property is that for every node $X$ in the heap, the $Npl$ of the left child is at least as large as that of the right child.

In fact, the leftist heap property is the leftist property applied to a heap. In other words, if at every node in a tree the $Npl$ of the left child is at least as large as that of the right child, then we call this tree a leftist tree. A leftist heap is simply a leftist tree with keys in heap order.

The number in each node below is the $Npl$ of that node. By the leftist property, only the left tree is leftist.
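The two definitions translate almost word for word into code. Here is a small sketch with a bare `Node` struct of my own (no stored $Npl$ field, so `npl` recomputes it recursively, which is fine for checking but not how a real leftist heap would do it).

```c
#include <assert.h>
#include <stddef.h>

struct Node {
    int element;
    struct Node *left, *right;
};

/* Null path length: Npl(NULL) = -1, otherwise 1 + the minimum Npl of the children. */
int npl(const struct Node *x)
{
    if (x == NULL)
        return -1;
    int l = npl(x->left);
    int r = npl(x->right);
    return 1 + (l < r ? l : r);
}

/* A tree is leftist if Npl(left) >= Npl(right) holds at every node. */
int is_leftist(const struct Node *x)
{
    if (x == NULL)
        return 1;
    return npl(x->left) >= npl(x->right)
        && is_leftist(x->left)
        && is_leftist(x->right);
}
```

A node with only a left child passes the check, while a node with only a right child fails it, since $Npl(NULL) = -1$ is smaller than the $Npl$ of the right child.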

Properties

If rightmost path of leftist tree has $r$ nodes, then the whole tree has at least $2^r-1$ nodes.

The above property leads to: $N \ge 2^r-1$, so $r$ is $O(\log N)$. Since our fundamental operation merge will perform all the work on the right path, we can have an $O(\log N)$ merge operation.

A perfectly balanced tree forms if keys 1 to $2^k-1$ are inserted in order into an initially empty leftist heap.

Operations

`merge(H1, H2)`

As with splay in splay trees, merge is the fundamental operation that is used to implement other operations in leftist heap(i.e., insert, deleteMin).

The key points for the merge operation are:

recursively merge the heap with the larger root with the right subheap of the heap with the smaller root.
We update $Npl$ of the merged root and swap left and right subtrees just below root, if needed, to keep leftist property of merged result.

The following picture shows a good example of merge steps. Note that the $Npl$ of the node in the picture is 1 larger than our definition. The blue curve represents the final swap step.

Another example can be seen from MAW 6.16 in my chapter 6 writing question post.

The actual implementation in C is below, which is copied from maw p.198:

static PriorityQueue Merge1(PriorityQueue H1, PriorityQueue H2);

PriorityQueue
Merge(PriorityQueue H1, PriorityQueue H2)
{
  if (H1 == NULL) return H2;
  if (H2 == NULL) return H1;
  if (H1->Element < H2->Element) return Merge1(H1, H2);
  else return Merge1(H2, H1); // also covers equal keys, so every path returns a value
}

static PriorityQueue
Merge1(PriorityQueue H1, PriorityQueue H2)
{
  if (H1->Left == NULL) H1->Left = H2; // Single node; H1->Right is already NULL
  else
  {
    H1->Right = Merge(H1->Right, H2);
    if(H1->Left->Npl < H1->Right->Npl) swapChildren(H1);
    H1->Npl = H1->Right->Npl + 1;    
  }    
  return H1;
}

insert

We can carry out insertion by making the item to be inserted a one-node heap and perform a merge.

Reference section offers a link to visualize the whole insertion process. The actual implementation is on maw p.199 and copied below:

PriorityQueue
Insert1(ElementType X, PriorityQueue H)
{
  PriorityQueue SingleNode;

  SingleNode = malloc(sizeof(struct TreeNode));
  assert(SingleNode);

  SingleNode->Element = X; SingleNode->Npl = 0;
  SingleNode->Left = SingleNode->Right = NULL;
  H = Merge(SingleNode, H);
  return H;    
}

deleteMin

deleteMin can be done by removing the root and merging the left and right subtrees into a new leftist heap.

The actual implementation is on maw p.200 and copied below:

PriorityQueue
DeleteMin(PriorityQueue H)
{
  PriorityQueue LeftHeap, RightHeap;
  if(IsEmpty(H))
  {
    Error("Priority queue is empty");
    return H;    
  }    
  LeftHeap = H->Left;
  RightHeap = H->Right;
  free(H);
  return Merge(LeftHeap, RightHeap);
}

BuildHeap

As described in MAW 6.22, we can perform BuildHeap in linear time for leftist heaps by considering each element as a one-node leftist heap, placing all these heaps on a queue, and performing the following step: Until only one heap is on the queue, dequeue two heaps, merge them, and enqueue the result.

This algorithm is $O(N)$ in the worst case.

Runtime analysis

merge, deleteMin, and insert are all running in $O(\log N)$.

Reference

MAW Chapter 6
http://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/heaps.pdf
https://www.cs.usfca.edu/~galles/visualization/LeftistHeap.html (good tool to visualize the operations)
http://courses.cs.washington.edu/courses/cse326/08sp/lectures/05-leftist-heaps.pdf

Binary heap

2017-04-02T11:30:00+08:00

This is the summary of binary heap and its generalization d-heap part in MAW Chapter 6.

Motivation

The motivation for priority queue mainly comes from the fact that not all things are equally weighted. I'll summarize the applications of priority queues in my end-chapter summary post.

Concept

A binary heap is a binary tree (NOT a BST) that is:

Complete (structure property):

the tree is completely filled except possibly the bottom level, which is filled from left to right.

satisfies the heap order property:

For every node $X$, the key in the parent of $X$ is smaller than (or equal to) the key in $X$, with the exception of the root (which has no parent). In other words, every node is less than or equal to its children.

This property guarantees that the root node is always the smallest node ¹.

Here are some examples:

Properties

Since complete binary tree of height $h$ has between $2^h$ and $2^{h+1}-1$ nodes, the height of a binary heap is $O(\log N)$.
For binary heaps, BuildHeap does at most $2N-2$ comparisons between elements.

Remarks on implementation

We use array as the actual implementation for the binary heap above. For any element in array position $i$, the left child is in position $2i$, the right child is in the cell after the left child $(2i+1)$, and the parent is in position $\lfloor i/2 \rfloor$. Position 0 is used as a sentinel.

The reason we use the array implementation is that dealing with pointers is comparatively expensive.

Operations

Insert

We add the value as the new node at the end of the array, which is the next available location in the tree. Then, we need to maintain the heap order property by doing a simple insertion sort operation on the path from the new place to the root to find the correct place for it in the tree. This is called percolate up ².

We start at last node and keep comparing with parent $A[i/2]$
If parent larger, copy parent down and go up one level
Done if parent $\le$ item or reached top node $A[1]$
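The three steps above can be sketched as follows. I'm assuming a fixed-capacity array of `int` keys with the sentinel stored in position 0; the `Heap` struct, `CAPACITY`, and the function names are my own.

```c
#include <assert.h>
#include <limits.h>

#define CAPACITY 64

struct Heap {
    int size;
    int a[CAPACITY + 1];   /* a[0] is the sentinel, keys live in a[1..size] */
};

void heap_init(struct Heap *h)
{
    h->size = 0;
    h->a[0] = INT_MIN;     /* sentinel: never larger than a key, so the loop stops at the root */
}

/* Percolate up. Returns 0 on success, -1 if the heap is full. */
int heap_insert(struct Heap *h, int x)
{
    if (h->size == CAPACITY)
        return -1;
    int i;
    for (i = ++h->size; h->a[i / 2] > x; i /= 2)
        h->a[i] = h->a[i / 2];   /* parent is larger: copy it down and go up one level */
    h->a[i] = x;
    return 0;
}
```

Thanks to the sentinel, the loop has a single test: at $i = 1$ the "parent" is `a[0]`, which is never larger than `x`, so the walk ends there without a separate check for the root.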

DeleteMin

We delete and return the value at root node in this operation. Same as the insert, we need to maintain the binary heap properties.

By removing the root node's value, we have a "hole" at the root. We use the last node's value in the tree to fill in the hole. By doing this way, we maintain the structure property. Now, we need to maintain the heap order property. Similar to insertion, we can do a simple insertion sort-like operation to find the correct place for it in the tree. This is called percolate down.

Keep comparing with children $A[2i]$ and $A[2i+1]$
Copy smaller child up and go down one level
Done if both children are $\ge$ item or reached a leaf node
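A sketch of percolate down on a bare 1-based array (position 0 is simply unused here). The function name and the `int *size` parameter are my own choices, and the caller is assumed to check that the heap is not empty.

```c
#include <assert.h>

/* a[1..*size] is a min-heap. Removes and returns the minimum.
   The caller must make sure *size > 0. */
int heap_delete_min(int a[], int *size)
{
    int min = a[1];
    int last = a[(*size)--];          /* value that will fill the hole */
    int i, child;
    for (i = 1; 2 * i <= *size; i = child) {
        child = 2 * i;
        if (child != *size && a[child + 1] < a[child])
            child++;                  /* pick the smaller child */
        if (last > a[child])
            a[i] = a[child];          /* copy the smaller child up, go down one level */
        else
            break;
    }
    a[i] = last;
    return min;
}
```

The `child != *size` test handles the node that has a left child only, which can happen at most once in a complete tree.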

Other heap operations

The following operations (with $P$ argument) require the position of every element in the heap known by some other method in order to make them cheap to perform.

`DecreaseKey`(P, $\delta$, H)

decrease the key value of node at position $P$ by a positive amount $\delta$. We can first subtract $\delta$ from current value at $P$. Then we percolate up to fix. This requires $O(\log N)$ time.

`IncreaseKey`(P, $\delta$, H)

increase the key value of node at position $P$ by a positive amount $\delta$. We can add $\delta$ to current value at $P$ then percolate down to fix. This requires $O(\log N)$ time.

`Delete(P,H)`

removes the node at position $P$ from the heap. We can use DecreaseKey(P, $\infty$, H) followed by DeleteMin. The running time is $O(\log N)$.

`Buildheap(H)`

takes as input $N$ keys and constructs a binary heap from them. This is known as Floyd's algorithm.

Place the $N$ keys into the tree in order. This satisfies the structure property.
Then we do the following to maintain the heap order property.

for( i = N/2; i > 0; i--)
  PercolateDown(i);

This algorithm runs in $O(N)$ time. For a detailed proof, see MAW p.189.
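To make the two-line loop concrete, here is a self-contained sketch on a 1-based `int` array. Unlike the snippet above, my `percolate_down` takes the array and its size explicitly, and `is_min_heap` is a helper I added just to check the result.

```c
#include <assert.h>

/* Sift a[i] down inside the min-heap area a[1..n]. */
void percolate_down(int a[], int n, int i)
{
    int tmp = a[i];
    int child;
    for (; 2 * i <= n; i = child) {
        child = 2 * i;
        if (child != n && a[child + 1] < a[child])
            child++;                 /* smaller of the two children */
        if (a[child] < tmp)
            a[i] = a[child];
        else
            break;
    }
    a[i] = tmp;
}

/* Floyd's BuildHeap: fix the heap order bottom-up, starting from the last parent. */
void build_heap(int a[], int n)
{
    for (int i = n / 2; i > 0; i--)
        percolate_down(a, n, i);
}

/* Returns 1 if every key in a[2..n] is no smaller than its parent. */
int is_min_heap(const int a[], int n)
{
    for (int i = 2; i <= n; i++)
        if (a[i / 2] > a[i])
            return 0;
    return 1;
}
```

Starting at $N/2$ works because every position after it is a leaf, and a leaf is trivially a heap.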

`Merge(H1,H2)`

We merge two heaps $H1$ and $H2$ of size $O(N)$. $H1$ and $H2$ are stored in two arrays. We can do $O(N)$ inserts but this requires $O(N\log N)$ time. We can do better by copying $H2$ to the end of $H1$ and using BuildHeap. This requires $O(N)$ time ³.

Runtime analysis

Space: $O(N)$ (an array of size $N+1$)
Insert: $O(\log N)$
DeleteMin: $O(\log N)$

d-heaps

d-heaps are a generalization of the binary heap: each node has $d$ children instead of 2. Similar to a B-tree, this structure makes the heap shallower and is useful for heaps too big for memory.

Everything is the same as in the binary heap except that it takes $d-1$ comparisons to find the minimum of $d$ children (in a binary heap, we do one comparison). DeleteMin, for example, then takes $O(d\log_d N)$ time. The running times of the other operations adjust similarly.

In terms of array implementation, for entry located in position $i$, the parent is at $\lfloor{\frac{i + (d-2)}{d}}\rfloor$ and the children are at $id-(d-2), \dots, id+1$.
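
The index formulas can be checked with a few lines of Python. The function names below are hypothetical, and positions are 1-indexed as in the rest of this post.

```python
def d_heap_parent(i, d):
    """Parent of position i: floor((i + (d - 2)) / d)."""
    return (i + d - 2) // d


def d_heap_children(i, d):
    """Children of position i: id - (d - 2), ..., id + 1."""
    return list(range(i * d - (d - 2), i * d + 2))


print(d_heap_children(1, 3))   # [2, 3, 4]
print(d_heap_children(2, 3))   # [5, 6, 7]
print(d_heap_parent(7, 3))     # 2
```

With $d = 2$ these reduce to the familiar $2i$, $2i+1$, and $\lfloor i/2 \rfloor$.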

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 6
Lecture slides 4, 8, and 11 from U.Washington

The heap order property is for min heap. If you want to have a max heap, then the heap order property should be that every node is greater than or equal to its children. ↩
Position 0 is used as a sentinel, which holds a value that is smaller than (or equal to) any element in the heap. This is because every iteration of insert needs to test: 1. whether it has reached the top node A[1] 2. whether parent $\le$ item. The first test can be avoided by using the sentinel because it then becomes a special case of the second test. ↩
As shown on MAW p.183, empirical studies show that, on average, percolation terminates early: an average insert moves an element up 1.607 levels. This means that the binary heap supports insertion in constant average time per operation. ↩

MAW Chapter 6: Priority Queues (Heaps) writing questions

2017-03-26T12:01:00+08:00

Solutions

including: MAW 6.6, 6.7, 6.9, 6.13, 6.14, 6.16, 6.17, 6.27, 6.28

MAW 6.6

How many nodes are in the large heap in Figure 6.13?

This question is interesting in the sense that the counting algorithm reflects the underlying implementation structure. Since the binary heap is actually implemented as an array, we start with $i = 1$, positioned at the root. We follow the path toward the last node, doubling $i$ when taking a left child, and doubling $i$ and adding one when taking a right child. Then, we have the following calculation: $2(2(2(2(2(2(2i+1)+1)))))+1 = 225$. The picture below shows the path from the root to the node in the last position:

MAW 6.7

b. show that a heap of eight elements can be constructed in eight comparisons between heap elements.

This question is interesting because it offers another method we can use when building a binary heap with an even number of elements. That is, we build a binomial queue first. Since the binary form of $8$ is $1000_2$, we will have only one binomial tree, $B_3$, inside the binomial queue. Once we construct this binomial tree, we need one last step to tweak it so that it follows the binary heap structure, namely each node has at most two children.

For this question, it takes seven comparisons to construct the binomial queue (with a solo binomial tree) and we get the following:

Then we need to restore the binary heap property because the "a" node has three children. This can be done by the eighth comparison, between "b" and "c". If "c" is less than "b", then "b" is made a child of "c". Otherwise, both "c" and "d" are made children of "b".

MAW 6.9

a. Give an algorithm to find all nodes less than some value, X, in a binary heap. Your algorithm should run in $O(K)$, where $K$ is the number of nodes output.

The big idea is that we perform a preorder traversal of the heap. In detail, we start from the root of the heap. If the value of the root is smaller than $X$, then we output this value and call the procedure recursively, once for its left child and once for its right child. If the value of a node is greater than or equal to $X$, then the procedure halts without printing the value. We don't need to check its children, by the heap order property.

The complexity of this algorithm is $O(N)$ in the worst case, where $N$ is the total number of nodes in the heap. This happens when every node in the heap is smaller than $X$, and the procedure has to visit each node of the heap. In general, every call that halts without output is made from a node that was output, so there are at most $2K$ such calls and the running time is $O(K)$.
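
A minimal sketch of this traversal on the array representation, assuming a 1-indexed list with index 0 unused (the name `nodes_less_than` is mine):

```python
def nodes_less_than(heap, x):
    """Return every key smaller than x, in preorder, from a 1-indexed min-heap."""
    out = []

    def visit(i):
        # Stop at a missing node or at a node whose key is >= x;
        # by heap order, nothing below such a node can be < x.
        if i >= len(heap) or heap[i] >= x:
            return
        out.append(heap[i])
        visit(2 * i)        # left child
        visit(2 * i + 1)    # right child

    visit(1)
    return out


print(nodes_less_than([None, 1, 3, 2, 7, 4, 5], 4))   # [1, 3, 2]
```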

b. Does your algorithm extend to any of the other heap structures discussed in this chapter?

Yes. It works for leftist heap, skew heap, and d-heaps.

MAW 6.13

If a d-heap is stored as an array, for an entry located in position $i$, where are the parents and children?

Let's begin with children. Assume that position $i$ corresponds to the $X$th node of level $l$. Therefore

$$ i = \sum_{j=0}^{l-1}d^j+X $$

$\sum_{j=0}^{l-1}d^j$ is a geometric series whose first term equals $1$, whose common ratio is $d$, and that contains $l$ terms in total. Thus, the result is $\frac{d^l-1}{d-1}$ and thus, we have

$$ i = \frac{d^l-1}{d-1} + X $$

We now calculate the position of $i$'s second last child in terms of $d$, $l$, and $X$. This equals $i$, plus the number of nodes after $i$ on level $l$, plus $d$ times the number of nodes before $i$ on level $l$, plus $d-1$.

$$ \begin{eqnarray*} &=& \frac{d^l-1}{d-1} + X + d^l - X + (X-1)d + d - 1 \\ &=& \frac{d^l-1}{d-1} + d^l-1 + dX \\ &=& \frac{d(d^l-1)}{d-1} + dX \\ &=& d(\frac{d^l-1}{d-1} + X) \\ &=& di \end{eqnarray*} $$

Therefore the second last child of $i$ is in position $id$. It follows that the children of $i$ are in positions $id-(d-2), \dots, id+1$.

A node is a child of $i$ if and only if it is in one of the positions $id-(d-2), \dots, id+1$. So what you want here is a function that will map each of these to $i$, but will not map any other value to $i$. Let $j$ be any of these values. Clearly,

$$ \lfloor{\frac{j + (d-2)}{d}}\rfloor = i $$

But if $j$ is greater than $id+1$ or less than $id - (d-2)$ then

$$ \lfloor{\frac{j + (d-2)}{d}}\rfloor \ne i $$

Thus we have our function which can now be used to work out the position of the parent of $i$.

$$ \lfloor{\frac{i + (d-2)}{d}}\rfloor $$

MAW 6.14

Suppose we need to perform $M$ PercolateUp and $N$ DeleteMin operations on a d-heap that initially has $N$ elements.

a. What is the total running time of all operations in terms of $M$, $N$, and $d$?

A percolateUp operation on a d-heap with $N$ elements takes $O(\log_d N)$ steps. The key is that each time we bubble the hole up, we do only one comparison: we compare the inserted value with the parent of the hole (Figures 6.6 and 6.7 help with understanding).

A deleteMin operation on a d-heap with $N$ elements takes $O(d \log_d N)$ steps. Here, we need to fill the empty hole with the minimum value of its children. Finding that minimum can take about $d$ comparisons at each level (see p.184).

Thus in total this will take $O(M\log_d N + Nd\log_d N)$ steps.

b. If $d = 2$, what is the running time of all heap operations?

Substituting 2 into the formula calculated in part a) gives $O((M+N)\log_2 N)$.

c. If $d = \Theta(N)$, what is the total running time?

If $d = \Theta(N)$ then $d = cN$, where $c$ is a constant value independent of $N$. Substituting $cN$ into the formula calculated in part a) gives:

$$ M\log_{cN} N + NcN \log_{cN}N = O(M + N^2) $$

d. What choice of $d$ minimizes the total running time?

$d = \max(2, M/N)$ (See the related discussion at the end of Section 11.4)

MAW 6.16

Merge the two leftist heaps in Figure 6.58

The book doesn't do a good job of displaying the detailed steps in merging leftist heaps, so I decided to use this problem as an illustration. By the algorithm description on p. 194 and the actual algorithm implementation on p.189, there are two key points in the algorithm:

recursively merge the heap with the larger root with the right subheap of the heap with the smaller root.
We do the swap at the root.

The following shows the steps to get the final answer for this problem:

MAW 6.17

Show the result of inserting keys 1 to 15 in order into an initially empty leftist heap.

Use this wonderful site to see the whole process of insertion.

MAW 6.27

Prove that a binomial tree $B_k$ has binomial trees $B_0, B_1, \dots, B_{k-1}$ as children of the root.

I'll try to use two ways to prove this. Both ways are by induction but one of them is more mathematical formula involved.

Method 1

Clearly the claim is true for $k = 1$. Suppose it is true for all values $i = 1, 2, \dots, k-1$. A $B_k$ tree has $2^k$ nodes. By the induction hypothesis, the node count of $B_{k-1}$ satisfies $2^{k-1} = 1 + 2^0 + \dots + 2^{k-2}$ (the root plus its children $B_0, \dots, B_{k-2}$). Multiplying both sides of the equation by 2, we have $2^k = 2 + 2^1 + \dots + 2^{k-1}$, and since $2 = 1 + 2^0$, this is the same as $2^k = 1 + 2^0 + 2^1 + \dots + 2^{k-1}$. This completes the proof.

Method 2

Again the claim is true for $k = 1$. Suppose it is true for all values $i = 1, 2, \dots, k-1$. A $B_k$ tree is formed by attaching a $B_{k-1}$ tree to the root of a $B_{k-1}$ tree. Thus, by induction, it contains a $B_0$ through $B_{k-2}$ tree, as well as the newly attached $B_{k-1}$ tree, proving the claim.

MAW 6.28

Prove that a binomial tree of height $k$ has ${k \choose d}$ nodes at depth $d$.

Proof is by induction. Clearly the claim is true for $k=1$. Assume true for all values $i=1,2,\dots,k$. A $B_{k+1}$ tree is formed by attaching a $B_k$ tree to the original $B_k$ tree. The original tree has ${k \choose d}$ nodes at depth $d$ by induction hypothesis. The attached tree had $\binom{k}{d-1}$ nodes at depth $d-1$, which are now at depth $d$. Adding these two terms we have

$$ \binom{k+1}{d} = \binom{k}{d} + \binom{k}{d-1} $$
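
As a quick numeric sanity check of the induction step, the sketch below builds the per-depth node counts of $B_k$ using only the attach rule and compares them with the binomial coefficients. The function name `depth_counts` is hypothetical.

```python
from math import comb


def depth_counts(k):
    """Number of nodes at each depth of B_k, derived from the attach rule."""
    counts = [1]                       # B_0: a single node at depth 0
    for _ in range(k):
        # Attaching a copy of the tree under the root pushes every node
        # of the copy one level deeper.
        shifted = [0] + counts
        counts = [a + b for a, b in zip(counts + [0], shifted)]
    return counts


print(depth_counts(3))                                      # [1, 3, 3, 1]
print(depth_counts(5) == [comb(5, d) for d in range(6)])    # True
```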

Hash Table

2017-03-17T15:56:00+08:00

This post summarizes the basic idea about hash table. It is created based on MAW Chapter 5.

Reflection

Motivation

In the previous chapter, we implemented the dictionary (map) ADT using a tree structure. A typical find or insert operation requires $O(\log N)$ time. However, this is not good enough compared with $O(1)$ time. This is the place where a hash table implementation can shine. A hash table is a data structure that is designed for $O(1)$ average-time find, insert, and delete operations. The only downside of hash tables compared with tree structures is the poor support for ordering elements.

General Idea

A hash table is an array of some fixed size. We use a hash function to map each key (e.g. a string with an associated value) into some number in the range 0 to TableSize-1 and place it in the appropriate cell.

Ideally, two distinct keys get different cells. However, this is not possible because there are a finite number of cells and a virtually infinite supply of keys. Thus the key concern for hash table data structure is how to distribute the keys evenly among the cells. This issue is addressed from two ways: 1) pick a good hash function to avoid collision (i.e. two keys hash to the same value) 2) use a good strategy to redistribute keys when collision happens.

Hash function

Hash function is a mapping from the element key (string or number) to an integer (the hash value). The output of the hash function must always be less than the size of array and should be as evenly distributed as possible. One thing to note here is that the pick of hash function has high dependency on the actual content of the key set.

We list some key points from the chapter here:

Choose the table size of the hash table to be prime.

The rationale for this is that real-life data tends to have patterns, and "multiples of 61" are probably less likely than "multiples of 60". In addition, quadratic probing is an efficient collision strategy to use (compared with linear probing and double hashing) and it requires the table size to be prime.

When we deal with string keys, we may use $\big(\sum_{i=0}^{k-1} s_i \cdot 256^i \big) \bmod TableSize$ as our function.

Here we use 256 because char data is 1 byte. Another hash function may be adding up the ASCII values of the characters in the string. However, this doesn't work well because if string keys are short, it will not hash evenly to all of the hash table (see MAW p.151), and different character combinations hash to the same value (e.g. "abc" and "bca" add up to the same value).
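
The two string hash functions can be compared with a small sketch. The names `hash_string` and `additive_hash` and the table size 10007 are my own choices for illustration, and I assume ASCII keys.

```python
def hash_string(key, table_size):
    """(sum of s_i * 256^i) mod TableSize, evaluated with Horner's rule."""
    value = 0
    # Process the highest-power character first so that s_0 ends up with 256^0.
    for byte in reversed(key.encode("ascii")):
        value = (value * 256 + byte) % table_size
    return value


def additive_hash(key, table_size):
    """Sum of the ASCII values; the order of characters is ignored."""
    return sum(key.encode("ascii")) % table_size


print(additive_hash("abc", 10007) == additive_hash("bca", 10007))   # True
print(hash_string("abc", 10007) == hash_string("bca", 10007))       # False
```

Taking the modulus inside the loop keeps the intermediate value small, which matters in languages with fixed-width integers.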

The slides listed in the reference section offer some examples on hash function pick if we know the keys beforehand (i.e. if keys $s$ are the real numbers uniformly distributed over $0 \leq s < 1$).

Collision strategy

A collision occurs when two different keys hash to the same value. By the nature of the dictionary ADT, we cannot store both data records in the same cell of the array. So, we need to come up with a strategy to resolve collisions and try our best to distribute the keys evenly across the table. There are two main strategies discussed in the chapter: separate chaining and open addressing.

Note

The load factor $\lambda$ is defined as the ratio of the number of elements in the hash table to the table size. This concept appears frequently when we analyze hash table collision resolution strategies.

Separate chaining

Note

Sometimes separate chaining is also referred to as "open hashing", which means that none of the objects are stored inside the hash table's internal array. For example, objects are stored in lists of buckets in a hash table implementing separate chaining.

Separate chaining keeps a list (chain or bucket) of all elements that hash to the same value. In other words, each hash table cell holds a pointer to a linked list of records with the same hash value. To insert a key, we compute its hash value and add the key to the linked list of the corresponding cell, so colliding keys simply share a list. When we want to find an item, we compute the hash value, then do a find on the linked list.

Here the worst case happens for the find operation, which can take linear time. However, this happens only in extremely rare cases (bad luck or a bad hash function). Of course, we can build a balanced tree instead of a linked list on each cell to shrink the find time. But the structure overhead and the complexity of insert may make this effort not worth it.

The average length of a chained list is $\lambda$. Thus, the average time for accessing an item is $O(1) + O(\lambda)$. So, with a good hash function, we want $\lambda$ to be smaller than 1 but close to 1. Thus, the general rule for separate chaining hashing is to make the table size about as large as the number of elements expected (let $\lambda \approx 1$).
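
To make the idea concrete, here is a minimal separate chaining sketch. The class name `ChainedHashTable`, the use of Python lists as chains, and Python's built-in `hash` as the hash function are all assumptions for illustration, not MAW's implementation.

```python
class ChainedHashTable:
    """Each cell holds a chain (a Python list) of (key, value) records."""

    def __init__(self, table_size=11):
        self.buckets = [[] for _ in range(table_size)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for idx, (existing, _) in enumerate(bucket):
            if existing == key:            # key already present: overwrite
                bucket[idx] = (key, value)
                return
        bucket.append((key, value))        # a collision just lengthens the chain
        self.count += 1

    def find(self, key):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return None

    def load_factor(self):
        return self.count / len(self.buckets)
```

With 10 keys in a table of size 5, `load_factor()` returns 2.0, so an average chain holds two records.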

Note

Actually, separate chaining doesn't restrict us to use the list to chain the objects together. We can use a tree to organize the elements that have the same hash value.

Open addressing

Note

Open addressing is sometimes referred to as "closed hashing", which means that every object is stored directly at some index in the hash table's internal array; the objects never live outside of the internal array.

One disadvantage of the separate chaining strategy is that we need to build a linked list for each cell, which introduces overhead that can waste space. Another strategy to resolve collisions is to try other empty cells. This is called open addressing. In general, open addressing means resolving collisions by trying a sequence of other positions in the table. Trying the next spot is called probing. More formally, cells $h_0(x), h_1(x), h_2(x), \dots$ are tried in succession until either $x$ is found or we find an empty location ($x$ not present). $h_i(x) = (Hash(x) + F(i)) \bmod TableSize$, with $F(0) = 0$. The function $F$ is the collision resolution strategy.

Various flavors of open addressing differ in which probe sequence they use. This is reflected in $F$. Three types of resolution function are discussed in the book:

Linear probing: $F(i) = i$
Quadratic probing: $F(i) = i^2$
Double hashing: $F(i) = i \cdot Hash_2(x)$

Generally, the load factor should be below $\lambda = 0.5$ for open addressing hashing.
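
The three choices of $F$ can be compared by listing the first few cells each one probes. This is a sketch: `probe_cells` and the helper names are hypothetical, and the step value 3 simply stands in for some $Hash_2(x)$.

```python
def probe_cells(home, table_size, f, count):
    """First `count` cells h_i = (home + F(i)) mod TableSize."""
    return [(home + f(i)) % table_size for i in range(count)]


def linear(i):
    return i


def quadratic(i):
    return i * i


def double_hashing(step):
    """F(i) = i * step, where step plays the role of Hash_2(x)."""
    return lambda i: i * step


print(probe_cells(9, 10, linear, 4))              # [9, 0, 1, 2]
print(probe_cells(9, 10, quadratic, 4))           # [9, 0, 3, 8]
print(probe_cells(9, 10, double_hashing(3), 4))   # [9, 2, 5, 8]
```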

Linear probing

With linear probing, we try the cells sequentially (with wraparound) in search of an empty cell. This strategy has a fundamental problem called primary clustering, which means blocks of occupied cells start forming. Any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster. In other words, primary clustering means elements that hash to different cells probe the same alternative cells.

Quadratic probing

Quadratic probing is a collision resolution method that eliminates the primary clustering problem of linear probing. But it has its own restriction or problem:

If quadratic probing is used and the table size is prime, then a new element can always be inserted if the table is at least half empty. However, insertion is not guaranteed if $\lambda > 0.5$.
Secondary clustering, which means elements that hash to the same position will probe the same alternative cells.

Double hashing

Double hashing $F$ says that we apply a second hash function to x and probe at a distance $hash_2(x), 2hash_2(x), \dots$, and so on.

When $\lambda$ exceeds a certain value, we need to build a bigger hash table, approximately twice the size and with a prime table size, and reinsert every element into it. This is called rehashing.

In addition, when the hash table cannot fit in memory and part of the structure has to be stored on disk, disk I/O becomes the main cost. In this case, we use a different hashing scheme, called extendible hashing. Like the B-tree, this structure is widely applied in the database field.

Reference

MAW Chapter 5
Washington lecture slides: CSE 332 Lecture 10, CSE 373 Lecture 16
SO: Meaning of Open hashing and Closed hashing

MAW Chapter 5: Hashing writing questions

2017-03-16T17:41:00+08:00

Solutions

including: MAW 5.4, 5.5, 5.6, 5.10, 5.11

MAW 5.4

A large number of deletions in a separate chaining hash table can cause the table to be fairly empty, which wastes space. In this case, we can rehash to a table half as large. Assume that we rehash to a larger table when there are twice as many elements as the table size. How empty should the table be before we rehash to a smaller table?

We must be careful not to rehash too often. Let $p$ be the threshold (fraction of table size) at which we rehash to a smaller table. Then, if the new table has size $N$, it contains $2Np$ elements. This table will require rehashing after either $2N-2Np$ insertions or $pN$ deletions. We don't want to rehash after only a few insertions or only a few deletions. A good strategy is to set $2N-2Np$ equal to $pN$, which gives $p = \frac{2}{3}$. For instance, suppose we have a table of size 300. If we rehash at 200 elements, then the new table size is $N = 150$, and we can do either 100 insertions or 100 deletions until a new rehash is required.

If we know that insertions are more frequent than deletions, then we might choose $p$ to be somewhat larger. All in all, we adjust the relation between $2N-2Np$ and $pN$ depending on which operation we favor.

MAW 5.5

An alternative collision resolution strategy is to define a sequence, $F(i) = r_i$, where $r_0 = 0$ and $r_1, r_2, \dots, r_N$ is a random permutation of the first $N$ integers (each integer appears exactly once).

a. Prove that under this strategy, if the table is not full, then the collision can always be resolved.

Since the sequence $F(i)$ is defined as a random permutation of the first $N$ integers, each cell of the table will be probed eventually. If the table is not full, then the collision can always be resolved.

b. Would this strategy be expected to eliminate clustering?

This seems to eliminate primary clustering but not secondary clustering because all elements that hash to some location will try the same collision resolution sequence.

c. If the load factor of the table is $\lambda$, what is the expected time to perform an insert and for a successful search?

The running time is probably similar to quadratic probing. The advantage here is that the insertion can't fail unless the table is full.

MAW 5.6

What are the advantages and disadvantages of the various collision resolution strategies?

Separate chaining hashing requires the use of pointers, which costs some memory, and the standard method of implementing calls on memory allocation routines, which typically are expensive.

Linear probing is easily implemented, but performance degrades severely as the load factor increases because of primary clustering.

Quadratic probing is only slightly more difficult to implement and gives good performance in practice. An insertion can fail if the table is half empty, but this is not likely. Even if it were, such an insertion would be so expensive that it wouldn't matter and would almost certainly point up a weakness in the hash function.

Double hashing eliminates primary and secondary clustering but the computation of a second hash function can be costly.

MAW 5.10

Describe a procedure that avoids initializing a hash table (at the expense of memory).

To each hash table slot, we can add an extra field that we'll call WhereOnStack, and we can keep an extra stack. When an insertion is first performed into a slot, we push the address (or number) of the slot onto the stack and set the WhereOnStack field to point to the top of the stack. When we access a hash table slot, we check that WhereOnStack points to a valid part of the stack and that the entry in the (middle of the) stack that is pointed to by the WhereOnStack field has that hash table slot as an address.
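
The trick can be sketched as follows. Python cannot leave memory uninitialized, so this sketch fills the arrays with random garbage to simulate it. The class name `UninitializedTable` and its method names are hypothetical.

```python
import random


class UninitializedTable:
    """A slot counts as used only if the stack vouches for it."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        # Simulated garbage: in C these arrays would just be allocated, not cleared.
        self.value = [rng.randrange(1000) for _ in range(size)]
        self.where_on_stack = [rng.randrange(size) for _ in range(size)]
        self.stack = [rng.randrange(size) for _ in range(size)]
        self.top = 0                      # the only field that must be initialized

    def in_use(self, slot):
        w = self.where_on_stack[slot]
        # Valid only if w points into the live part of the stack
        # and that stack entry points back at this slot.
        return w < self.top and self.stack[w] == slot

    def store(self, slot, v):
        if not self.in_use(slot):         # first insertion into this slot
            self.stack[self.top] = slot
            self.where_on_stack[slot] = self.top
            self.top += 1
        self.value[slot] = v

    def load(self, slot):
        return self.value[slot] if self.in_use(slot) else None
```

A garbage `where_on_stack` value can accidentally point below `top`, but then the stack entry it points at names a different slot, so the check still fails as it should.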

MAW 5.11

Suppose we want to find the first occurrence of a string $P_1P_2\dots P_k$ in a long input string $A_1A_2\dots A_N$. We can solve this problem by hashing the pattern string, obtaining a hash value $H_p$, and comparing this value with the hash value formed from $A_1A_2\dots A_k$, $A_2A_3\dots A_{k+1}$, $A_3A_4\dots A_{k+2}$, and so on until $A_{N-k+1}A_{N-k+2}\dots A_N$. If we have a match of hash values, we compare the strings character by character to verify the match. We return the position (in A) if the strings actually do match, and we continue in the unlikely event that the match is false.

a. Show that if the hash value of $A_iA_{i+1}\dots A_{i+k-1}$ is known, then the hash value of $A_{i+1}A_{i+2}\dots A_{i+k}$ can be computed in constant time.

As suggested by MAW p.151, we use $\sum_{j=0}^{KeySize-1} Key[KeySize-j-1]\cdot 32^j$ as the function to compute the hash value of a given string, so the first character gets the highest power of 32. By this definition, the hash value of $A_iA_{i+1}\dots A_{i+k-1}$ is

$$ H_1 = (32^{k-1}A_i + 32^{k-2}A_{i+1} + \dots + 32^0A_{i+k-1}) \bmod TableSize $$

Similarly, the hash value of $A_{i+1}A_{i+2}\dots A_{i+k}$ is

$$ H_2 = (32^{k-1}A_{i+1} + \dots + 32^1A_{i+k-1} + 32^0A_{i+k}) \bmod TableSize $$

If we take a look at the relationship between these two equations, we can see

$$ H_2 = (32(H_1 - 32^{k-1}A_i) + A_{i+k}) \bmod TableSize $$

This can be computed in constant time if $H_1$ is known, provided that $32^{k-1} \bmod TableSize$ is computed once in advance.

b. Show that the running time is $O(k+N)$ plus the time spent refuting false matches.

The pattern's hash value $H_p$ is computed in $O(k)$ time. The hash value of $A_1A_2\dots A_k$ is also computed in $O(k)$ time. Then, starting with $A_2A_3\dots A_{k+1}$ and continuing until $A_{N-k+1}A_{N-k+2}\dots A_N$, each hash value is computed in $O(1)$ time by a) above. Since there are $N-k$ such substrings, each costing $O(1)$, the total running time is $O(k) + O(k) + O(N-k) = O(N+k)$. Of course, there is also the time we spend on refuting false matches.
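
Putting parts a) and b) together gives the following sketch of the search. The function name `find_first`, the default modulus, and the 0-based return value are my own choices; the update uses the rolling form of the hash in which the first character of the window carries the power $32^{k-1}$.

```python
def find_first(pattern, text, table_size=1_000_003):
    """Return the 0-based index of the first occurrence of pattern, or -1."""
    k, n = len(pattern), len(text)
    if k == 0 or k > n:
        return -1
    high = pow(32, k - 1, table_size)     # 32^(k-1) mod TableSize, computed once
    hp = ht = 0
    for i in range(k):                    # O(k): hash the pattern and the first window
        hp = (32 * hp + ord(pattern[i])) % table_size
        ht = (32 * ht + ord(text[i])) % table_size
    for i in range(n - k + 1):
        # On a hash match, verify character by character to refute false matches.
        if hp == ht and text[i:i + k] == pattern:
            return i
        if i + k < n:
            # O(1) update: drop text[i], shift by one power of 32, append text[i + k].
            ht = (32 * (ht - ord(text[i]) * high) + ord(text[i + k])) % table_size
    return -1


print(find_first("abra", "cadabra"))   # 3
```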

MAW: Chapter 4 Reflection

2017-03-12T10:45:00+08:00

Reflection

In Chapter 4, we learn about the tree data structure. From the ADT perspective, the ADT we learn about is called the Dictionary (a.k.a. Map) ADT, which is represented by a set of (key, value) pairs and supports insert, find, and delete operations. The core idea for this ADT, as you can imagine, is to store information according to some key and be able to retrieve it efficiently. Now our big picture becomes:

Implementing dictionary ADT with tree structure brings following advantages:

Trees reflect structural relationships in the data
Trees are used to represent hierarchies
Trees provide efficient insertion and searching
Trees are very flexible, allowing us to move subtrees around with minimum effort

Many different tree structures get presented in this chapter. Most fundamental ones are:

Some variations to the above structures are Order Statistic Tree (MAW 4.44), Threaded Tree (MAW 4.45), k-d Tree (MAW 4.46), and B*-tree (MAW 4.38).

Left Out

Some material I left out when I worked through this chapter:

4.11 (cursor implementation of trees is not top priority)
4.33, 4.34 (Form a nice tree drawing project; don't have time to do them now)
4.12, 4.13, 4.14, 4.26.b, 4.37.b;c, 4.38 (interesting but may require too much effort)

Reference

https://courses.cs.washington.edu/courses/cse332/10sp/lectures/lecture6.pdf

B-Tree

2017-03-11T21:32:00+08:00

This is the summary of B tree part in MAW Chapter 4.

Motivation

So far we have assumed that an entire data structure can be stored in main memory. However, this is not always true in reality: if we have more data than can fit in main memory, we have to store the data structure on disk. In this case, the number of disk accesses dominates the running time because disk accesses are very expensive compared with the processor speed. So when we design a data structure, we have to try our best to minimize the number of disk accesses. In the context of tree structures, the B-tree is a structure that tries to read as much information as possible in every disk access.

Concept

The B-tree is by far the most chaotically defined structure in the sense that different people have slightly different definitions. I'll follow MAW's definition and point out how it differs from the other commonly seen definition.

A B-tree of order $M$ is a tree with the following structural properties:

The leaves contain all the actual data, which are either the keys themselves or pointers to records containing the keys.
The root is either a leaf (when tree has $\le L$ items) or has between $2$ and $M$ children.
All nonleaf nodes (except the root) have between $\lceil{M/2}\rceil$ and $M$ children (at least half full).
All leaves are the same depth and have between $\lceil{L/2}\rceil$ and $L$ sorted data items, for some L (at least half full).
The nonleaf nodes have room for up to $M-1$ keys to guide the searching; key $i$ represents the smallest key in subtree $i+1$.

What MAW calls a B-tree is essentially what is commonly known as a $B^+$-tree. Technically, a real B-tree stores the actual data in both leaves and internal nodes, which is not the case in our definition. In some $B^+$-tree definitions, the leaves are connected as a linked list, so once we have traversed down to a leaf we don't have to restart the search from the root if we want a record on a leaf close to the one we are currently at.

Examples

Some typical examples in B-tree are of order 4 (known as 2-3-4 tree) and 3 (known as 2-3 tree).

The B-tree is a structure that is widely used in database systems. The following picture shows a more realistic B-tree example. Suppose we have a large customer table with gigabytes of data, and an index is created on the phone number column of the customer table to speed up search. Phone numbers are stored in sorted order together with information (page and slot) on where the rest of the customer information can be found in the customer table.

In this example, once we continue down the tree and locate the phone number we are searching for, we use the RID to fetch the rest of the customer record from the table. In this case, we use 4 page accesses to get the full customer record from the table.

Operations

Find:

For find, we basically do binary search on each node to decide what subtree we should go to search.
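
A minimal sketch of find under MAW's convention (key $i$ is the smallest key in subtree $i+1$) is shown below. The classes `Leaf` and `Internal` are hypothetical in-memory stand-ins for disk blocks, and Python's `bisect` module does the binary search within each node.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, items):
        self.items = items          # sorted data items


class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # up to M - 1 sorted keys
        self.children = children    # len(keys) + 1 subtrees


def find(node, key):
    """Return True if key is stored in one of the leaves under node."""
    while isinstance(node, Internal):
        # A key equal to keys[i] lives in subtree i + 1, hence bisect_right.
        node = node.children[bisect_right(node.keys, key)]
    pos = bisect_left(node.items, key)
    return pos < len(node.items) and node.items[pos] == key


tree = Internal([10, 20], [Leaf([1, 5]), Leaf([10, 12]), Leaf([20, 30])])
print(find(tree, 12), find(tree, 11))   # True False
```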

Insertion:

The major unique manipulation is that we may need to split an overfull leaf and recursively split parent nodes (by pushing a key up to the parent), possibly all the way to the root. Other strategies for handling overfull nodes also exist, but this one is the classic textbook approach.

Deletion:

I have to say deletion is the messiest operation, and people tend to talk about it conceptually instead of getting their hands dirty and actually implementing it. The Stanford paper listed in the reference section does give a concrete implementation, but it implements a real-life B-tree deletion, which may be complicated for the learner.

The strategy we use for deletion is that we want to find the key to be deleted and remove it first. Then, if the leaf underflows, we borrow from a neighbor. If leaf underflows and can't borrow, we merge nodes and delete parent.

Runtime analysis

We first show that height $H$ is logarithmic in number of data items $N$. Let $M \ge 2$. Because all nodes are at least half full (except root may have only 2 children) and all leaves are at the same level, the minimum number of data items $N$ for a height $H$ tree is

$$ N \ge \underbrace{2(\lceil{M/2}\rceil)^{H-1}}_\textrm{min number of leaves}\times\underbrace{\lceil{L/2}\rceil}_\textrm{min data per leaf} $$

Then for a B-tree of order $M$

Each internal node has up to $M-1$ keys to search
Each internal node has between $\lceil{M/2}\rceil$ and $M$ children
Depth of B-tree storing $N$ items is $O(\log_{\lceil{M/2}\rceil}N)$

Find then takes $O(\log M)$ to do binary search on each node to determine which branch to take. Then the total time is $O(depth \times \log M)$ = $O(\log N)$ because $M$ is small compared to $N$.

Insertion and deletion are also $O(\log N)$ because each split or merge takes a constant amount of work (for fixed $M$ and $L$) and the number of times it is performed is proportional to the height of the tree.

Pros & Cons of data structure

What makes B-trees so disk friendly?

Many keys stored in one node
- All brought into memory in one disk access
- Pick $M$ wisely. See MAW's Java version (3rd edition) p.149 for an example.
- Makes the binary search over $M-1$ keys totally worth it.
Internal nodes contain only keys
- All find wants only one data item. So only bring one leaf of data items into memory.
- Data-item size doesn't affect what $M$ is. We determine $M$ only by how many keys can be packed into a disk block (node). Thus, the key size, the children pointer size, and the block size are the only factors here.

Links to resources

Here are some of the resources I found helpful while preparing this article:

MAW Chapter 4
Lecture slides 15, 8, and 9 from U.Washington
Yale pinewiki on B-tree
Stanford B-tree implementation paper

Splay Tree

2017-02-13T01:12:00+08:00

This is the summary of Splay tree part in MAW Chapter 4.

Motivation

An ordinary BST has no balance condition, and thus it is possible for a whole sequence of $O(N)$-time accesses to take place. This cumulative running time then becomes noticeable. So, we introduce a balance condition on the BST to improve our running time. One way to do so is to enforce a balance condition whenever nodes change (i.e. insert or delete), as AVL trees do. However, that data structure is hard to code and rebalancing costs time. In addition, sometimes it is OK for us to have an $O(N)$ operation as long as it occurs infrequently. In other words, a search data structure with $O(N)$ worst-case time, but a guarantee of at most $O(M \log N)$ for any $M$ consecutive operations, is good enough. The splay tree meets our needs. It is a data structure that lies right in between the BST (no balance condition) and the AVL tree (very strict balance condition).

Concept

A splay tree is a type of balanced binary search tree. Structurally, it is identical to an ordinary binary search tree; the only difference is in the algorithms for finding, inserting, and deleting entries. Specifically, a splay tree is a self-adjusting tree, whose structure gets reorganized over time as nodes are accessed (i.e., insert, delete, or find). This makes sense because if we don't restructure the tree each time we access a node, then the time bound for a sequence of $M$ accesses could be as bad as $O(M N)$ instead of $O(M \log N)$.

The way we restructure the tree is called splaying. Chapter 4 talks about bottom-up splaying algorithms. Every time a node is accessed in a splay tree, it is moved to the root of the tree. The amortized cost of the operation is $O(\log N)$. As shown by MAW, simply moving the element to the root by rotating it up the tree does not have this property. However, the following three restructuring rules do guarantee this amortized bound.

                             y             x
   Zig (terminal case):     /     ====>     \               (same as AVL single rotation)
                           x                 y

                    z              z
                   /              /             x
   Zig-zag:       y     ====>    x   ====>     / \          (same as AVL double rotation)
                   \            /             y   z
                    x          y

                    z                         x
                   /            y              \
   Zig-zig:       y     ====>  / \   ====>      y
                 /            x   z              \
                x                                 z

In the above pictures, x is the node that was accessed (and that will eventually be at the root of the tree). By looking at the local structure of the tree defined by x, x's parent, and x's grandparent, we decide which of the three rules to follow. We continue to apply the rules until x is at the root of the tree.

Splay(1)

              7                    7                   7               1
             /                    /                   /                 \
            6                    6                   6                   6
           /                    /                   /                   / \
          5                    5                   1                   4   7 
         /      =======>      /        =======>     \    ======>      / \
        4                    4                       4               2   5
       /                    /                       / \               \
      3                    1                       2   5               3
     /                      \                       \
    2                        2                       3
   /                          \
  1                            3
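The three rules can be sketched in C as follows. This is only a minimal sketch under my own assumptions, not MAW's code: the node layout with a parent pointer (SplayNode), and the helper names rotateUp and splay, are hypothetical names introduced for illustration. Running splay on node 1 of the left chain 7, 6, ..., 1 goes through the same intermediate trees as the Splay(1) picture above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout: the parent pointer makes bottom-up splaying easy. */
typedef struct SplayNode {
    int key;
    struct SplayNode *left, *right, *parent;
} SplayNode;

/* Rotate x above its parent p, preserving the BST ordering. */
static void rotateUp(SplayNode *x)
{
    SplayNode *p = x->parent;
    SplayNode *g;
    assert(p != NULL);
    g = p->parent;

    if (p->left == x) {              /* x is a left child: rotate right */
        p->left = x->right;
        if (x->right != NULL) x->right->parent = p;
        x->right = p;
    } else {                         /* x is a right child: rotate left */
        p->right = x->left;
        if (x->left != NULL) x->left->parent = p;
        x->left = p;
    }
    p->parent = x;
    x->parent = g;
    if (g != NULL) {                 /* reattach x where p used to hang */
        if (g->left == p) g->left = x;
        else g->right = x;
    }
}

/* Splay x to the root using the zig, zig-zag, and zig-zig rules.
   Returns x, which is the new root. */
SplayNode *splay(SplayNode *x)
{
    while (x->parent != NULL) {
        SplayNode *p = x->parent;
        SplayNode *g = p->parent;
        if (g == NULL) {
            rotateUp(x);             /* zig: parent is the root */
        } else if ((g->left == p) == (p->left == x)) {
            rotateUp(p);             /* zig-zig: rotate the parent first */
            rotateUp(x);
        } else {
            rotateUp(x);             /* zig-zag: rotate x twice */
            rotateUp(x);
        }
    }
    return x;
}
```

Note the order in the zig-zig case: the parent is rotated above the grandparent before x is rotated, which is what distinguishes splaying from simply rotating x up one level at a time.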

Implementation details

The splay tree is a really flexible data structure in the sense that there are many ways to implement the "splay" rules and the corresponding tree operations while the amortized bound still holds. See the wiki for a complete summary. Here, I only mention some of my findings. Please note that, depending on how you implement your operations, the resulting tree may differ (i.e. different insert algorithms may result in different tree structures, but their roots will be the same).

Insertion (bottom-up)

There are two ways to do this. The first way is to use "split" to split the tree around the insertion value. By the property of splaying, we end up with either the insertion value itself (if it is already in the tree) or the parent of the insertion point at the root. We can then make the insertion value the new root and rearrange the original tree to form the new tree. For example, if the insertion value elem is smaller than the root, we do

if (elem < T->Element)
{
  newT->Right = T;
  newT->Left = T->Left; // we can do this b/c the result of splaying is the parent node of where we should insert.
  T->Left = NULL;
}

The code is straightforward, but newT->Left = T->Left is worth a remark. Here, when T is the parent of the insertion point, we know that the values in T's left subtree are smaller than elem as well. This is because if there were any node $x$ greater than elem but smaller than T's value, then the search for elem would have ended at $x$, and $x$ rather than T would have been splayed to the root, which is a contradiction.

The second way is to do a BST insertion first and then splay the inserted node, which is really straightforward and easy to code.

Deletion (bottom-up)

Correspondingly, there are two ways to do deletion as well. The first way is to splay the to-be-deleted node, which puts it at the root. Deleting it leaves two subtrees $T_L$ and $T_R$. If we then access the largest element in $T_L$, this element is rotated to the root of $T_L$, and $T_L$ now has a root with no right child. We finish the deletion by making $T_R$ the right child of that root.

The second way is to do a BST deletion first, and then splay the parent of the deletion point to the root. It is quite similar to BST deletion; see the implementation here.

Properties

Any $M$ consecutive tree operations starting from an empty tree take at most $O(M \log N)$ time.
Even though the worst-case running time is $O(N)$ for a single operation, the amortized cost per operation is $O(\log N)$.
If all nodes in a splay tree are accessed in sequential order, the resulting tree consists of a chain of left children. (MAW 4.26.a)
If all nodes in a splay tree are accessed in sequential order, then the total access time is $O(N)$, regardless of the initial tree.

Pros & Cons of data structure

A splay tree is simpler and easier to program than a strictly balanced tree such as an AVL tree. Because of its simplicity, splay tree insertion and deletion are typically faster in practice. The find operation can be faster or slower, depending on circumstances. Splay trees are designed to give especially fast access to nodes that have been accessed recently, so they really excel in applications where a small fraction of the nodes are the targets of most of the find operations.

Todo

This post does not cover every part of the splay tree. It will be updated once I finish studying the following two parts:

MAW Chapter 11 gives a thorough analysis of the $O(\log N)$ amortized cost of the splay tree operations.
MAW Chapter 12 gives implementation details on top-down splay trees.

Reference

http://www.cs.cmu.edu/afs/cs/academic/class/15859-f05/www/documents/splay-trees.txt
http://web.stanford.edu/class/archive/cs/cs166/cs166.1146/lectures/08/Small08.pdf (proof of properties in a concise structure)
http://digital.cs.usu.edu/~allan/DS/Notes/Ch22.pdf
https://courses.cs.washington.edu/courses/cse373/06sp/handouts/lecture14.pdf
https://people.eecs.berkeley.edu/~jrs/61b/lec/36

AVL Tree

2017-02-05T10:43:00+08:00

This is a summary of the AVL tree part of MAW Chapter 4.

Motivation

All BST operations take $O(H)$ time, where $H$ is the height of the tree. In the worst case, when the tree degenerates into a chain, $H = N - 1$, where $N$ is the number of nodes. Thus, the problem with a BST is that it can become unbalanced, which leads to the worst running time. The AVL tree is one of the schemes for keeping a BST balanced (others include red-black trees, splay trees, and B-trees). Its approach is to maintain a pretty good balance while allowing the tree to be a little out of balance.

Concept

An AVL tree is a binary search tree whose height is guaranteed to be $O(\log N)$. It is identical to a BST, except that for every node in the tree, the heights of the left and right subtrees can differ by at most 1. (The height of an empty tree is defined to be -1.)

For simplicity, we usually omit the actual data part of the node. The following picture demonstrates what an AVL tree looks like:

Insertion

AVL tree insertion is based upon BST insertion with two additional treatments:

Update the height information of the nodes on the path from the root to the insertion point.
Restore the AVL property through rotations when we find a node on that path that violates it.

There are four cases inside the insertion (see MAW p.111) and we handle "outside" cases (i.e. left-left or right-right) and "inside" cases (i.e. left-right or right-left) using single rotation and double rotation respectively.

To remember single rotation, you can pick a case, say left-left and remember its picture. (right-right is a mirror case)

In the picture, we need to rebalance the tree at $k_2$. The picture also shows how we can implement the singleRotateWithLeft routine. (Here, "left" means the imbalance is caused by an insertion into the left subtree of the unbalanced node.)

Similarly, to remember double rotation, we pick a case, say right-left and remember its picture. (left-right is a mirror case)

In the picture, we need to rebalance the tree at $k_3$. The picture also shows how we can implement the doubleRotateWithRight routine. As you can see from the picture, a "double rotation" is essentially two "single rotations": first rotate $k_2$ and $k_1$, then $k_2$ and $k_3$.
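For reference, here is a minimal sketch of the four rotation routines in C. It follows the routine names used in this post, but the AvlNode layout, the stored Height field, and the Max helper are my assumptions, and the local variable names K1, K2, K3 are only illustrative, so they may not line up with the labels in the pictures.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed node layout: each node caches its own height. */
typedef struct AvlNode {
    int Element;
    int Height;
    struct AvlNode *Left, *Right;
} AvlNode;

/* The height of an empty tree is -1. */
static int Height(const AvlNode *T) { return T == NULL ? -1 : T->Height; }
static int Max(int a, int b) { return a > b ? a : b; }

/* Left-left case: the left child K1 becomes the new subtree root. */
AvlNode *singleRotateWithLeft(AvlNode *K2)
{
    AvlNode *K1 = K2->Left;
    assert(K1 != NULL);
    K2->Left = K1->Right;
    K1->Right = K2;
    K2->Height = Max(Height(K2->Left), Height(K2->Right)) + 1;
    K1->Height = Max(Height(K1->Left), K2->Height) + 1;
    return K1;
}

/* Right-right case: mirror image of singleRotateWithLeft. */
AvlNode *singleRotateWithRight(AvlNode *K1)
{
    AvlNode *K2 = K1->Right;
    assert(K2 != NULL);
    K1->Right = K2->Left;
    K2->Left = K1;
    K1->Height = Max(Height(K1->Left), Height(K1->Right)) + 1;
    K2->Height = Max(K1->Height, Height(K2->Right)) + 1;
    return K2;
}

/* Left-right case: two single rotations, lower one first. */
AvlNode *doubleRotateWithLeft(AvlNode *K3)
{
    K3->Left = singleRotateWithRight(K3->Left);
    return singleRotateWithLeft(K3);
}

/* Right-left case: mirror image of doubleRotateWithLeft. */
AvlNode *doubleRotateWithRight(AvlNode *K1)
{
    K1->Right = singleRotateWithLeft(K1->Right);
    return singleRotateWithRight(K1);
}
```

Each routine returns the new root of the rotated subtree, so the caller is expected to write T = singleRotateWithLeft(T) to reattach it.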

Identifying which rotation to use strictly from these four cases works but is time-consuming. Here is how I think about the issue from a practical point of view: compare the insertion value with the value of the unbalanced node and of its child to determine which rotation to use. Here are the detailed steps:

Compare the insertion value with the value of the unbalanced node: if it is smaller ($<$), the insertion went into the left subtree of the unbalanced node; otherwise ($>$), it went into the right subtree.
If the insertion value is also $<$ (or $>$) the value of the left (or right) child of the unbalanced node, we do a single rotation.
If the insertion value lies in between the value of the unbalanced node and that of its child, we do a double rotation.

See MAW p.117 insertion 13 for an example.
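The comparison steps above can be sketched as a small C function. This is only an illustration of the rule of thumb, assuming distinct keys; the Rotation enum and the function name chooseRotation are hypothetical names I made up here, not part of MAW's code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct AvlNode {
    int Element;
    struct AvlNode *Left, *Right;
} AvlNode;

/* Hypothetical labels for the four rotation cases. */
typedef enum { SINGLE_LEFT, DOUBLE_LEFT, SINGLE_RIGHT, DOUBLE_RIGHT } Rotation;

/* T is the unbalanced node and X is the value that was just inserted
   somewhere below it. Keys are assumed to be distinct. */
Rotation chooseRotation(const AvlNode *T, int X)
{
    if (X < T->Element) {
        /* The insertion went into the left subtree of T. */
        assert(T->Left != NULL);
        return X < T->Left->Element ? SINGLE_LEFT    /* left-left  */
                                    : DOUBLE_LEFT;   /* left-right */
    }
    /* The insertion went into the right subtree of T. */
    assert(T->Right != NULL);
    return X > T->Right->Element ? SINGLE_RIGHT      /* right-right */
                                 : DOUBLE_RIGHT;     /* right-left  */
}
```

This shortcut only works for insertion, because there the inserted value tells us which grandchild subtree grew.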

Deletion

Deletion, in fact, is extremely similar to the insertion in the sense that:

It is based upon BST deletion with extra treatment of the node height information and the AVL property.
There are the same rotation cases to consider when we make sure the AVL property is satisfied for the nodes.

There is only one difference from insertion: there can be more than one unbalanced node to take care of as we walk from the deletion point to the root.

There is a nuance in how we decide which rotation to use. In insertion, we think in terms of the insertion point. For instance, if the inserted value is smaller than the value of the unbalanced node and smaller than the value of its left child, we know we are in the left-left case, which calls for a single rotation. In deletion, however, we think in terms of subtree heights: a left-left insertion is equivalent to making the left subtree of the unbalanced node's left child taller than that child's right subtree. In deletion, there is no specific element value we can use to decide which rotation to apply (as we could in insertion). Thus, we have the following code in our deletion routine:

if (Height(T->Left) - Height(T->Right) == 2)
{
  if (Height(T->Left->Left) - Height(T->Left->Right) >= 0)
    T = singleRotateWithLeft(T);  // Left-left case
  else
    T = doubleRotateWithLeft(T);  // Left-right case
}
else if (Height(T->Right) - Height(T->Left) == 2)
{
  if (Height(T->Right->Right) - Height(T->Right->Left) >= 0)
    T = singleRotateWithRight(T); // Right-right case
  else
    T = doubleRotateWithRight(T); // Right-left case
}

Of course, the following simple example may help you understand the code chunk above:

Properties

For every node of the AVL tree, $|Height(left child) - Height(right child)| \le 1$.
Running time for "find", "insert", "delete" is guaranteed to be $O(\log N)$.
The height $H$ of an AVL tree is at most roughly $1.44\log _2 N$. (see this post for the proof)
For an insertion, at most one rotation (single or double) is needed (this is used in the non-recursive insertion routine).

Pros & Cons of data structure

Pros:

Search is $O(\log N)$ since AVL trees are always balanced.
Insertion and deletions are also $O(\log N)$.
The height balancing adds no more than a constant factor to the speed of insertion.

Cons:

Difficult to program & debug; more space for height information.
Asymptotically faster but rebalancing costs time.
Most large searches are done in database systems on disk and use other structures (e.g. B-trees).
May be OK to have $O(N)$ for a single operation if the total run time for many consecutive operations is fast (e.g. splay trees). In other words, if amortized logarithmic time is enough, use splay trees.

Reference

MAW Chapter 4
https://courses.cs.washington.edu/courses/cse373/06sp/handouts/lecture12.pdf
https://courses.cs.washington.edu/courses/cse332/10sp/lectures/lecture8.pdf
http://www.geeksforgeeks.org/avl-tree-set-2-deletion/
http://www.mathcs.emory.edu/~cheung/Courses/323/Syllabus/Trees/AVL-delete.html

Solving recurrence relations in a nutshell

2017-02-02T01:05:00+08:00

Being able to solve recurrence relations is a very important skill when we study data structures and algorithms. It is an ability I used to be familiar with from the combinatorics class I took as an undergraduate. However, at that time, I didn't realize how important this skill is from a computer science point of view. But, thanks to MAW, I do now.

This post is a study summary on this very important subject. Its aim is to help me, at least, quickly solve the standard types of recurrence relations in the future. The content closely follows Chapter 7 "Recurrence Relations and Generating Functions" of "Introductory Combinatorics", which is the textbook I used.

Note

This note is practice-oriented. I will skip the proofs of the theorems whenever possible. If you are interested in the proof side of the universe, please read the book.

Linear homogeneous recurrence relation with constant coefficients
- Method 1: Characteristic equation
  - distinct roots (theorem 7.4.1)
  - roots with multiplicities (theorem 7.4.2)
- Method 2: Generating function
Linear nonhomogeneous recurrence relation with constant coefficients
- Method 1: Characteristic equation
- Method 2: Generating functions

Linear homogeneous recurrence relation with constant coefficients

Definition: Let $h_0, h_1, h_2, \dots, h_n, \dots$ be a sequence of numbers. This sequence is said to satisfy a linear recurrence relation of order $k$, provided that there exist quantities $a_1, a_2, \dots, a_k,$ with $a_k \ne 0$, and a quantity $b_n$ (each of these quantities $a_1,a_2,\dots,a_k,b_n$ may depend on $n$) such that

$$ \begin{equation} h_n = a_1h_{n-1} + a_2h_{n-2} + \dots + a_kh_{n-k} + b_n, (n\ge k) \label{eq:1} \end{equation} $$

Example: The Fibonacci sequence $f_0, f_1, f_2, \dots, f_n, \dots$ satisfies the linear recurrence relation

$$ \begin{equation} f_n = f_{n-1} + f_{n-2} (n\ge 2) \end{equation} $$

of order 2 with $a_1 = 1, a_2 = 1,$ and $b_n = 0$.

Definition: The linear recurrence relation \ref{eq:1} is called homogeneous provided that $b_n$ is zero and is said to have constant coefficients provided that $a_1, a_2, \dots, a_k$ are constants.

Method 1: Characteristic equation

Theorem 7.4.1: Let $q$ be a nonzero number. Then $h_n = q^n$ is a solution of the linear homogeneous recurrence relation

$$ \begin{equation} h_n - a_1h_{n-1}-a_2h_{n-2}- \dots - a_kh_{n-k} = 0, (a_k \ne 0, n \ge k) \label{eq:2} \end{equation} $$

with constant coefficients iff $q$ is a root of the polynomial equation (called characteristic equation)

$$ \begin{equation} x^k-a_1x^{k-1}-a_2x^{k-2}- \dots - a_k = 0 \label{eq:3} \end{equation} $$

If the polynomial equation has $k$ distinct roots $q_1, q_2, \dots, q_k$, then

$$ \begin{equation} h_n = c_1q_1^{n}+c_2q_2^n+ \dots + c_kq_k^n \label{eq:4} \end{equation} $$

is the general solution of \ref{eq:2} in the following sense: No matter what initial values for $h_0, h_1, \dots, h_{k-1}$ are given, there are constants $c_1, c_2, \dots, c_k$ so that \ref{eq:4} is the unique sequence which satisfies both the recurrence relation \ref{eq:2} and the initial values.

Example: Solve the Fibonacci recurrence relation

$$ \begin{equation*} f_n = f_{n-1} + f_{n-2} (n\ge 2) \end{equation*} $$

subject to the initial values $f_0 = 0$ and $f_1 = 1$.

We rewrite the recurrence relation as $f_n - f_{n-1} - f_{n-2} = 0$. The characteristic equation of this recurrence relation is

$$ \begin{equation*} x^2 - x - 1 = 0 \end{equation*} $$

and its two roots are $\frac{1+\sqrt 5}{2}$, $\frac{1-\sqrt 5}{2}$, and by theorem 7.4.1,

$$ \begin{equation*} f_n = c_1 \Big(\frac{1+\sqrt 5}{2}\Big)^n + c_2 \Big(\frac{1-\sqrt 5}{2}\Big)^n \end{equation*} $$

is the general solution. We now want constants $c_1$, and $c_2$ so that

$$ \begin{equation*} \begin{cases} c_1 \Big(\frac{1+\sqrt 5}{2}\Big) + c_2 \Big(\frac{1-\sqrt 5}{2}\Big) = 1 & \qquad (n=1)\\ c_1 + c_2 = 0 & \qquad (n=0)\\ \end{cases} \end{equation*} $$

and we have $c_1 = \frac{1}{\sqrt 5}$, and $c_2 = -\frac{1}{\sqrt 5}$. Thus,

$$ \begin{equation*} f_n = \frac{1}{\sqrt 5}\Big(\frac{1+\sqrt 5}{2}\Big)^n - \frac{1}{\sqrt 5}\Big(\frac{1-\sqrt 5}{2}\Big)^n \end{equation*} $$

is the solution of the Fibonacci recurrence relation.
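As a quick sanity check of this closed form (my own check, not part of the book's example), write $\varphi = \frac{1+\sqrt 5}{2}$ and $\psi = \frac{1-\sqrt 5}{2}$, two names introduced here only for brevity, so that $\varphi + \psi = 1$ and $\varphi - \psi = \sqrt 5$. Then, for $n = 2$,

$$ \begin{eqnarray*} f_2 &=& \frac{1}{\sqrt 5}\big(\varphi^2 - \psi^2\big) = \frac{1}{\sqrt 5}(\varphi - \psi)(\varphi + \psi) \\ &=& \frac{1}{\sqrt 5} \cdot \sqrt 5 \cdot 1 = 1 \end{eqnarray*} $$

which agrees with $f_2 = f_1 + f_0 = 1 + 0 = 1$ from the recurrence.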

Note

As you might notice, theorem 7.4.1 explicitly requires the roots of the characteristic equation to be distinct. However, that's not always the case, and when some roots repeat, theorem 7.4.1 does not apply (see the book for an example). That's why we need theorem 7.4.2.

Theorem 7.4.2: Let $q_1, q_2, \dots, q_t$ be the distinct roots of the following characteristic equation of the linear homogeneous recurrence relation with constant coefficients:

$$ \begin{equation} h_n = a_1h_{n-1}+a_2h_{n-2}+ \dots + a_kh_{n-k}, a_k \ne 0, \qquad (n \ge k) \label{eq:5} \end{equation} $$

If $q_i$ is an $s_i$-fold root of the characteristic equation of \ref{eq:5}, the part of the general solution of this recurrence relation corresponding to $q_i$ is

$$ \begin{equation*} H_{n}^{(i)} = c_1q_i^n + c_2nq_i^n + \dots + c_{s_i}n^{s_i-1}q_i^n \end{equation*} $$

The general solution of the recurrence relation is

$$ \begin{equation*} h_n = H_n^{(1)} + H_n^{(2)} + \dots + H_n^{(t)} \end{equation*} $$

Example: Solve the recurrence relation

$$ \begin{equation*} h_n = -h_{n-1} + 3h_{n-2}+5h_{n-3}+2h_{n-4} \qquad (n \ge 4) \end{equation*} $$

subject to the initial values $h_0=1$, $h_1 = 0$, $h_2 = 1$, and $h_3 = 2$.

The characteristic equation of this recurrence relation is $x^4 + x^3 -3x^2 - 5x - 2 = 0$, which factors as $(x+1)^3(x-2) = 0$ and thus has roots $-1$, $-1$, $-1$, $2$. Thus, the part of the general solution corresponding to the root $-1$ is

$$ \begin{equation*} H_n^{(1)} = c_1(-1)^n + c_2n(-1)^n + c_3n^2(-1)^n \end{equation*} $$

while the part of a general solution corresponding to the root $2$ is $H_n^{(2)} = c_42^n$. The general solution is

$$ \begin{equation*} h_n = H_n^{(1)} + H_n^{(2)} = c_1(-1)^n + c_2n(-1)^n + c_3n^2(-1)^n + c_42^n \end{equation*} $$

Then we can use the initial values to determine $c_1$, $c_2$, $c_3$, $c_4$: we get $c_1 = \frac{7}{9}$, $c_2 = -\frac{3}{9}$, $c_3 = 0$, and $c_4 = \frac{2}{9}$, so $h_n = \frac{7}{9} (-1)^n - \frac{3}{9}n(-1)^n + \frac{2}{9}2^n$.
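A closed form like this is easy to check numerically. Below is a small C sketch (the function names recurrenceValue and nineTimesClosedForm are hypothetical, made up for this check) that computes $h_n$ directly from the recurrence and compares it with the closed form. To stay in integer arithmetic, the closed form is multiplied by 9.

```c
#include <assert.h>

/* h_n computed directly from
   h_n = -h_{n-1} + 3h_{n-2} + 5h_{n-3} + 2h_{n-4},
   with h_0 = 1, h_1 = 0, h_2 = 1, h_3 = 2. */
long long recurrenceValue(int n)
{
    long long h[4] = {1, 0, 1, 2};   /* sliding window h_{i-4}..h_{i-1} */
    int i;
    assert(n >= 0);
    if (n < 4) return h[n];
    for (i = 4; i <= n; i++) {
        long long next = -h[3] + 3 * h[2] + 5 * h[1] + 2 * h[0];
        h[0] = h[1];
        h[1] = h[2];
        h[2] = h[3];
        h[3] = next;
    }
    return h[3];
}

/* 9 * h_n according to the closed form
   h_n = (7/9)(-1)^n - (3/9) n (-1)^n + (2/9) 2^n. */
long long nineTimesClosedForm(int n)
{
    long long sign, pow2;
    assert(n >= 0 && n < 40);        /* keep 2^n well inside long long */
    sign = (n % 2 == 0) ? 1 : -1;
    pow2 = 1LL << n;
    return 7 * sign - 3 * n * sign + 2 * pow2;
}
```

Comparing 9 * recurrenceValue(n) with nineTimesClosedForm(n) for a range of small n is a cheap way to catch a wrong root or a wrong constant.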

Method 2: Generating function

Definition: Let $h_0, h_1, h_2, \dots, h_n, \dots$ be an infinite sequence of numbers. Its generating function is defined to be the infinite series

$$ \begin{equation} g(x) = h_0 + h_1x + h_2x^2 + \dots + h_nx^n + \cdots \end{equation} $$

The coefficient of $x^n$ in $g(x)$ is $h_n$, the general term we are looking for. As you can see, a generating function has the form of a Taylor series (power series expansion) of an infinitely differentiable function. If we can find the function (i.e. $g(x)$) and its Taylor series, then the coefficients of the Taylor series give the solution to the problem.

Let's illustrate this method using an example.

Example: Solve the recurrence relation

$$ \begin{equation*} h_n = 5h_{n-1} - 6h_{n-2} \qquad (n \ge 2) \end{equation*} $$

subject to the initial values $h_0 = 1$ and $h_1 = -2$.

We first rewrite the recurrence relation as $h_n -5h_{n-1} + 6h_{n-2} = 0 \quad (n \ge 2)$. Let $g(x) = h_0 + h_1x + h_2x^2 + \dots + h_nx^n + \cdots$ be the generating function for the sequence $h_0, h_1, \dots, h_n, \dots$. We then form the following system of equations, where the multipliers $1$, $-5x$, and $6x^2$ are chosen to match the coefficients of the rewritten recurrence relation.

$$ \begin{eqnarray*} g(x) &=& h_0 + h_1x + h_2x^2 + \dots + h_nx^n + \cdots \\ -5xg(x) &=& -5h_0x - 5h_1x^2 - \dots - 5h_{n-1}x^n - \cdots \\ 6x^2g(x) &=& 6h_0x^2 + \dots + 6h_{n-2}x^n + \cdots \end{eqnarray*} $$

If you look vertically at the coefficients of the $x^n$ term in these three equations, you can see that they match our recurrence relation exactly. Now, adding these three equations together, we obtain

$$ \begin{equation*} (1-5x+6x^2)g(x) = h_0 + (h_1-5h_0)x + (h_2 - 5h_1 + 6h_0)x^2 + \dots + (h_n - 5h_{n-1} + 6h_{n-2})x^n + \cdots . \end{equation*} $$

Since $h_n - 5h_{n-1} + 6h_{n-2} = 0 \quad (n \ge 2)$, and by our initial conditions, we have

$$ \begin{equation*} (1-5x+6x^2)g(x) = h_0 + (h_1 - 5h_0)x = 1 -7x \end{equation*} $$

Thus,

$$ \begin{equation*} g(x) = \frac{1-7x}{1-5x+6x^2} \end{equation*} $$

Now, we need to expand $g(x)$ in order to get the coefficient of $x^n$, which is $h_n$. Since $1-5x+6x^2 = (1-2x)(1-3x)$, we can write

$$ \begin{equation*} \frac{1-7x}{1-5x+6x^2} = \frac{c_1}{1-2x} + \frac{c_2}{1-3x} \end{equation*} $$

for some constants $c_1$ and $c_2$. We can determine $c_1$ and $c_2$ by multiplying both sides of this equation by $1-5x+6x^2$ to get

$$ \begin{equation*} 1 - 7x = (c_1 + c_2) + (-3c_1 -2c_2)x \end{equation*} $$

We can get $c_1 = 5$ and $c_2 = -4$. Since

$$ \begin{equation*} \frac{1}{(1-rx)^n} = \sum_{k=0}^\infty\dbinom{n+k-1}{k}r^kx^k \qquad \Big(|x| < \frac{1}{|r|}\Big) \end{equation*} $$

we have

$$ \begin{equation*} \frac{1}{1-2x} = 1 + 2x + 2^2x^2 + \dots + 2^nx^n + \cdots \end{equation*} $$

$$ \begin{equation*} \frac{1}{1-3x} = 1 + 3x + 3^2x^2 + \dots + 3^nx^n + \cdots \end{equation*} $$

$$ \begin{eqnarray*} g(x) &=& 5(1 + 2x + 2^2x^2 + \dots + 2^nx^n + \cdots) -4(1 + 3x + 3^2x^2 + \dots + 3^nx^n + \cdots) \\ &=& 1 + (-2)x + (-16)x^2 + \dots + (5\times2^n - 4\times3^n)x^n + \cdots \end{eqnarray*} $$

Thus, $h_n = 5\times2^n - 4\times3^n$.
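The expansion of $g(x)$ can also be checked mechanically. Here is a minimal C sketch, under my own naming (seriesQuotient, closedForm, and TERMS are hypothetical names), that computes the first few power-series coefficients of a rational function $P(x)/Q(x)$ with $Q(0) = 1$ and compares them with $5\times2^n - 4\times3^n$.

```c
#include <assert.h>

#define TERMS 12   /* how many coefficients to compute */

/* Writes the first TERMS power-series coefficients of P(x)/Q(x) into out.
   P has np coefficients, Q has nq coefficients (lowest degree first),
   and Q[0] must be 1. */
void seriesQuotient(const long long *P, int np,
                    const long long *Q, int nq,
                    long long out[TERMS])
{
    int n, j;
    assert(np >= 1 && nq >= 1 && Q[0] == 1);
    for (n = 0; n < TERMS; n++) {
        long long c = (n < np) ? P[n] : 0;
        /* Matching x^n in (sum out_i x^i) * Q(x) = P(x) gives
           out_n = P_n - sum_{j >= 1} Q_j * out_{n-j}. */
        for (j = 1; j < nq && j <= n; j++) {
            c -= Q[j] * out[n - j];
        }
        out[n] = c;
    }
}

/* The closed form derived above: 5 * 2^n - 4 * 3^n. */
long long closedForm(int n)
{
    long long p2 = 1, p3 = 1;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        p2 *= 2;
        p3 *= 3;
    }
    return 5 * p2 - 4 * p3;
}
```

For this example one would call seriesQuotient with P = {1, -7} and Q = {1, -5, 6} and compare each out[n] with closedForm(n).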

Linear nonhomogeneous recurrence relation with constant coefficients

Nonhomogeneous means that $b_n$ in \ref{eq:1} is no longer identically zero.

Method 1: Characteristic equation

Steps:

1) Find the general solution of the homogeneous relation.

2) Find a particular solution of the nonhomogeneous relation.

If $b_n$ is a polynomial of degree $k$ in $n$, then look for a particular solution $h_n$ that is also a polynomial of degree $k$ in $n$. Thus, try
- $h_n = r$ (a constant) if $b_n = d$ (a constant)
- $h_n = rn + s$ if $b_n = dn + e$
- $h_n = rn^2 + sn + t$ if $b_n = dn^2 + en + f$
If $b_n$ is an exponential, then look for a particular solution that is also an exponential. Thus, try $h_n = pd^n$ if $b_n = d^n$ or $h_n = pnd^n$ if the first try doesn't work.

3) Combine the general solution and the particular solution so that the combined solution satisfies the initial conditions.

Example: Solve

$$ \begin{eqnarray*} h_n &=& 3h_{n-1} - 4n, \qquad (n \ge 1) \\ h_0 &=& 2 \end{eqnarray*} $$

We first consider the corresponding homogeneous recurrence relation $h_n = 3h_{n-1}$. Its characteristic equation is $x - 3 = 0$, and thus we have the general solution $h_n = c3^n, \quad (n \ge 1)$.

Now we seek a particular solution of the nonhomogeneous recurrence relation $h_n = 3h_{n-1}-4n, \quad (n \ge 1)$. We try to find a solution of the form $h_n = rn + s$ for some constants $r$ and $s$. We plug our guess into the recurrence relation and get

$$ \begin{equation*} rn + s = 3(r(n-1)+s) - 4n = (3r-4)n + (-3r+3s) \end{equation*} $$

Comparing coefficients gives $r = 3r - 4$ and $s = -3r + 3s$. Thus, $r = 2$ and $s = 3$, and $h_n = 2n + 3$ is a particular solution. Now, we combine the general solution of the homogeneous relation with the particular solution of the nonhomogeneous relation to obtain

$$ \begin{equation*} h_n = c3^n + 2n + 3 \end{equation*} $$

Now, let's use the initial condition $h_0 = 2$ to solve for $c$: we have $c + 3 = 2$, so $c = -1$. So, $h_n = -3^n + 2n + 3$.
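A quick check of this result against the recurrence itself (my own check), for the first two values. In each row the left side comes from $h_n = 3h_{n-1} - 4n$ and the right side from the closed form:

$$ \begin{eqnarray*} h_1 &=& 3h_0 - 4 \cdot 1 = 2, \qquad -3^1 + 2 \cdot 1 + 3 = 2 \\ h_2 &=& 3h_1 - 4 \cdot 2 = -2, \qquad -3^2 + 2 \cdot 2 + 3 = -2 \end{eqnarray*} $$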

Note

As you can see, solving recurrence relations using the characteristic equation has a strong connection with solving differential equations (both homogeneous and nonhomogeneous).

Method 2: Generating function

Using the "generating function" method to solve a nonhomogeneous recurrence relation is no different from using it to solve a homogeneous one. That's actually a beauty of this method: nothing needs to be tweaked for it to work in a different situation.

Note

Certainly, not all recurrence relations that appear in computer science can be easily solved by the methods described in this post. For instance, in the Josephus problem, the recurrence relation may depend on whether $n$ is odd or even, and these methods may not apply nicely. This suggests another technique for solving recurrence relations: guess the solution and prove it by induction. Also, in the book, solving $h_n = h_{n-1} + n^3$ on p. 250 is not standard either.

Binary Tree & Binary Search Tree

2017-01-29T13:20:00+08:00

This is a summary of the binary tree and binary search tree part of MAW Chapter 4.

Concept
Classification of binary trees
Properties
Selected Proofs
Reference

Concept

A binary tree is a tree in which no node can have more than two children.
An important application of binary trees is binary search tree: for every node, $X$, in the tree, the values of all the keys in its left subtree are smaller than the key value in $X$, and the values of all the keys in its right subtree are larger than the key value in $X$ (BST-property).

Classification of binary trees

A full binary tree (proper binary tree or 2-tree) is a binary tree in which each node has exactly zero or two children.

Note

A full node in a binary tree is a node that has exactly two non-null children.

A complete binary tree is a binary tree, which is completely filled, with the possible exception of the bottom level, which is filled from left to right.

A perfect binary tree: a binary tree in which all internal nodes have exactly two children and all leaves are at the same level. It has the following property: each level has exactly twice as many nodes as the previous level (since each internal node has exactly two children).

A balanced binary tree: a binary tree in which the left and right subtrees of every node differ in height by no more than 1. Yes, this is the AVL tree definition.

Properties

The average depth of binary tree is $O(\sqrt{n})$
The height of a balanced binary tree is $O(\log n)$
A complete binary tree on $n$ nodes has height $\lfloor \log n \rfloor$
In a binary tree of $N$ nodes, there are $N+1$ NULL pointers representing children (MAW 4.4)
The maximum number of nodes in a binary tree of height $H$ is $2^{H+1}-1$ (MAW 4.5)
- A perfect binary tree of height $H$ contains exactly $2^{H+1}-1$ nodes, of which $2^H$ are leaves.
The number of full nodes plus one is equal to the number of leaves in a nonempty binary tree
The average depth of a node in a binary search tree constructed from random data is $O(\log n)$
The average height of a random binary search tree is $O(\log n)$ (i.e., $E[h] = O(\log n)$) (MAW 4.14)
All the basic operations find, findMin, findMax, insert, and delete take $O(H)$ time, where $H$ is the height of the tree.
- worst case: height $H = n - 1$
- best case: height $H = \lfloor \log n \rfloor$, where the tree is a complete binary tree.
Randomly built binary search trees:
- The average height is much closer to the best case.
- Little is known about the average height when both insertion and deletion are used.
- characteristics
  - Keys are inserted in random order into an initially empty tree.
  - Each of the $n!$ permutations of the input keys is equally likely.

Selected Proofs

Let's prove "The number of full nodes plus one is equal to the number of leaves in a nonempty binary tree" using induction:

Let $n$ be the number of full nodes and $m$ be the number of leaves in a nonempty binary tree. We claim that $n + 1 = m$.
- Base case: $n = 0$. Since the binary tree is nonempty, we have a degenerate binary tree (a chain) with $m = 1$.
- Inductive step: Suppose the claim holds for $n$ (i.e., $n + 1 = m$). We want to show that the claim also holds for $n+1$. There are two ways to create one extra full node:
  - We add two leaves below an existing leaf, which turns that leaf into a full node. In this case, we have $m - 1 + 2 = m+1$ leaves. Thus, we have $m+1 - (n+1) = n+1+1 - (n+1) = 1$. The claim holds.
  - We add one leaf below a node that has exactly one child, which turns that node into a full node. In this case, we have $m + 1$ leaves. Thus, we have $m+1 - (n+1) = 1$. The claim holds.
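The claim is also easy to test on concrete trees. The following C sketch counts full nodes and leaves recursively; the TreeNode layout and the function names are assumptions of mine for illustration, not code from MAW.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TreeNode {
    int Element;
    struct TreeNode *Left, *Right;
} TreeNode;

/* A leaf is a node with no children. */
int countLeaves(const TreeNode *T)
{
    if (T == NULL) return 0;
    if (T->Left == NULL && T->Right == NULL) return 1;
    return countLeaves(T->Left) + countLeaves(T->Right);
}

/* A full node is a node with exactly two non-NULL children. */
int countFullNodes(const TreeNode *T)
{
    int self;
    if (T == NULL) return 0;
    self = (T->Left != NULL && T->Right != NULL) ? 1 : 0;
    return self + countFullNodes(T->Left) + countFullNodes(T->Right);
}

/* Returns 1 if (full nodes + 1) == leaves. The claim is only stated
   for nonempty trees, so an empty tree is rejected. */
int fullPlusOneEqualsLeaves(const TreeNode *T)
{
    assert(T != NULL);
    return countFullNodes(T) + 1 == countLeaves(T);
}
```

For example, a root whose left child has two leaf children and whose right child has one leaf child has 2 full nodes and 3 leaves.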

Reference

MAW Chapter 4: Tree writing questions

2017-01-26T17:41:00+08:00

There are a lot of writing questions in Chapter 4. Some of them offer great insights into the general techniques for solving algorithmic proof questions, so I decided to record them in this single post. Of course, this post will be continually updated as I work through the chapter.

Insights

The recursive tree definition is a natural fit for induction (e.g., MAW 4.5, 4.6, 4.7).
Usually there are two ways to prove a statement about trees: one is by induction and the other is from basic tree properties (e.g., MAW 4.4, 4.6).
Combinatorics (relating to binomials) and probability theory (the discrete part) are important to look at (e.g., MAW 4.14).
We can usually study some specific examples and try to generalize them into an induction proof. In addition, always remember that we want to convert the problem for $n+1$ into the same problem for $n$, so that the inductive hypothesis applies. (MAW 4.17)

Solutions

including: MAW 4.4, 4.5, 4.6, 4.7, 4.14, 4.15, 4.16, 4.17, 4.23, 4.24, 4.25, 4.26.a, 4.43

MAW 4.4

Show that in a binary tree of $N$ nodes, there are $N + 1$ NULL pointers representing children.

Proof: For a binary tree with $N$ nodes, there are two types of edges (pointers):

edges that don't exist (NULL pointers).
edges that exist to connect nodes (not NULL pointers).

Let's first calculate the total number of pointers, regardless of whether a pointer is NULL or not. Since each node has $2$ outgoing pointers, there are $2N$ pointers in total. Next, we need to calculate the number of edges that actually exist. Each edge connects some node to its parent, and every node except the root has exactly one parent. In other words, each node, except the root node, has one incoming pointer from its parent. So, there are $N-1$ existing edges. Thus the remaining $2N - (N-1) = N+1$ pointers do not correspond to any edge; that is, we have $N+1$ NULL pointers.
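The counting argument can be double-checked with a few lines of C. This is a minimal sketch with hypothetical names (TreeNode, countNodes, countNullPointers): it counts the nodes and the NULL child pointers of a nonempty tree separately, so the two numbers can be compared.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TreeNode {
    int Element;
    struct TreeNode *Left, *Right;
} TreeNode;

/* Number of nodes N in the tree. */
int countNodes(const TreeNode *T)
{
    if (T == NULL) return 0;
    return 1 + countNodes(T->Left) + countNodes(T->Right);
}

/* Number of child pointers that are NULL in a nonempty tree. */
int countNullPointers(const TreeNode *T)
{
    int left, right;
    assert(T != NULL);
    left  = (T->Left  == NULL) ? 1 : countNullPointers(T->Left);
    right = (T->Right == NULL) ? 1 : countNullPointers(T->Right);
    return left + right;
}
```

For any nonempty tree T, countNullPointers(T) should equal countNodes(T) + 1.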

MAW 4.5

Show that the maximum number of nodes in a binary tree of height $H$ is $2^{H+1}-1$.

Proof: Let's prove this by induction.

Base case: $H = 0$. A binary tree of height $0$ has only one node, the root. $2^{H+1}-1$ equals one for $H = 0$. Therefore the claim is true for $H = 0$.

Inductive hypothesis: Assume that the maximum number of nodes in a binary tree of height $H$ is $2^{H+1}-1$ for $H = 0, 1, ..., k$. Consider a tree $T$ of height $k+1$. The root of $T$ has a left subtree and a right subtree, each of which has height at most $k$. By the inductive hypothesis, each can have at most $2^{k+1}-1$ nodes. Adding the root node gives the maximum number of nodes in a binary tree of height $k+1$,

$$ \begin{equation} 2(2^{k+1} - 1) + 1 = 2^{(k+1)+1} - 1 \end{equation} $$

Remarks:

The maximum is achieved when the tree is a perfect binary tree of height $h$:

$$ \begin{equation} n = \sum_{i=0}^{h} 2^i = 2^{h+1} - 1 \quad \text{where $n$ is the number of nodes} \end{equation} $$

MAW 4.6

A full node is a node with two children. Prove that the number of full nodes plus one is equal to the number of leaves in a nonempty binary tree.

Let's use two methods to prove this question.

Method 1:

Proof: Let's use the following notation for our proof:

$$ \begin{eqnarray*} N & = & \text{number of nodes in a nonempty binary tree} \\ F & = & \text{number of full nodes} \\ H & = & \text{number of nodes with one child} \\ L & = & \text{number of leaves} \end{eqnarray*} $$

Then, we have $N = F + H + L$. We can get another equation by counting edges: $N - 1 = 2F + H$. Here $N-1$ is the number of edges in an $N$-node binary tree, and $2F + H$ counts the same edges through each node's children. Combining these two equations, we have:

$$ \begin{eqnarray*} 2F + H + 1 & = & F + H + L \\ F + 1 & = & L \end{eqnarray*} $$

Method 2:

Proof: Let's prove by induction. If there are $N$ full nodes in a non-empty binary tree then there are $N+1$ leaves.

Base case: $N = 0$. This is true when the tree has one node, because the root is then a leaf.

Inductive hypothesis: Suppose the theorem holds for $N = 0, 1, ..., k$. Then we want to show that if there are $k+1$ full nodes in a non-empty binary tree then there are $k+2$ leaves. Pick a leaf node and keep removing its parent recursively (i.e., remove its parent and then the parent's parent and so on) until a full node is reached. That is, you are traversing from a leaf along the path towards the root, while removing the nodes along the path before a full node is reached. This full node becomes a non-full node because one of its children is removed. At this point the tree has one less leaf and one less full node.

Therefore, the tree has $k$ full nodes after the nodes are removed. By the inductive hypothesis there are $k+1$ leaves. Add all the nodes that were removed back into the tree the same way to create the original tree. We are adding one full node and one leaf node. Therefore, we have $k+1$ full nodes with $k+2$ leaves.
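The identity $F + 1 = L$ is easy to sanity-check on a concrete tree. The following is a minimal sketch, not part of the book's solution: the `Node` type and the function names are hypothetical, and the six-node example tree (root a with children b and c; b has children d and e; c has only a right child f) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration; only the shape matters here. */
struct Node {
  struct Node *left;
  struct Node *right;
};

/* Number of nodes with exactly two children. */
int count_full(const struct Node *t)
{
  if (t == NULL)
    return 0;
  return (t->left != NULL && t->right != NULL)
       + count_full(t->left) + count_full(t->right);
}

/* Number of nodes with no children. */
int count_leaves(const struct Node *t)
{
  if (t == NULL)
    return 0;
  if (t->left == NULL && t->right == NULL)
    return 1;
  return count_leaves(t->left) + count_leaves(t->right);
}

/* Builds the six-node example and returns L - F, which should be 1. */
int leaves_minus_full_example(void)
{
  struct Node d = {NULL, NULL}, e = {NULL, NULL}, f = {NULL, NULL};
  struct Node b = {&d, &e};
  struct Node c = {NULL, &f};   /* c has one child, so it is not full */
  struct Node a = {&b, &c};
  assert(count_full(&a) == 2);   /* a and b */
  assert(count_leaves(&a) == 3); /* d, e, f */
  return count_leaves(&a) - count_full(&a);
}
```

Calling `leaves_minus_full_example()` returns 1 for this tree; swapping in any other nonempty tree should give the same difference.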

MAW 4.7

Suppose a binary tree has leaves $l_{1}, l_{2}, ..., l_{M}$ at depths $d_{1}, d_{2}, ...,d_{M}$, respectively. Prove that $\sum_{i=1}^M 2^{-d_{i}} \leq 1$ and determine when the equality is true.

Proof: Let's prove this by induction.

Base case: a tree with one node, so $M = 1$: the root is a leaf with depth zero. Then the sum is one, and the claim holds.

Inductive hypothesis: Suppose the theorem is true for all trees with at most $k$ nodes. Consider any tree with $k+1$ nodes. Such a tree consists of the root, an $i$-node left subtree, and a $(k-i)$-node right subtree (an empty subtree contributes nothing to the sum). By the inductive hypothesis, the sum for the left subtree leaves is at most one with respect to the left tree root. Because all leaves are one deeper with respect to the original tree than with respect to the subtree, the sum is at most $1/2$ with respect to the root. Similar logic implies that the sum for leaves in the right subtree is at most $1/2$, proving the theorem.

Equality holds if and only if every internal node is a full node. In other words, no node has exactly one child. Suppose there is a node with one child, and the equality still holds. Each time, we remove two nodes to create a new tree that has a node with no child. This new tree has the same property as the previous one, and by the statement we proved above, it should have the same sum as the old one, which is one. Eventually, we are left with two nodes, one of which is the root. Now we calculate the sum, which gives $1/2$. This contradicts the equality.
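To see the inequality and the equality condition in action, here is a small sketch that computes $\sum 2^{-d_i}$ for two made-up trees. The `Node` type, function names, and both example trees are assumptions for illustration only. Powers of two are exactly representable in a `double`, so the comparisons below are exact.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration. */
struct Node {
  struct Node *left;
  struct Node *right;
};

/* Sum of 2^(-depth) over all leaves below t, where weight = 2^(-depth of t). */
double kraft_sum(const struct Node *t, double weight)
{
  assert(weight > 0.0);
  if (t == NULL)
    return 0.0;
  if (t->left == NULL && t->right == NULL)
    return weight;
  return kraft_sum(t->left, weight / 2) + kraft_sum(t->right, weight / 2);
}

/* Every internal node is full: leaves at depths 2, 2, 1. */
double kraft_full_tree(void)
{
  struct Node d = {NULL, NULL}, e = {NULL, NULL}, c = {NULL, NULL};
  struct Node b = {&d, &e};
  struct Node a = {&b, &c};
  return kraft_sum(&a, 1.0);
}

/* Node c has a single child: leaves at depths 2, 2, 2. */
double kraft_one_child_tree(void)
{
  struct Node d = {NULL, NULL}, e = {NULL, NULL}, f = {NULL, NULL};
  struct Node b = {&d, &e};
  struct Node c = {NULL, &f};
  struct Node a = {&b, &c};
  return kraft_sum(&a, 1.0);
}
```

The first tree gives $1/4 + 1/4 + 1/2 = 1$, while the second gives $3/4 < 1$ because of the one-child node.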

Note

This result is known as the Kraft–McMillan inequality, one of the fundamental theorems in information theory. I find this YouTube playlist about information theory a really good intro to the field because it doesn't make the material look daunting and super technical, which some lecture notes manage to do.

MAW 4.14

Prove that the depth of a random binary search tree (depth of the deepest node) is $O(\log N)$, on average.

This question can be restated as follows: suppose that we insert $n$ distinct elements into an initially empty tree. Assuming that the $n!$ permutations are equally likely to occur, show that the average height of the tree is $O(\log n)$.

Before we dive into the proof, let's think about how we can construct a random binary search tree. We construct a tree $T$ by inserting $n$ distinct elements, in random order, into an initially empty tree. Here the actual values of the elements do not matter. What matters is the position of the inserted element among the $n$ elements. Thus, we construct a random binary search tree as follows:

An element $i$ from the $n$ elements is selected uniformly at random and is inserted into the empty tree. Then all the other elements are inserted. Here all the elements greater than $i$ go into the right subtree of $i$ and all the elements smaller than $i$ go into the left subtree. Thus, the height of the tree constructed is one plus the larger of the height of the left subtree and the height of the right subtree.

Proof: Following our construction process above, if we randomly choose the $i^{th}$ key, the left subtree has $i-1$ elements and the right subtree has $n-i$ elements. Let $h_{n}$ be the height of a randomly built binary search tree on $n$ keys. Then we have

$$ \begin{equation} h_{n} = 1 + \max(h_{i-1}, h_{n-i}) \label{eqn:1} \end{equation} $$

Now, let's define $Y_{n} = 2^{h_n}$. If we can show that $E[Y_n]$ is polynomial in $n$, we then have $E[h_n] = O(\log n)$. Note that, for a given $n$, the value of $Y_n$ depends on the random choice of $i$. Let's represent \ref{eqn:1} in terms of $Y_n$:

$$ \begin{eqnarray*} h_{n} &=& 1 + \max(h_{i-1}, h_{n-i}) \\ 2^{h_n} &=& 2^{1 + \max(h_{i-1}, h_{n-i})} \\ &=& 2 \cdot 2^{\max(h_{i-1}, h_{n-i})} \\ &=& 2 \cdot \max(2^{h_{i-1}}, 2^{h_{n-i}}) \\ Y_n &=& 2 \cdot \max(Y_{i-1}, Y_{n-i}) \end{eqnarray*} $$

Now, let's calculate $E[Y_n]$. Here, $I=i$ means we pick the $i^{th}$ element as the first element inserted into the empty tree. Since every element is equally likely to be inserted first, $P(I=i) = \frac{1}{n}$. The last step below uses $\max(a, b) \le a + b$ for nonnegative $a$ and $b$.

$$ \begin{eqnarray*} E[Y_n] &=& \sum_{i=1}^n E[Y_n|I=i]P(I=i) \\ &=& \sum_{i=1}^n E[Y_n|I=i]\frac{1}{n} \\ &=& \frac{2}{n}\sum_{i=1}^n E[\max(Y_{i-1},Y_{n-i})] \\ &\le& \frac{2}{n}\sum_{i=1}^n (E[Y_{i-1}] + E[Y_{n-i}]) \end{eqnarray*} $$

Now we expand the last summation as

$$ \begin{equation*} (E[Y_0] + E[Y_{n-1}]) + \dots + (E[Y_{n-1}] + E[Y_0]) = 2\sum_{i=0}^{n-1}E[Y_i] \end{equation*} $$

Thus, we have

$$ \begin{equation*} E[Y_n] \le \frac{4}{n}\sum_{i=0}^{n-1}E[Y_i] \end{equation*} $$

Then, we will show that for all integers $n>0$,

$$ \begin{eqnarray*} E[Y_n] &\le& \frac{1}{4}\dbinom{n+3}{3} \\ &=& \frac{1}{4}\cdot\frac{(n+3)(n+2)(n+1)}{6} \\ &=& O(n^3) \end{eqnarray*} $$

Then, we use Jensen's inequality, which states that $f(E[X]) \le E[f(X)]$ provided the expectations exist and are finite and $f(x)$ is convex. Let $X$ be $h_n$ and $f(x) = 2^x$; then $E[f(X)] = E[Y_n]$. So, we have

$$ \begin{equation*} 2^{E[h_n]} \le E[2^{h_n}] = E[Y_n] \le \frac{1}{4}\dbinom{n+3}{3} = O(n^3) \end{equation*} $$

By taking the log of both sides, we have $E[h_n] = O(\log n)$.

Remarks:

Let's first prove $\sum_{i=0}^{n-1}\dbinom{i+3}{3} = \dbinom{n+3}{4}$

Proof: We use Pascal's identity, $\dbinom{n}{k} = \dbinom{n-1}{k-1} + \dbinom{n-1}{k}$, together with the simple identity $\dbinom{4}{4} = 1 = \dbinom{3}{3}$. We have:

$$ \begin{eqnarray*} \dbinom{n+3}{4} &=& \dbinom{n+2}{3} + \dbinom{n+2}{4} \\ &=& \dbinom{n+2}{3} + \dbinom{n+1}{3} + \dbinom{n+1}{4} \\ &=& \dbinom{n+2}{3} + \dbinom{n+1}{3} + \dbinom{n}{3} + \dbinom{n}{4} \\ &\vdots& \\ &=& \dbinom{n+2}{3} + \dbinom{n+1}{3} + \dbinom{n}{3} + \dots + \dbinom{4}{3} + \dbinom{4}{4} \\ &=& \dbinom{n+2}{3} + \dbinom{n+1}{3} + \dbinom{n}{3} + \dots + \dbinom{4}{3} + \dbinom{3}{3} \\ &=& \sum_{i=0}^{n-1}\dbinom{i+3}{3} \end{eqnarray*} $$

Let's prove $E[Y_n] \le \frac{1}{4}\dbinom{n+3}{3}$ by induction.

Proof: Base case: $n=1$.

$$ \begin{equation*} 1 = Y_1 = E[Y_1] \le \frac{1}{4}\dbinom{1+3}{3} = 1. \end{equation*} $$

Inductive hypothesis: Assume that $E[Y_i]\le\frac{1}{4}\dbinom{i+3}{3}$ for all $i<n$ (taking $Y_0 = 0$ for the empty tree). Then,

$$ \begin{eqnarray*} E[Y_n] &\le& \frac{4}{n}\sum_{i=0}^{n-1}E[Y_i] \\ &\le& \frac{4}{n}\sum_{i=0}^{n-1}\frac{1}{4}\dbinom{i+3}{3} \\ &=& \frac{1}{n}\sum_{i=0}^{n-1}\dbinom{i+3}{3} \\ &=& \frac{1}{n}\dbinom{n+3}{4} \\ &=& \frac{1}{n}\frac{(n+3)!}{4!(n-1)!} \\ &=& \frac{1}{4}\frac{(n+3)!}{3!n!} \\ &=& \frac{1}{4}\dbinom{n+3}{3} \end{eqnarray*} $$
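The two algebraic facts this induction leans on, $\sum_{i=0}^{n-1}\dbinom{i+3}{3} = \dbinom{n+3}{4}$ and $\frac{1}{n}\dbinom{n+3}{4} = \frac{1}{4}\dbinom{n+3}{3}$, can be spot-checked with exact integer arithmetic. This is only a numerical sanity check under my own naming (`binom` and `identities_hold` are hypothetical helpers), not a substitute for the proof.

```c
#include <assert.h>

/* Binomial coefficient C(n, k) via the multiplicative formula.
 * After step i the running value equals C(n - k + i, i), so each
 * division is exact. */
unsigned long long binom(unsigned n, unsigned k)
{
  unsigned long long r = 1;
  unsigned i;
  if (k > n)
    return 0;
  for (i = 1; i <= k; i++)
    r = r * (n - k + i) / i;
  return r;
}

/* Returns 1 if, for every n in 1..max_n,
 *   sum_{i=0}^{n-1} C(i+3, 3) == C(n+3, 4)   and
 *   4 * C(n+3, 4) == n * C(n+3, 3). */
int identities_hold(unsigned max_n)
{
  unsigned n, i;
  assert(max_n <= 1000); /* keeps every value well inside 64 bits */
  for (n = 1; n <= max_n; n++) {
    unsigned long long sum = 0;
    for (i = 0; i < n; i++)
      sum += binom(i + 3, 3);
    if (sum != binom(n + 3, 4))
      return 0;
    if (4 * binom(n + 3, 4) != n * binom(n + 3, 3))
      return 0;
  }
  return 1;
}
```

For example, with $n = 5$ both sides of the second identity are $280$, since $\dbinom{8}{4} = 70$ and $\dbinom{8}{3} = 56$.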

Note

I referenced this lecture note when developing the proof. Overall, my proof is similar to that one. However, we differ slightly in how we compute $E[Y_n]$. The note defines indicator random variables $Z_{n,i} = I\{I=i\}$, where $I=i$ means we pick the $i^{th}$ element as the first element inserted into the empty tree. Since every element is equally likely to be inserted first, $P(I=i) = \frac{1}{n}$, and thus $E[Z_{n,i}] = \frac{1}{n}$ by $E[I_A] = P(A)$. Then, he defines $Y_n = \sum_{i=1}^nZ_{n,i} \cdot (2 \cdot \max(Y_{i-1}, Y_{n-i}))$, which works because exactly one $Z_{n,i}$ is $1$ and all the others are $0$. It seems right, but when he calculates $E[Y_n]$, he states that $Z_{n,i}$ is independent of $Y_{i-1}$ and $Y_{n-i}$. However, I don't think so, as the height of the tree $h_n$, from which $Y_n$ is constructed, depends on which element we pick first. I tend to think about $E[Y_n]$ as the expectation of a conditional expectation.

MAW 4.15

a. Give a precise expression for the minimum number of nodes in an AVL tree of height $H$. b. What is the minimum number of nodes in an AVL tree of height 15?

The minimum number of nodes in an AVL tree of height $H$ satisfies $S(H) = S(H-1) + S(H-2) + 1 \quad (H \ge 2)$ with $S(0) = 1$ and $S(1) = 2$. It's a linear nonhomogeneous recurrence relation with constant coefficients. Let's first find the general solution of the corresponding homogeneous recurrence relation $S(H) = S(H-1) + S(H-2)$. The characteristic equation is $x^2 - x - 1 = 0$ and the roots are $\frac{1+\sqrt 5}{2}$ and $\frac{1-\sqrt 5}{2}$. So, we have $S(H) = c_1\Big(\frac{1+\sqrt 5}{2}\Big)^H + c_2\Big(\frac{1-\sqrt 5}{2}\Big)^H$.

Now, for a particular solution to the recurrence relation, let's guess $S(H) = r \quad \text{for some constant } r$. This solution has to satisfy the recurrence relation as well. Thus,

$$ \begin{equation*} r = r + r + 1 \end{equation*} $$

So, we have $r = -1$. Thus, $S(H) = c_1\Big(\frac{1+\sqrt 5}{2}\Big)^H + c_2\Big(\frac{1-\sqrt 5}{2}\Big)^H - 1$. We plug the initial conditions into our general solution to solve for $c_1$ and $c_2$. We get $c_1 = 1 + \frac{2}{\sqrt 5}$ and $c_2 = 1 - \frac{2}{\sqrt 5}$. Thus, we have

$$ \begin{equation} S(H) = \Big(1 + \frac{2}{\sqrt 5}\Big)\Big(\frac{1+\sqrt 5}{2}\Big)^H + \Big(1 - \frac{2}{\sqrt 5}\Big)\Big(\frac{1-\sqrt 5}{2}\Big)^H - 1 \label{eqn:2} \end{equation} $$

Now, let $H = 15$ and we have $S(15) = 2583$.
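As a cross-check on the closed form, the recurrence itself can be iterated directly. A minimal sketch follows; the function name `avl_min_nodes` is my own.

```c
#include <assert.h>

/* Minimum number of nodes in an AVL tree of height h, computed from
 * S(h) = S(h-1) + S(h-2) + 1 with S(0) = 1 and S(1) = 2. */
long avl_min_nodes(int h)
{
  long prev = 1; /* S(i-2) */
  long cur = 2;  /* S(i-1) */
  long next;
  int i;
  assert(h >= 0);
  if (h == 0)
    return 1;
  for (i = 2; i <= h; i++) {
    next = cur + prev + 1;
    prev = cur;
    cur = next;
  }
  return cur;
}
```

Iterating gives 1, 2, 4, 7, 12, 20, 33, ... and `avl_min_nodes(15)` agrees with the value 2583 obtained from the closed form.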

Note

The initial conditions apply to the general solution of the full recurrence relation, not to the homogeneous part alone. Thus, we cannot use the initial conditions as soon as the homogeneous part is done. We need to wait until we have the whole solution (homogeneous part + particular part).

Remarks:

With \ref{eqn:2}, we can actually get the bound of the height of an AVL tree.

By \ref{eqn:2}, we see that $S(H) \ge \Big(\frac{1+\sqrt 5}{2}\Big)^H$. Suppose we have $N$ nodes in an AVL tree of height $H$. Then $N \ge S(H) \ge \Big(\frac{1+\sqrt 5}{2}\Big)^H$. Let $\phi = \frac{1+\sqrt 5}{2}$, then we have $\log _\phi N \ge H$, which is $H \le 1.44\log _2 N = O(\log N)$.

MAW 4.16

Show the result of inserting 2,1,4,5,9,3,6,7 into an initially empty AVL tree.

MAW 4.17

Keys $1, 2, \dots, 2^k-1$ are inserted in order into an initially empty AVL tree. Prove that the resulting tree is perfectly balanced ¹.

Proof: Let's use induction on $k$ to prove the following statement:

The result of inserting any increasing sequence of $2^k - 1$ numbers into an initially empty AVL tree results in a perfectly balanced tree of height $k-1$.

Base case: $k = 1$. The tree has only one node. This is clearly perfectly balanced.

Inductive hypothesis: Assume the hypothesis is true for $k = 1, 2, \dots, h$. We want to prove that it is true for $k = h + 1$, i.e., for the sequence $1, 2, \dots, 2^{h+1}-1$.

After the first $2^h - 1$ insertions, by the induction hypothesis, the tree is perfectly balanced, with height $h-1$. $2^{h-1}$ is at the root (this can be observed for $1 \le k \le 3$, where the roots are $1$, $2$, $4$ respectively). The left subtree is a perfectly balanced tree of height $h-2$, and the right subtree is a perfectly balanced tree containing the numbers $2^{h-1}+1$ through $2^h-1$, also of height $h-2$. See the following picture:

Each of the next $2^{h-1}$ insertions ($2^h$ through $2^h + 2^{h-1} - 1$) are inserted into the right subtree, and the entire sequence of numbers in the right subtree (now $2^{h-1}+1$ through $2^h + 2^{h-1}-1$) were inserted in order and are a sequence of $2^h - 1$ nodes (i.e. $2^h + 2^{h-1}-1 - (2^{h-1}+1) + 1 = 2^h -1$). By induction hypothesis, they form a perfectly balanced tree of height $h-1$. See the following picture:

The next insertion, of the number $2^h + 2^{h-1}$, imbalances the tree at the root because now the height of the right subtree is $h$ and the height of the left subtree is $h-2$. Now, we do a single rotation and form a tree with root $2^h$, and a perfectly balanced left subtree of height $h-1$. The right subtree consists of a perfectly balanced tree (of height $h-2$), with the new node: $2^h + 2^{h-1}$. See the following picture:

Thus, the right subtree is as if the numbers $2^h+1, \dots, 2^h + 2^{h-1}$ had been inserted in order. We subsequently insert the numbers $2^h + 2^{h-1} + 1$ through $2^{h+1} - 1$. In other words, we form the right subtree by inserting the numbers $2^{h} + 1, \dots, 2^{h+1} - 1$, a total of $2^{h} - 1$ numbers. Then, by the inductive hypothesis, these $2^{h} - 1$ insertions form a perfectly balanced subtree of height $h-1$. See the following picture:

Since the left and right subtrees are perfectly balanced (height $h-1$), the whole tree is perfectly balanced.
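The claim can also be checked experimentally. Below is a minimal sketch of a textbook-style recursive AVL insertion; the type and function names are my own and are not taken from the book's code. It inserts $1, 2, \dots, 2^k - 1$ in order and tests whether the result is perfect.

```c
#include <assert.h>
#include <stdlib.h>

struct AvlNode {
  int key;
  int height;                    /* a leaf has height 0 */
  struct AvlNode *left, *right;
};

static int height(const struct AvlNode *t) { return t ? t->height : -1; }

static void update(struct AvlNode *t)
{
  int hl = height(t->left), hr = height(t->right);
  t->height = 1 + (hl > hr ? hl : hr);
}

/* Single rotation: the right child becomes the subtree root. */
static struct AvlNode *rotate_left(struct AvlNode *x)
{
  struct AvlNode *y = x->right;
  x->right = y->left;
  y->left = x;
  update(x);
  update(y);
  return y;
}

/* Single rotation: the left child becomes the subtree root. */
static struct AvlNode *rotate_right(struct AvlNode *x)
{
  struct AvlNode *y = x->left;
  x->left = y->right;
  y->right = x;
  update(x);
  update(y);
  return y;
}

struct AvlNode *avl_insert(struct AvlNode *t, int key)
{
  if (t == NULL) {
    t = malloc(sizeof *t);
    assert(t != NULL);
    t->key = key;
    t->height = 0;
    t->left = t->right = NULL;
    return t;
  }
  if (key < t->key) {
    t->left = avl_insert(t->left, key);
    if (height(t->left) - height(t->right) == 2) {
      if (key > t->left->key)              /* left-right case */
        t->left = rotate_left(t->left);
      t = rotate_right(t);
    }
  } else if (key > t->key) {
    t->right = avl_insert(t->right, key);
    if (height(t->right) - height(t->left) == 2) {
      if (key < t->right->key)             /* right-left case */
        t->right = rotate_right(t->right);
      t = rotate_left(t);
    }
  }
  update(t);
  return t;
}

/* A tree is perfect when, at every node, both subtrees have equal height. */
int is_perfect(const struct AvlNode *t)
{
  if (t == NULL)
    return 1;
  return height(t->left) == height(t->right)
      && is_perfect(t->left) && is_perfect(t->right);
}

void avl_free(struct AvlNode *t)
{
  if (t == NULL)
    return;
  avl_free(t->left);
  avl_free(t->right);
  free(t);
}

/* Inserts 1 .. 2^k - 1 in increasing order; the caller frees the result. */
struct AvlNode *build_sequential(int k)
{
  struct AvlNode *t = NULL;
  int i;
  for (i = 1; i < (1 << k); i++)
    t = avl_insert(t, i);
  return t;
}
```

For $k = 4$ the resulting tree should be perfect, with root $2^{k-1} = 8$ and height $k - 1 = 3$, matching the statement proved above.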

MAW 4.23

MAW 4.24

MAW 4.25

This problem is solved through brute-force calculation. You can reference the example from figure 4.46 to figure 4.55. I calculated the internal path length of the tree and find(1), find(2). My answer is slightly off from the solution manual, so it may need a double check.

MAW 4.26

a. Show that if all nodes in a splay tree are accessed in sequential order, the resulting tree consists of a chain of left children.

Proof: Let's prove by induction. Let $N$ denote the number of nodes in a splay tree.

Base case: When $N = 1$, the claim holds.

Inductive hypothesis: suppose that when all nodes $1, \dots, N$ in a splay tree are accessed in sequential order, the resulting tree consists of a chain of left children. We want to show that this holds for $N+1$. Once we have accessed the first $N$ nodes, there is only one possible position for node $N+1$: the right child of the root. Any other position is impossible, because if node $N+1$ were the right child of some node between the leftmost node and the root of the resulting tree, then by the BST property its value would be smaller than the root's value and bigger than the leftmost node's value. This violates the induction hypothesis because we would no longer be accessing the splay tree in sequential order. Now we simply rotate the right child of the root up to the root (a single rotation), and we get a chain of left children.

MAW 4.43

a. Show that via AVL single rotations, any binary search tree $T_1$ can be transformed into another search tree $T_2$ (with the same keys). b. Give an algorithm to perform this transformation using $O(N\log N)$ rotations on average. c. Show that this transformation can be done with $O(N)$ rotations, worst-case.

Let's first work through an example shown in the picture below. We transform the tree in the top-left of the picture to the tree in the top-right of the picture through several steps linked by single arrows.

As you can see, the strategy here is to process the tree in preorder. We compare the root of $T_1$ with the root of $T_2$. If they are equal, we move on to the left and right subtrees of $T_1$ and process them recursively. If they are not equal, we find the value $x$ of $T_2$'s root in $T_1$ and rotate it up to the root of $T_1$. Then, we recursively process the left and right subtrees of $T_1$. This algorithm takes $O(N\log N)$ rotations on average because $x$ sits at depth $O(\log N)$ on average, so finding it takes $O(\log N)$ time and bringing it to the root takes $O(\log N)$ single rotations. Since we may have to do this for each of the $N$ nodes, the result follows. However, a BST can be degenerate, and in that case we have $O(N)$ in the worst case.
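The preorder strategy above can be sketched in a few lines of C. This is my own illustration under stated assumptions: the `Node` type and all function names are hypothetical, both trees are assumed to hold exactly the same keys, and `bst_insert` is a plain unbalanced insertion used only to build example trees.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical BST node; names are illustrative only. */
struct Node {
  int key;
  struct Node *left, *right;
};

/* Plain (unbalanced) BST insertion, used only to build example trees. */
struct Node *bst_insert(struct Node *t, int key)
{
  if (t == NULL) {
    t = malloc(sizeof *t);
    assert(t != NULL);
    t->key = key;
    t->left = t->right = NULL;
  } else if (key < t->key) {
    t->left = bst_insert(t->left, key);
  } else if (key > t->key) {
    t->right = bst_insert(t->right, key);
  }
  return t;
}

/* Single rotation: the left child becomes the subtree root. */
static struct Node *rotate_with_left(struct Node *t)
{
  struct Node *l = t->left;
  t->left = l->right;
  l->right = t;
  return l;
}

/* Single rotation: the right child becomes the subtree root. */
static struct Node *rotate_with_right(struct Node *t)
{
  struct Node *r = t->right;
  t->right = r->left;
  r->left = t;
  return r;
}

/* Brings the node holding x to the root of t using single rotations only.
 * Assumes x is present in t. */
struct Node *rotate_to_root(struct Node *t, int x)
{
  assert(t != NULL);
  if (x < t->key) {
    t->left = rotate_to_root(t->left, x);
    t = rotate_with_left(t);
  } else if (x > t->key) {
    t->right = rotate_to_root(t->right, x);
    t = rotate_with_right(t);
  }
  return t;
}

/* Reshapes t1 into the shape of t2 (same key set), in preorder. */
struct Node *transform(struct Node *t1, const struct Node *t2)
{
  if (t2 == NULL)
    return t1;
  t1 = rotate_to_root(t1, t2->key);
  t1->left = transform(t1->left, t2->left);
  t1->right = transform(t1->right, t2->right);
  return t1;
}

/* Returns 1 if the two trees have the same shape and the same keys. */
int same_tree(const struct Node *a, const struct Node *b)
{
  if (a == NULL || b == NULL)
    return a == b;
  return a->key == b->key
      && same_tree(a->left, b->left) && same_tree(a->right, b->right);
}
```

For instance, building $T_1$ by inserting 1, 2, 3, 4, 5 (a right chain) and $T_2$ by inserting 3, 1, 2, 5, 4, then calling `transform(T1, T2)`, turns $T_1$ into a tree for which `same_tree` reports a match. The recursion works because, once $T_2$'s root key sits at the root of $T_1$, the left subtrees of both trees hold exactly the same keys, and likewise for the right subtrees.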

The solution and corresponding figures are largely taken from this link, with minor wording tweaks to make them easier for me to understand.

Tree Terminology

2017-01-24T20:23:00+08:00

Terminology

Like "list" in Chapter 3, "Trees" is another type of abstraction.

tree, root, edge

We can define a tree recursively. A tree is a collection of nodes. The collection can be empty; otherwise, a tree consists of a distinguished node r, called the root, and zero or more nonempty (sub)trees $T_1$, $T_2$, ..., $T_k$, each of whose roots is connected by a directed edge from r.

child, parent

The root of each subtree is said to be a child of r, and r is the parent of each subtree root.

leaves

Nodes with no children are known as leaves.

siblings

Nodes with the same parent are siblings.

path

A path from node $n_1$ to $n_k$ is defined as a sequence of nodes $n_1$, $n_2$, ..., $n_k$ such that $n_i$ is the parent of $n_{i+1}$ for $1 \le i < k$.

length

The length of this path is the number of edges on the path, namely $k-1$.

Note

There is a path of length zero from every node to itself.
Notice that in a tree there is exactly one path from the root to each node.

depth

For any node $n_i$, the depth of $n_i$, is the length of the unique path from the root to $n_i$.

internal path length

The sum of the depths of all nodes in a tree is known as the internal path length.

height

The height of $n_i$ is the length of the longest path from $n_i$ to a leaf. (i.e., the height of a node is the number of edges from the node to the deepest leaf)

ancestor, descendant

If there is a path from $n_1$ to $n_2$, then $n_1$ is an ancestor of $n_2$ and $n_2$ is a descendant of $n_1$. If $n_1 \neq n_2$, then $n_1$ is a proper ancestor of $n_2$ and $n_2$ is a proper descendant of $n_1$.

internal node

An internal node is a node with at least one child. In other words, internal nodes are nodes other than leaves.

degree

The number of children of a node is called the degree of that node. The highest degree among all the nodes in a tree is called the degree of the tree.

level

The root node is said to be at level 0, the children of the root are at level 1, the children of the nodes at level 1 are at level 2, and so on. In other words, each step from top to bottom in a tree is called a level; the level count starts at 0 and is incremented by one at each step.

predecessor / successor

If $X$ has two children, its predecessor is the maximum value in its left subtree and its successor is the minimum value in its right subtree. These are exactly the values visited just before and just after $X$ in an in-order traversal of the tree.

Inorder traversal

Given a tree shown below, the inorder traversal (left, root, right) gives: 4, 2, 5, 1, 3

Preorder traversal

Given the same tree above, the preorder traversal (root, left, right) gives: 1, 2, 4, 5, 3

Postorder traversal

Given the same tree above, the postorder traversal (left, right, root) gives: 4, 5, 2, 3, 1
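The three traversals differ only in where the root is visited relative to its subtrees. Here is a minimal sketch; the `TreeNode` type and function names are my own, and the tree shape (1 at the root with children 2 and 3, and 2 with children 4 and 5) is reconstructed from the three orders listed above.

```c
#include <assert.h>
#include <stddef.h>

struct TreeNode {
  int key;
  struct TreeNode *left, *right;
};

/* Each routine appends the visited keys to out[] starting at index n
 * and returns the index one past the last key written. */
int inorder(const struct TreeNode *t, int *out, int n)
{
  if (t == NULL)
    return n;
  n = inorder(t->left, out, n);
  out[n++] = t->key;                 /* root between the subtrees */
  return inorder(t->right, out, n);
}

int preorder(const struct TreeNode *t, int *out, int n)
{
  if (t == NULL)
    return n;
  out[n++] = t->key;                 /* root first */
  n = preorder(t->left, out, n);
  return preorder(t->right, out, n);
}

int postorder(const struct TreeNode *t, int *out, int n)
{
  if (t == NULL)
    return n;
  n = postorder(t->left, out, n);
  n = postorder(t->right, out, n);
  out[n++] = t->key;                 /* root last */
  return n;
}

/* Writes one traversal of the five-node example tree into out[0..4].
 * which: 'i' = inorder, 'p' = preorder, anything else = postorder. */
void traverse_example(char which, int out[5])
{
  struct TreeNode n4 = {4, NULL, NULL}, n5 = {5, NULL, NULL};
  struct TreeNode n3 = {3, NULL, NULL};
  struct TreeNode n2 = {2, &n4, &n5};
  struct TreeNode n1 = {1, &n2, &n3};
  int count;
  if (which == 'i')
    count = inorder(&n1, out, 0);
  else if (which == 'p')
    count = preorder(&n1, out, 0);
  else
    count = postorder(&n1, out, 0);
  assert(count == 5);
}
```

Calling `traverse_example` with `'i'`, `'p'`, and any other character fills `out` with the inorder, preorder, and postorder sequences given above.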

Some properties

A tree is a collection of N nodes, one of which is the root, and N-1 edges. (since each edge connects some node to its parent, and every node except the root has one parent.)
The root is at depth 0.
All leaves are at height 0.
The height of a tree is equal to the height of the root.
The depth of a tree is equal to the depth of the deepest leaf; this is always equal to the height of the tree.
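The last two properties translate directly into recursive functions. The sketch below uses hypothetical names and assumes the convention that an empty tree has height $-1$, which keeps the recursion free of special cases.

```c
#include <assert.h>
#include <stddef.h>

struct TreeNode {
  struct TreeNode *left, *right;
};

/* Height in edges. A leaf has height 0; an empty tree gets -1
 * (a convention assumed here, not something stated in the list above). */
int tree_height(const struct TreeNode *t)
{
  int hl, hr;
  if (t == NULL)
    return -1;
  hl = tree_height(t->left);
  hr = tree_height(t->right);
  return 1 + (hl > hr ? hl : hr);
}

/* Depth of the deepest node below t, where t itself sits at depth d. */
int deepest_depth(const struct TreeNode *t, int d)
{
  int dl, dr;
  assert(t != NULL);
  dl = t->left ? deepest_depth(t->left, d + 1) : d;
  dr = t->right ? deepest_depth(t->right, d + 1) : d;
  return dl > dr ? dl : dr;
}
```

For any nonempty tree `t`, `deepest_depth(t, 0)` and `tree_height(t)` should return the same number, which is the "depth of a tree equals height of the tree" property.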

Example

Let's work through MAW 4.1, 4.2, and 4.3 to get the tree terminology clear.

"A" is the root
"G", "H", "I", "L", "M", "K" are the leaves
"A":
- children: "B", "C"
- depth: 0
- height: 4
"B":
- parent: "A"
- children: "D", "E"
- siblings: "C"
- depth: 1
- height: 3
The depth of the tree is 4

MAW: Chapter 3 Reflection

2017-01-23T10:45:00+08:00

I have finally finished Chapter 3: Lists, Stacks, and Queues, with almost all the problems solved. It's time for some summary and reflection.

Reflection

One important philosophy in this chapter is the separation between the interface exposed to the user and the implementation details behind the scenes. The interface exposed to the user is the Abstract Data Type (ADT). In this chapter, the interface is "List". However, multiple implementations can meet the requirements of the interface, namely "Array" and "Linked List". We can further divide the "List" interface into the subcategories "Stack", "Queue", and "Deque". In other words, even though we talk about the "Stack" ADT, "Queue" ADT, and "Deque" ADT, they are all essentially "List" with some restrictions on the list operations. Here is a picture that helps us to understand this concept better:

In terms of actual implementation, we can get a sense of the basic structure that a data structure should have. Take a linked list implementation of a queue as an example. The key characteristic of a queue is that both enqueue and dequeue should be $O(1)$ operations. This leads us to keep pointers to both the front and the rear of the list. Thus, our queue "queue.h" and "queue.c" look like the following, respectively:

struct QueueRecord;
struct QueueCDT;
typedef struct QueueRecord* PtrToNode;
typedef struct QueueCDT* QueueADT;

struct QueueRecord
{
  ET Element;
  PtrToNode Next;
};

struct QueueCDT
{
  PtrToNode Front;
  PtrToNode Rear;
};

Here, the queue pointed to by QueueADT is defined by two pointers: Front and Rear. Those two pointers point to the actual QueueRecord nodes, which is how we form our linked list implementation. So, when we implement a data structure, we can take a top-down view by first thinking about what characterizes our data structure. That's the very important first step. Then, we can follow the flow naturally by defining the elements required to implement those characteristics.
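To make the $O(1)$ claim concrete, here is a minimal sketch of how enqueue and dequeue could be written on top of these two structs. The function names (`CreateQueue`, `IsEmpty`, `Enqueue`, `Dequeue`) and the choice `typedef int ET` are assumptions for illustration; the actual routines in my "queue.c" may differ.

```c
#include <assert.h>
#include <stdlib.h>

typedef int ET; /* assumption: the element type, left unspecified above */

struct QueueRecord;
struct QueueCDT;
typedef struct QueueRecord* PtrToNode;
typedef struct QueueCDT* QueueADT;

struct QueueRecord
{
  ET Element;
  PtrToNode Next;
};

struct QueueCDT
{
  PtrToNode Front;
  PtrToNode Rear;
};

QueueADT CreateQueue(void)
{
  QueueADT Q = malloc(sizeof *Q);
  assert(Q != NULL);
  Q->Front = Q->Rear = NULL;
  return Q;
}

int IsEmpty(QueueADT Q)
{
  return Q->Front == NULL;
}

/* O(1): link the new node after Rear, no traversal needed. */
void Enqueue(ET X, QueueADT Q)
{
  PtrToNode N = malloc(sizeof *N);
  assert(N != NULL);
  N->Element = X;
  N->Next = NULL;
  if (IsEmpty(Q))
    Q->Front = N;
  else
    Q->Rear->Next = N;
  Q->Rear = N;
}

/* O(1): unlink the node at Front and return its element. */
ET Dequeue(QueueADT Q)
{
  PtrToNode Old;
  ET X;
  assert(!IsEmpty(Q));
  Old = Q->Front;
  X = Old->Element;
  Q->Front = Old->Next;
  if (Q->Front == NULL) /* the queue became empty */
    Q->Rear = NULL;
  free(Old);
  return X;
}
```

Without the Rear pointer, Enqueue would have to walk the whole list to find the last node, which is exactly the $O(N)$ cost the two-pointer layout avoids.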

Chapter Structure

Linked List

Singly Linked List
Doubly Linked List
Circular Linked List
Applications:
- Polynomial ADT
- Radix Sort

Stacks

Linked List implementation
Array implementation
Applications:
- Balancing Symbols
- Postfix Expression (Postfix expression evaluation; Infix to Postfix Conversion)
- Function Call Stack

Queue

Array implementation
Linked List implementation

Notable Questions

3.4, 3.5

implement set operations using the "List" interface. The union one especially provides insight into how we can implement addition of polynomials (3.6) and integer addition (3.9)
3.6, 3.7, 3.8, 3.9

a set of problems relating to Polynomial ADT
3.10, 3.24

problems to practice recurrence relations. The Josephus problem is particularly interesting because it's a good combination of mathematics, algorithms (dynamic programming), and data structures.
3.12, 3.21, 3.23

commonly-seen interview questions
3.13

requires us to actually implement a radix sort in a real problem.
3.18

balancing symbols using Stack. A really cool problem that the end-product is a tool that you can use in your daily work.
3.19, 3.20

Postfix, Infix related problems. Learn about the "shunting yard" algorithm and how left-associative operators (e.g., +, -) differ from right-associative operators (e.g., ^) in terms of implementation.
3.25, 3.26

Implement Queue and its variation, Deque, using different data structures. In particular, circular array implementation.

Left Out

Some material I left out when I worked through this chapter:

function calls as an example of stack (this part is going to be covered from computer system point of view in the coming posts).
cursor implementation of linked list (this part is not on the top priority for now).
3.7.c, 3.14, 3.16, 3.18.a, 3.22.b

Num of function calls in recursive Fibonacci routine

2017-01-22T23:12:00+08:00

This is MAW 3.24:

If the recursive routine in Section 2.4 used to compute Fibonacci numbers is run for N = 50, is stack space likely to run out? Why or why not?

unsigned long
Fib(int N)
{
  if (N <= 1)
    return 1;
  else
    return Fib(N-1) + Fib(N-2);
}

Let's first run an empirical experiment. Running our test program numCalls gives the following output:

    i               Fib(i)          numCalls
    i = 0           1               1
    i = 1           1               1
    i = 2           2               3
    i = 3           3               5
    i = 4           5               9
    i = 5           8               15
    i = 6           13              25
    i = 7           21              41
    i = 8           34              67
    i = 9           55              109
    i = 10          89              177
    i = 11          144             287
    i = 12          233             465
    i = 13          377             753
    i = 14          610             1219
    i = 15          987             1973
    i = 16          1597            3193
    i = 17          2584            5167
    i = 18          4181            8361
    i = 19          6765            13529
    ... snip ...
    i = 50          20365011074     40730022147

We know that the Fibonacci numbers are defined by the following recurrence relation:

$$ \begin{equation} F(n) = F(n-1) + F(n-2), \text{ for }n = 2, 3, ... \label{eq:1} \end{equation} $$

We define $F(0) = F(1) = 1$. Now, we want to find out the number of recursive calls made to calculate $F(n)$. We use $G(n)$ to denote the number of calls made by the recursive program in calculating $F(n)$. Let's examine the output above. We see that $G(0) = G(1) = 1$ and to compute $G(n)$ for arbitrary $n$, we'll make an initial call, and then $G(n-1)$ calls to calculate $F(n-1)$ and $G(n-2)$ calls to calculate $F(n-2)$. Thus, we have the following recurrence relation for $G(n)$:

$$ \begin{equation} G(n) = G(n-1) + G(n-2) + 1 \label{eq:2} \end{equation} $$

Let's solve this recurrence relation by establishing the relationship between $F(n)$ and $G(n)$; then we can get its closed form from the closed form of $F(n)$.

Let's suppose that $G(n)$ depends on $F(n)$ in some way. In other words, $G(n)$ is a function of $F(n)$. Let's try linear form first:

$$ \begin{equation} G(n) = a F(n) + b \text{ where a, b are unknown constants} \label {eq:3} \end{equation} $$

Since we know that $G(0) = G(1) = 1$ and $F(0) = F(1) = 1$, then \ref{eq:3} becomes

$$ \begin{eqnarray*} G(1) & = & a F(1) + b \\ 1 & = & a + b \end{eqnarray*} $$

Now let's plug \ref{eq:3} into \ref{eq:2} and use \ref{eq:1}:

$$ \begin{eqnarray*} G(n) & = & G(n-1) + G(n-2) + 1 \\ a F(n) + b & = & G(n-1) + G(n-2) + 1 \\ a (F(n-1) + F(n-2)) + b & = & G(n-1) + G(n-2) + 1 \\ a (F(n-1) + F(n-2)) + b & = & a F(n-1) + b + a F(n-2) + b + 1 \\ b & = & 2b + 1 \\ b & = & -1 \end{eqnarray*} $$

Since $a + b = 1$, we get $a = 2$, and \ref{eq:3} becomes $G(n) = 2F(n) - 1$. That is, the number of function calls to calculate a Fibonacci number $F(n)$ is $2F(n) - 1$.
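The closed form $G(n) = 2F(n) - 1$ can be checked against an instrumented version of the routine. The sketch below is not the numCalls program itself; `fib_counted` and `calls_match_formula` are hypothetical names for a version of Fib that also counts how often it is entered.

```c
#include <assert.h>

/* Same recursion as Fib, but also counts every call through *calls. */
unsigned long fib_counted(int n, unsigned long *calls)
{
  assert(n >= 0);
  ++*calls;
  if (n <= 1)
    return 1;
  return fib_counted(n - 1, calls) + fib_counted(n - 2, calls);
}

/* Returns 1 if G(n) == 2 F(n) - 1 for every n in 0..max_n. */
int calls_match_formula(int max_n)
{
  int n;
  for (n = 0; n <= max_n; n++) {
    unsigned long calls = 0;
    unsigned long f = fib_counted(n, &calls);
    if (calls != 2 * f - 1)
      return 0;
  }
  return 1;
}
```

For $n = 10$ this counts 177 calls for $F(10) = 89$, the same pair that appears in the table above.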

Then the question asks whether the stack space is likely to run out. This actually confuses me because it seems like the author is hinting at a relationship between the number of recursive calls and the actual space the program takes on the call stack. I have no clue so far. But maybe we can find out how much space our Fib routine takes on the call stack and how large the system call stack is, and compare the two to get some insights.

We can use ulimit -a or ulimit -s to find out the size of stack that system allows:

$ ulimit -a
... snip ...
stack size              (kbytes, -s) 10240
... snip ...

As you can see, the default stack size is 10 MB. Let's see how much stack space our Fib uses: as of gcc 4.6, there is an option -fstack-usage that lets us check the maximum amount of stack a function uses. Read more info here.

numCalls.c:17:1:Fib     48      static
numCalls.c:27:1:main    64      static

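As you can see, Fib uses only 48 bytes per call. The recursion is never more than about $N$ calls deep at any moment (a call's frame is released when it returns), so $N = 50$ needs roughly $50 \times 48 = 2400$ bytes, and it's quite unlikely to drain our stack space. But, of course, the running time is another matter: it's going to be very slow to get the output for $N = 50$.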
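What matters for stack space is how many calls are active at the same time, not how many calls are made in total. A minimal sketch that measures this is below; the counters and the names `fib_depth` and `max_frames` are my own additions, not part of the original test program.

```c
#include <assert.h>

static int current_depth; /* calls of fib_depth currently active */
static int deepest;       /* largest value current_depth has reached */

/* Same recursion as Fib, tracking how deep the call chain gets. */
unsigned long fib_depth(int n)
{
  unsigned long r;
  assert(n >= 0);
  if (++current_depth > deepest)
    deepest = current_depth;
  r = (n <= 1) ? 1 : fib_depth(n - 1) + fib_depth(n - 2);
  --current_depth; /* this call's frame is about to be released */
  return r;
}

/* Maximum number of simultaneously active calls when computing Fib(n). */
int max_frames(int n)
{
  current_depth = deepest = 0;
  fib_depth(n);
  return deepest;
}
```

The deepest chain is Fib(n), Fib(n-1), ..., Fib(1), so `max_frames(n)` grows linearly in $n$ even though the total number of calls grows exponentially.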

Future work

This paper mentions that \ref{eq:1} and \ref{eq:2}, with their respective initial conditions, form a second-order Discrete Dynamical System (DDS). This offers some more mathematical insights. It also reminds me of equation 1.11 in Concrete Mathematics: A Foundation for Computer Science, which works on a generalized Josephus problem recurrence relation with a system of three equations and three unknown constant coefficients. In fact, this way of solving problems seems to show up everywhere, for example in differential equations and in calculating moments in statistics. Quite interesting.
Lots of things can be said about the call stack. In addition, "determine the amount of stack a program uses" is an interesting question that I may dig into in the future.

Typecasting in C

2017-01-19T00:53:00+08:00

This post covers the typecasting in C with the aim to get a clear understanding of the most commonly-seen C manipulation. This writeup is adapted from "Hacking - the Art of Exploitation".

TL;DR

Typecasting changes a variable into a different type just for that operation.
The pointer type determines the size of the data it points to. In other words, when you do pointer arithmetic (e.g., +), the number of bytes by which the memory address changes (e.g., increases) is determined by the pointer type.
A void pointer is a generic pointer, and we need to cast it to the proper data type in order to dereference it.
A pointer is merely a memory address. With typecasting, any type with enough size to hold the memory address can work like a pointer.

Details

Typecasting is simply a way to temporarily change a variable's data type, despite how it was originally defined. When a variable is typecast into a different type, the compiler is basically told to treat that variable as if it were the new data type, but only for that operation.

Typecasting is mostly used with pointers. Before we jump into typecasting, let's take a look at why we need to define a type for a pointer (a pointer is just a memory address). One reason is to limit programming errors. An integer pointer should only point to integer data, while a character pointer should only point to character data. Another reason is pointer arithmetic. An integer is four bytes in size (on typical platforms), while a character takes up only a single byte.

#include <stdio.h>

int main()
{
  int i;

  char char_array[5] = {'a', 'b', 'c', 'd', 'e'};
  int int_array[5] = {1, 2, 3, 4, 5};

  char *char_pointer;
  int *int_pointer;

  char_pointer = char_array;
  int_pointer = int_array;

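  for(i = 0; i < 5; i++)
  {
    // Advance the pointer in its own statement: reading and modifying
    // int_pointer within one argument list is undefined behavior.
    printf("[integer pointer] points to %p, which contains the integer %d\n", (void *) int_pointer, *int_pointer);
    int_pointer++;
  }

  for(i = 0; i < 5; i++)
  {
    printf("[char pointer] points to %p, which contains the char '%c'\n", (void *) char_pointer, *char_pointer);
    char_pointer++;
  }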
}

The program itself is pretty straightforward. Here is the output:

[integer pointer] points to 0x7ffd90db45a4, which contains the integer 1
[integer pointer] points to 0x7ffd90db45a8, which contains the integer 2
[integer pointer] points to 0x7ffd90db45ac, which contains the integer 3
[integer pointer] points to 0x7ffd90db45b0, which contains the integer 4
[integer pointer] points to 0x7ffd90db45b4, which contains the integer 5
[char pointer] points to 0x7ffd90db45c1, which contains the char 'a'
[char pointer] points to 0x7ffd90db45c2, which contains the char 'b'
[char pointer] points to 0x7ffd90db45c3, which contains the char 'c'
[char pointer] points to 0x7ffd90db45c4, which contains the char 'd'
[char pointer] points to 0x7ffd90db45c5, which contains the char 'e'

Even though the same value of 1 is added to int_pointer and char_pointer in their respective loops, the compiler increments the pointer's addresses by different amounts. Since a char is only 1 byte, the pointer to the next char would naturally also be 1 byte over. But since an integer is 4 bytes, a pointer to the next integer has to be 4 bytes over.
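The step size can also be measured directly by viewing both addresses as byte pointers. This is a small sketch with hypothetical function names; it relies only on `sizeof`, so it does not assume that an `int` is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes between p and p + 1 for an int pointer. */
size_t int_step(void)
{
  int a[2] = {0, 0};
  int *p = a;
  /* Cast to char * so the subtraction counts bytes, not ints. */
  return (size_t)((char *)(p + 1) - (char *)p);
}

/* Number of bytes between p and p + 1 for a char pointer. */
size_t char_step(void)
{
  char c[2] = {0, 0};
  char *p = c;
  assert(sizeof(char) == 1); /* guaranteed by the C standard */
  return (size_t)((p + 1) - p);
}
```

On a platform where `int` is 4 bytes, `int_step()` returns 4 and `char_step()` returns 1, matching the address gaps in the output above.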

#include <stdio.h>

int main()
{
  int i;

  char char_array[5] = {'a', 'b', 'c', 'd', 'e'};
  int int_array[5] = {1, 2, 3, 4, 5};

  char *char_pointer;
  int *int_pointer;

  // The char_pointer and int_pointer now point to incompatible data types.
  char_pointer = int_array;
  int_pointer = char_array;

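  for(i = 0; i < 5; i++)
  {
    // Advance the pointer in its own statement: reading and modifying
    // int_pointer within one argument list is undefined behavior.
    printf("[integer pointer] points to %p, which contains the integer %c\n", (void *) int_pointer, *int_pointer);
    int_pointer++;
  }

  for(i = 0; i < 5; i++)
  {
    printf("[char pointer] points to %p, which contains the char '%d'\n", (void *) char_pointer, *char_pointer);
    char_pointer++;
  }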
}

Compiling this program produces warnings:

$ gcc pointer_types2.c
pointer_types2.c: In function ‘main’:
pointer_types2.c:13: warning: assignment from incompatible pointer type
pointer_types2.c:14: warning: assignment from incompatible pointer type

Here, the compiler and the programmer are the only ones that care about a pointer's type. In the compiled code, a pointer is nothing more than a memory address, so the compiler will still compile the code if a pointer points to an incompatible data type; it simply warns us to anticipate unexpected results. Running the program gives:

[integer pointer] points to 0x7ffe2d481324, which contains the integer a
[integer pointer] points to 0x7ffe2d481328, which contains the integer e
[integer pointer] points to 0x7ffe2d48132c, which contains the integer ▒
[integer pointer] points to 0x7ffe2d481330, which contains the integer
[integer pointer] points to 0x7ffe2d481334, which contains the integer
[char pointer] points to 0x7ffe2d481301, which contains the char '1'
[char pointer] points to 0x7ffe2d481302, which contains the char '0'
[char pointer] points to 0x7ffe2d481303, which contains the char '0'
[char pointer] points to 0x7ffe2d481304, which contains the char '0'
[char pointer] points to 0x7ffe2d481305, which contains the char '2'

Even though int_pointer points to character data that only contains 5 bytes, it is still typed as a pointer to integer. This means that adding 1 to the pointer increments the address by 4 each time, so from the third step on it points past the end of the 5-byte array, which is why the later lines print garbage. Similarly, the char_pointer's address is only incremented by 1 each time, stepping through the 20 bytes of integer data one byte at a time. So, we need to make sure that the pointer type is correct. This is where typecasting comes in.

#include <stdio.h>

int main()
{
  int i;

  char char_array[5] = {'a', 'b', 'c', 'd', 'e'};
  int int_array[5] = {1, 2, 3, 4, 5};

  char *char_pointer;
  int *int_pointer;

  char_pointer = (char *) int_array;
  int_pointer = (int *) char_array;

  for(i = 0; i < 5; i++)
  {
    printf("[integer pointer] points to %p, which contains the integer %c\n", int_pointer, *int_pointer);
    int_pointer = (int *)((char *)int_pointer + 1);
  }

  for(i = 0; i < 5; i++)
  {
    printf("[char pointer] points to %p, which contains the char '%d'\n", char_pointer, *char_pointer);
    char_pointer = (char *)((int *)char_pointer + 1);
  }
}

Typecasting is just a way to change the type of a variable on the fly. In the above code, when the pointers are initially set, the data is typecast into the pointer's data type. This prevents the C compiler from complaining about the conflicting data types; however, any pointer arithmetic will still be incorrect (because a typecast applies only to that one operation). To fix that, when 1 is added to the pointers, they must first be typecast into the correct data type so the address is incremented by the correct amount. Then the result needs to be typecast back into the pointer's data type once again. It works, but it is not pretty:

[integer pointer] points to 0x7ffd484ac470, which contains the integer a
[integer pointer] points to 0x7ffd484ac471, which contains the integer b
[integer pointer] points to 0x7ffd484ac472, which contains the integer c
[integer pointer] points to 0x7ffd484ac473, which contains the integer d
[integer pointer] points to 0x7ffd484ac474, which contains the integer e
[char pointer] points to 0x7ffd484ac450, which contains the char '1'
[char pointer] points to 0x7ffd484ac454, which contains the char '2'
[char pointer] points to 0x7ffd484ac458, which contains the char '3'
[char pointer] points to 0x7ffd484ac45c, which contains the char '4'
[char pointer] points to 0x7ffd484ac460, which contains the char '5'

Sometimes, we probably want to use a generic, typeless pointer. In C, a void pointer is a typeless pointer, defined by the void keyword. Here are two things we need to note:

Pointers cannot be dereferenced unless they have a type. In order to retrieve the value stored at the pointer's memory address, the compiler must first know what type of data it is.

Void pointers must also be typecast before doing pointer arithmetic, which indicates that a void pointer's main purpose is simply to hold a memory address.

Let's rewrite our program.

#include <stdio.h>

int main()
{
  int i;

  char char_array[5] = {'a', 'b', 'c', 'd', 'e'};
  int int_array[5] = {1, 2, 3, 4, 5};

  void *void_pointer;

  void_pointer = (void *)char_array;

  for(i = 0; i < 5; i++)
  {
    printf("[char pointer] points to %p, which contains the char %c\n", void_pointer, *((char *)void_pointer));
    void_pointer = (void *)((char *)void_pointer + 1);
  }

  void_pointer = (void *)int_array;

  for(i = 0; i < 5; i++)
  {
    printf("[integer pointer] points to %p, which contains the integer %d\n", void_pointer, *((int *)void_pointer));
    void_pointer = (void *)((int *) void_pointer + 1);
  }
}

The output is:

[char pointer] points to 0x7fff06cf8de0, which contains the char a
[char pointer] points to 0x7fff06cf8de1, which contains the char b
[char pointer] points to 0x7fff06cf8de2, which contains the char c
[char pointer] points to 0x7fff06cf8de3, which contains the char d
[char pointer] points to 0x7fff06cf8de4, which contains the char e
[integer pointer] points to 0x7fff06cf8dc0, which contains the integer 1
[integer pointer] points to 0x7fff06cf8dc4, which contains the integer 2
[integer pointer] points to 0x7fff06cf8dc8, which contains the integer 3
[integer pointer] points to 0x7fff06cf8dcc, which contains the integer 4
[integer pointer] points to 0x7fff06cf8dd0, which contains the integer 5

The void pointer is really just holding the memory addresses, while the hard-coded typecasting tells the compiler to use the proper types whenever the pointer is used. Since the type is taken care of by the typecasts, the void pointer is truly nothing more than a memory address. With the data types defined by typecasting, anything that is big enough to hold a four-byte or eight-byte value can work the same way as a void pointer.

#include <stdio.h>

int main()
{
  int i;

  char char_array[5] = {'a', 'b', 'c', 'd', 'e'};
  int int_array[5] = {1, 2, 3, 4, 5};

  unsigned long int hacky_nonpointer;

  hacky_nonpointer = (unsigned long int)char_array;

  for(i = 0; i < 5; i++)
  {
    printf("[hacky_nonpointer] points to %p, which contains the char %c\n", hacky_nonpointer, *((char *)hacky_nonpointer));
    hacky_nonpointer = hacky_nonpointer + sizeof(char);
  }

  hacky_nonpointer = (unsigned long int)int_array;

  for(i = 0; i < 5; i++)
  {
    printf("[hacky_nonpointer] points to %p, which contains the integer %d\n", hacky_nonpointer, *((int *)hacky_nonpointer));
    hacky_nonpointer = hacky_nonpointer + sizeof(int);
  }
}

Note that I use unsigned long int because I'm on a 64-bit system; unsigned int is enough on a 32-bit system. The output is:

[hacky_nonpointer] points to 0x7fff3e378360, which contains the char a
[hacky_nonpointer] points to 0x7fff3e378361, which contains the char b
[hacky_nonpointer] points to 0x7fff3e378362, which contains the char c
[hacky_nonpointer] points to 0x7fff3e378363, which contains the char d
[hacky_nonpointer] points to 0x7fff3e378364, which contains the char e
[hacky_nonpointer] points to 0x7fff3e378340, which contains the integer 1
[hacky_nonpointer] points to 0x7fff3e378344, which contains the integer 2
[hacky_nonpointer] points to 0x7fff3e378348, which contains the integer 3
[hacky_nonpointer] points to 0x7fff3e37834c, which contains the integer 4
[hacky_nonpointer] points to 0x7fff3e378350, which contains the integer 5

The important thing to remember about variables in C is that the compiler is the only thing that cares about a variable's type. In the end, after the program has been compiled, the variables are nothing more than memory addresses. This means that variables of one type can easily be coerced into behaving like another type by telling the compiler to typecast them into the desired type.

Difference between i++ and ++i

2017-01-11T22:39:00+08:00

I'm starting to read through Hacking: The Art of Exploitation on my four-hour daily commute to work in order to get more comfortable working with C. Along the way, I resolved a puzzle I had about C, as shown in the title.

i++ means increment the value of i by 1 after evaluating the arithmetic operation.

int a, b;
a = 5;
b = a++ * 6;

b will contain 30 and a will contain 6, since b = a++ * 6; is shorthand for the following statements:

b = a * 6;
a = a + 1;

++i means increment the value of i by 1 before evaluating the arithmetic operation.

int a, b;
a = 5;
b = ++a * 6;

b will contain 36 and a will contain 6, since b = ++a * 6; is shorthand for the following statements:

a = a + 1;
b = a * 6;

In fact, the two principles mentioned above apply beyond "evaluating the arithmetic operation". For example, in the stack push operation:

/* Push an element on the Stack
 */
void
push (ET elem, Stack S)
{
  if (isFull(S))
  {
    resizeStack(S);
  }
  S->Array[++S->TopOfStack] = elem;
}

we can do S->Array[++S->TopOfStack] = elem;, which is a nice short version of the following (the increment happens first, then the assignment):

S->TopOfStack = S->TopOfStack + 1;
S->Array[S->TopOfStack] = elem;

Another example is stack topAndPop:

/* Check the top element and pop it out of Stack
 */
ET
topAndPop(Stack S)
{
  return S->Array[S->TopOfStack--];
}

In this case, we essentially do:

ET a = S->Array[S->TopOfStack];
S->TopOfStack = S->TopOfStack - 1;
return a;

Look how clean I can make my code once I understand the difference between ++i and i++.

--- 01/19/2017 UPDATE ---

++i and i++ are really a powerful technique to shorten C code. However, they can be error-prone if we are not careful enough.

Let's take a look at the following code snippet, which is adapted from the program on K & R p.117.

#include <stdio.h>

int main(int argc, char* argv[])
{
  int c; /* current option character */

  while (--argc > 0 && (*++argv)[0] == '-')
    while ((c = *++argv[0]))
      switch (c)
      {
        case 'x':
          printf ("user invokes the program with -x option\n");
          break;
        case 'n':
          printf ("user invokes the program with -n option\n");
          break;
        default:
          printf("illegal option %c\n", c);
          argc = 0;
          break;
      }
  if (argc != 1)
    printf("Usage: find -x pattern\n");
  return 0;
}

This program itself is straightforward. Let's take a look at (*++argv)[0] to see what it means. Since argv is a pointer to the beginning of the array of argument strings, incrementing it by 1 (i.e. ++argv) makes it point at the original argv[1] instead of argv[0]. Then we dereference it to get the argument string that we are currently looking at (i.e. *++argv). Finally, we get its first character by adding [0]. So, we have (*++argv)[0]. For example, suppose we run our program with a.out -x -n pattern. The original argv is {"a.out", "-x", "-n", "pattern"}; after the first ++argv, argv[0] is "-x", argv[1] is "-n", and so on.

The reason we need parentheses in (*++argv)[0] can be seen in the next line, c = *++argv[0]. Since [] binds tighter than * and ++, *++argv[0] parses as *++(argv[0]). argv[0] points to the first char of the argument string that argv is pointing at. We increment argv[0] and then dereference it to get the next character in the argument string. For instance, suppose argv is pointing at -x. Then argv[0] points at -, and we increment and dereference argv[0] to get x, which is assigned to c.

We can see that operator precedence is critical in this case. The precedence levels are listed in the table on p.53 of K & R.

Let's see another example from K & R p. 105.

void strcpy(char *s, char *t)
{
  while ((*s++ = *t++) != '\0')
    ;
}

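In this case, the value of *t++ is the character that t pointed to before t was incremented; the postfix ++ doesn't change t until after this character has been fetched. This makes sense if we consider it from the operator precedence point of view. In K & R's table, the unary operators * and ++ have the same precedence and associate from right to left, so *t++ parses as *(t++) rather than (*t)++. The postfix ++ yields the old value of t, so * dereferences the character t pointed to before the increment, and the increment takes effect as a side effect. The same reasoning applies to *s++ on the left-hand side.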

Modify array inside function in C

2017-01-08T09:23:00+08:00

In this post, I want to write down the lessons I learned about modifying an array inside a function in C, with an example from MAW 3.15.a:

Write an array implementation of self-adjusting lists. A self-adjusting list is like a regular list, except that all insertions are performed at the front, and when an element is accessed by a Find, it is moved to the front of the list without changing the relative order of the other items.

In general, there are two cases when we need to use functions to work with an array. Let's examine them in turn.

Modify the array content

Let's take a look at the following sample function first:

void change(int *array,int length)
{
  printf("array address inside function: %p\n", array);
  int i;
  for(i = 0 ; i < length ; i++)
      array[i] = 5;
}

and in our test function we do:

void test_change()
{
  int i, length = 3;
  int test[3] = {1,2,3};

  printf("Before:");
  print(test, length);
  printf("before change, test address: %p\n", test);
  change(test, 3);
  printf("After:");
  print(test, length);
  printf("after change, test address: %p\n", test);
}

The output looks something like:

Before:1 2 3
before change, test address: 0x7fffffffe050
array address inside function: 0x7fffffffe050
After:5 5 5
after change, test address: 0x7fffffffe050

Let's examine our change function under gdb:

p array
$1 = (int *) 0x7fffffffe050

This shows us that array is actually a pointer to int holding the address 0x7fffffffe050:

(gdb) p *0x7fffffffe050
$3 = 1

If we take a look at the value stored at that address, we can see that it's 1, which is the first element of our int test[3] array. This leads to our very first important observation:

When we pass an array to a function, it decays to a pointer to the first element of the array. In other words, we can do p *array in gdb and get 1 as well.

Since the size of int on my system is 4 bytes (check with p sizeof(int) in gdb), let's examine the four consecutive bytes starting at address 0x7fffffffe050:

(gdb) x/4bx array
0x7fffffffe050: 0x01    0x00    0x00    0x00

As you can see, this is the integer 1. Now, let's start with the first iteration of the loop in change. Once we finish the iteration, i becomes 1; let's see what has changed in our array:

(gdb) p array[0]
$12 = 5

(gdb) p array
$13 = (int *) 0x7fffffffe050

(gdb) p *0x7fffffffe050
$10 = 5

(gdb) x/4bx array
0x7fffffffe050: 0x05    0x00    0x00    0x00

We can see that the first element of our test array becomes 5 and the starting address of our array is still 0x7fffffffe050. In other words, the only thing changed is the value that address 0x7fffffffe050 holds. In addition, if you take a look at the array address output, you can see that before the function call, during the function call, and after the function call, the array address doesn't change at all: 0x7fffffffe050. This leads to our second observation:

We can change the contents of the array in the caller function (i.e. test_change()) through the callee function (i.e. change) by passing the address of the array's first element to the function (i.e. int *array). The modification takes effect in the caller function without any return statement.
However, doing so doesn't change the address of the array. The parameter array is a local variable of the callee: the address held by test is copied into it when test_change calls change:
```
Inside change:

                 +---+---+--+
array ----->  -> | 1 | 2 | 3|
             /-> +---+---+--+
test --------
```

Let's verify above observation with another function change2:

void change2(int *array,int length)
{
  printf("array address inside function: %p\n", array);
  int i;
  int tmp[3] = {5,5,5};
  array = tmp;
}

With a similar test program test_change2(), we get the following output:

TEST: change2
Before:1 2 3
before change, test address: 0x7ffda5b41bc0
array address inside function: 0x7ffda5b41bc0
After:1 2 3
after change, test address: 0x7ffda5b41bc0

change2 is very tempting because we assign tmp to array and expect test inside test_change2 to point to tmp as well. However, this is wrong, and the output confirms our observation above: array is a local variable of the callee, and when we pass an array into a function, its address is passed (copied) from caller to callee. After that, reassigning the address inside the callee has no effect on the array (address) in the caller. In other words, even though array inside change2 and test inside test_change2 start out holding the same address, they are independent of each other:

after change2:

                 +---+---+--+
test  ---------> | 1 | 2 | 3|
                 +---+---+--+

                 +---+---+--+
tmp   ----->  -> | 5 | 5 | 5|
             /-> +---+---+--+
array -------

What if we want to modify test itself inside test_change2, beyond the content of the array? What if we want to resize the array to make it hold more values?

Modify the array itself

Before we answer the above question, let me clear up an important concept: "array on stack" and "array on heap".

An "array on stack" is declared like int test[3] = {1,2,3} in our test routines. An array declared like this lives on the stack and is local to the function call. An "array on heap" is a dynamic array allocated with malloc, which I mention in the previous post. When we talk about resizing the array, we mean the latter case. In other words, we can only change the array itself (the number of elements) for a dynamically allocated array on the heap.

Let's take a look at change3:

void
change3(int **array, int length)
{
  int* tmp = calloc(length, sizeof(int));
  int i;
  for (i = 0; i < length; i++)
  {
    *(tmp+i) = 5;
  }
  free(*array);
  *array = tmp;
}

and our corresponding test routine test_change3():

void test_change3()
{
  printf("TEST: change3\n");
  int i, length = 3;
  int* test = calloc(length, sizeof(int));
  test[0] = 1;
  test[1] = 2;
  test[2] = 3;
  printf("Before:");
  print(test, length);
  printf("before change, test address: %p\n", test);
  change3(&test, length);
  printf("After:");
  print(test, length);
  printf("after change, test address: %p\n", test);
}

The first task is to understand int **array. There is a template sentence when it comes to C type declarations: "<VariableName> is ... <typeName>". In our case, the template sentence becomes "array is ... int". Now let's work out the "..." with the "right-left" rule:

"go right when you can, go left when you must"

In our case, we start with "array" and go right, but nothing is left in the declaration on that side. So, we must go left. The first symbol is *, which reads as "pointer to". So now our template sentence becomes "array is pointer to ... int". Great! Let's continue to go left: we see another *, which makes our sentence "array is pointer to pointer to ... int". Then we meet int, which means all the symbols in the declaration are consumed and our sentence is complete: "array is pointer to pointer to int". This means the array variable itself is a pointer containing the address of a pointer, which in turn holds the address of an int.

Let's see if this is true with gdb:

(gdb) p array
$1 = (int **) 0x7fffffffe070

(gdb) p/a *0x7fffffffe070
$8 = 0x601010

(gdb) p *0x601010
$7 = 1

(gdb) p *array
$2 = (int *) 0x601010

(gdb) p **array
$3 = 1

The address held by array is 0x7fffffffe070. We further examine the value stored at 0x7fffffffe070; by our assumption, it should be another address, and it is: 0x601010. Then, we check the value stored at that address, which is, as expected, 1, the first element of our test array.

Our goal is to let test array in test_change3() be 5,5,5:

Before change3

                 +---+---+--+
test  ---------> | 1 | 2 | 3|
                 +---+---+--+

                 +---+---+--+
tmp   ---------> | 5 | 5 | 5|
                 +---+---+--+


After change3

                       +---+---+--+
tmp   ---------------> | 5 | 5 | 5|
                   /-> +---+---+--+
test(array) -------

From the picture we can see that we want to make array inside change3 point to 5,5,5 and have this change persist in the test array of our caller function. In other words, we no longer want test and array to be independent; we want *array and test to be the same pointer under different names. How do we do that?

The solution is given by change3, but we really need to think about why it makes sense. First, let's use gdb to examine the addresses of the key variables:

(gdb) p array
$4 = (int **) 0x7fffffffe070
(gdb) p *array
$5 = (int *) 0x601010
(gdb) p (*array)+1
$14 = (int *) 0x601014
(gdb) p (*array)+2
$15 = (int *) 0x601018
(gdb) p *(*array)
$18 = 1
(gdb) p *(*array)+1
$16 = 2
(gdb) p *(*array)+2
$17 = 3

(gdb) p tmp
$7 = (int *) 0x601030
(gdb) p tmp+1
$8 = (int *) 0x601034
(gdb) p tmp+2
$9 = (int *) 0x601038
(gdb) p *tmp
$10 = 5
(gdb) p *(tmp+1)
$11 = 5
(gdb) p *(tmp+2)
$12 = 5

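We first print the address of each element reached through array, and then the address of each element of tmp. (Note that *(*array)+1 parses as (**array) + 1, so it prints 2 and 3 here only because the elements happen to be consecutive integers; *((*array)+1) is the expression that actually reads the second element.) With the information above, let's compose our conceptual picture: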

Before *array = tmp;

   4 bytes                                         4 bytes
+-----------+-----------+----------+------------+-----------+----------+--------+-------+----------+------
|  1        | 2         | 3        |   ...      |    5      |     5    |  5     |  ...  | 0x601010 | ...
+-----------+-----------+----------+------------+-----------+----------+--------+-------+----------+------
^           ^           ^                       ^           ^          ^                ^
0x601010   0x601014     0x601018                0x601030    0x601034   0x601038         0x7fffffffe070
                                                tmp                                     array

Now, let's execute *array = tmp, we get the following:

(gdb) p *array
$19 = (int *) 0x601010
(gdb) p *array
$20 = (int *) 0x601030

Now the picture looks like:

After *array = tmp;

   4 bytes                                         4 bytes
+-----------+-----------+----------+------------+-----------+----------+--------+-------+----------+------
|  1        | 2         | 3        |   ...      |    5      |     5    |  5     |  ...  | 0x601030 | ...
+-----------+-----------+----------+------------+-----------+----------+--------+-------+----------+------
^           ^           ^                       ^           ^          ^                ^
0x601010   0x601014     0x601018               0x601030    0x601034   0x601038        0x7fffffffe070
                                               tmp                                    array

We don't modify array itself (it still holds 0x7fffffffe070) but the content stored at 0x7fffffffe070, which is no longer 0x601010 but 0x601030, the starting address of tmp: 5,5,5. This may seem like magic. However, in C, a variable (e.g. test in test_change3()) is merely a name for a memory location. By invoking change3 with &test, we pass in the location of test (0x7fffffffe070), which acts as a carrier for the address 0x601010; change3 overwrites the carried address with 0x601030, and the caller sees the new address through test.

With this understanding, we can see why the output looks like:

TEST: change3
Before:1 2 3
before change, test address: 0x601010
After:5 5 5
after change, test address: 0x601030

Hopefully, after our examination, we can understand arrayInsert for MAW 3.15.a, proposed at the beginning of the post:

void
arrayInsert(int elem, int** list, int length)
{
  *list = realloc(*list, sizeof(int) * (length+1));
  int i;
  for (i = 0; i < length; i++)
  {
    (*list)[length - i] = (*list)[length-i-1];
  }
  *((*list)) = elem;
}

Get the complete source code.

Reference

If you would like to read more about decoding C type declarations, you can start here:
- Reading C type declarations
- Right-left rule to understand C type declaration
- Chapter 3 in "Expert C Programming" by Peter Van Der Linden

Josephus Problem & Radix Sort Reflection

2017-01-06T00:34:00+08:00

This post is a reflection on the two problems (MAW 3.13 & 3.10) I have been working on for the past five days.

Dynamic arrays in C

One of the ways I try to solve the Josephus Problem is to use a circular doubly linked list, which is implemented in ET circularDoubleLinkedListJosephus(ET N, ET M). Inside the function, here is what I try to do initially:

ET people[N] = {0};
for (i = 0; i < N; i++)
{
  people[i] = i + 1;
}

I try to make an array of consecutive numbers based upon the input N. However, this doesn't work in C because the compiler does not know the size of the array at compile time, and an array whose size is only known at run time cannot be given an initializer like {0}. This is what people call a dynamic array in C. The following two pages offer excellent explanations of dynamic arrays specifically and arrays in C in general:

Here is what I've done eventually:

people = calloc(N, sizeof(int));

for ( i = 0; i < N; i++)
{
  *(people + i) = i + 1;
}

There is one thing to note about calloc. It is essentially the same as malloc in terms of allocating a chunk of memory for an array. However, unlike malloc, calloc zero-initializes the chunk, which is quite useful when we work with arrays of integers. In other words, people = malloc(N*sizeof(int)); is perfectly fine in this case, but calloc gives us more control over the array content, which is especially useful when we debug.

Circularly Linked Lists

In MAW, the author is in a bit of a rush in this section. When it comes to implementation, how we deal with the header node needs to be carefully thought through. This is what is stated in the book:

A popular convention is to have the last cell keep a pointer back to the first. This can be done with or without a header (if the header is present, the last cell points to it) and can also be done with doubly linked lists (the first cell's previous pointer points to the last cell).

Here is my circularly double linked list in picture:

In words, our dummy node's Next points to the first data node and its Prev points to the last data node. With this setup, the head of the list can be accessed by dummy.Next and the tail by dummy.Prev. In addition, there will never be NULL pointers in the data structure.

Initialize an array of structs

When implementing the radix sort solution, I need to construct an array of buckets, where each bucket is a singly linked list with a dummy node. Here is what I do:

static List
makeEmptyArrayOfNodes(int numBuckets)
{
  Pos Buckets = malloc(numBuckets * sizeof(struct Node));

  int k;
  for (k = 0; k < numBuckets; k++)
  {
    Buckets[k].Next = NULL;
  }

  return Buckets;
}

Here are two points worth noting:

- Pos is struct Node* (this can be checked with ptype in GDB), so Buckets points to the first element of the array of struct Node that we allocate.
- There is a difference between -> and the dot operator (.). In K&R Page 131, it says that:

If p is a POINTER to a structure, then p->member-of-structure refers to the particular member.

whereas . is used to directly access a structure member. In my case, since Buckets[k] has type struct Node, I need to use the dot operator. If I want to use ->, I need to write (&Buckets[k])->Next.

For loop instead of while

I try to experiment with different tricks when I work on my algorithms. Here is one: use a for loop instead of a while loop:

void
deleteNode(ET elem, List L)
{
  Pos dummyL = L->Next;
  Pos dummyPrev = L;

  for(; dummyL != NULL; dummyPrev = dummyL, dummyL = dummyL->Next)
  {
    if (dummyL->Element == elem)
    {
      Pos tmp = dummyL;
      dummyPrev->Next = dummyL->Next;
      free(tmp);
      return;
    }
  }
}

Use system implementation if found, otherwise use my own version

I'm trying to use fls inside int cyclicShiftJosephus(int N, int M); fls finds the last (most significant) bit set in a value and returns the index of that bit. However, not every system ships fls by default, so I implement my own version. I would still prefer the program to use the system version if it can find one and fall back to mine otherwise.

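One solution is to use #ifndef, with a structure like the following. Note that this only detects fls when the system defines it as a macro; a plain function declaration is invisible to the preprocessor.

#ifndef fls
int fls(int mask) { ... }
#endif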

Another solution is to use a weak symbol, although this may not be portable. It looks something like this:

int  __attribute__((weak)) fls(int mask){ .. }

If system fls is defined as strong, my fls implementation will be overridden.

Josephus problem

2016-12-31T20:24:00+08:00

Preface

This is actually MAW 3.10. I gradually realize how dense MAW is. In the previous problem, I wrote almost 500 lines of code. For this one, the problem is not really difficult to solve if we implement a program that follows the game rules exactly. However, I figure it is a good chance to dig a little deeper and learn as much as possible from the question.

Let's start to dive in.

Overview

I first describe the Josephus problem in general. Then, I present a closed form solution to solve a special case of the original problem. Afterwards, I present a recurrence solution to solve the general problem.

Josephus problem

The Josephus problem is the following game: $N$ people, numbered $1$ to $N$, are sitting in a circle. Starting at person 1, a hot potato is passed. The $M$th person holding the hot potato is eliminated, the circle closes ranks, and the game continues with the person who was sitting after the eliminated person picking up the hot potato. The last remaining person wins. Thus, if $M = 1$ and $N = 5$, players are eliminated in order, and player 5 wins. If $M = 2$ and $N = 5$, the order of elimination is $2$, $4$, $1$, $5$, and player 3 wins.

You can play with different $M$, $N$ on this site to get a better sense of the problem.

Note

MAW uses a problem description that is slightly different from the one usually found online. In the book, he defines $M$ in terms of the number of passes. However, in our problem description, we use $M$ to indicate that the $M$th person gets eliminated. Here is an example to show the difference. In the MAW description, with $M = 0$ and $N = 5$, players are eliminated in order. In our own interpretation, $M$ should be $1$ in order to achieve the same elimination order. Similarly, in the book, $M = 1$ gives the $2$, $4$, $1$, $5$ elimination order of the second example. Mathematically, $M_{new} = M_{MAW} + 1$.

Josephus problem with $M = 2$

Let's first discuss a special case of the Josephus Problem: $M = 2$.

In the following, $n$ denotes the number of people in the initial circle, and $m$ denotes the count for each step. In other words, $m-1$ people are skipped and the $m$-th is eliminated. The people in the circle are numbered from $1$ to $n$. Our goal is to find $J(n,m)$, which denotes the survivor's number (e.g. $J(5,2) = 3$). For simplicity, let $F(n) = J(n,2)$.

One quick observation is that after the first go-round, we are left with the same problem but for a different number of people. For instance, when $n = 10$, the first go-round eliminates $2$, $4$, $6$, $8$, $10$, and the second round starts again from person $1$ with $1$, $3$, $5$, $7$, $9$ remaining, which is the same problem as the original one with $5$ people. The only difference is the numbering: the person with number $3$ in the first round becomes number $2$ in the second round.

Case 1: When $n$ is even ...

Let $n = 2k$. After the first round we are left with $k$ people, and we try to find out what $F(k)$ is. In addition, by our observation, the numbering of people is changed: the person numbered $j$ in the second round was numbered $2j - 1$ in the first round. If $3$ is actually the answer (i.e. $F(2k) = 3$), then in the second round the original person $3$ becomes $2$ (i.e. $F(k) = 2$). So, we have

$$ \begin{equation} F(2k) = 2F(k) - 1, \text{ for } k \geq 1 \label{eq:1} \end{equation} $$

Case 2: When $n$ is odd ...

Let $n = 2k+1$. By the same reasoning as case 1, after the first round we are again left with $k$ people. For instance, when $n = 9$, the first round eliminates $2$, $4$, $6$, $8$, $1$. In other words, $1$ is eliminated just after person number $2k$, and the next round starts from person $3$, so the person numbered $j$ in the second round was numbered $2j + 1$ in the first round. So, we have

$$ \begin{equation} F(2k+1) = 2F(k) + 1, \text{ for } k \geq 1\label{eq:2} \end{equation} $$

So now our goal is to solve the recurrence equations \ref{eq:1} and \ref{eq:2} given $F(1) = 1$ to find a closed form. Let's do this by building a table of small values:

| n    | 1 | 2   3 | 4   5   6   7 | 8   9   10   11   12   13   14   15 | 16 |
|------|---|-------|---------------|-------------------------------------|----|
| F(n) | 1 | 1   3 | 1   3   5   7 | 1   3   5    7    9    11   13   15 | 1  |

We can group the columns by powers of $2$ (marked by vertical lines in the table); inside each group, $F(n)$ is always $1$ at the beginning and then increases by $2$ until the next group, which starts at the next power of $2$. For every number $n \geq 1$, there exists an integer $a$ such that $2^a \leq n < 2^{a+1}$, so we can write $n = 2^a + l$ for some $0 \leq l < 2^a$. In other words, $2^a$ is the largest power of 2 not exceeding $n$ and $l$ is what's left. Then, from the table above, we may guess the formula:

$$ \begin{equation} F(n) = F(2^a + l) = 2l + 1 \label{eq:3} \end{equation} $$

Now, let's prove equation \ref{eq:3} by induction on $a$.

Base case. When $a = 0$, we must have $n = 1$ and $l = 0$; the formula gives $F(1) = 2 \cdot 0 + 1 = 1$, which is true.
Induction. We use strong induction: assume that the equation holds for all values smaller than $a$ (with $a \geq 1$), and consider $a$ itself. The induction step has two parts, depending on whether $n$ (and thus $l$) is even or odd.
- If $2^a + l = 2k$, then
  
  $$ \begin{align*} F(2^a + l) &= 2F(2^{a-1} + l/2) - 1 &&\text{(by equation 1)} \\ &= 2(2l/2 + 1) - 1 &&\text{(by induction hypothesis)} \\ &= 2l + 1 \end{align*} $$
- If $2^a + l = 2k+1$, then
  
  $$ \begin{align*} F(2^a + l) &= 2F(2^{a-1} + (l-1)/2) + 1 &&\text{(by equation 2)} \\ &= 2(2(l-1)/2 + 1) + 1 &&\text{(by induction hypothesis)} \\ &= 2l + 1 \end{align*} $$

This completes the induction step.

Let's revisit our closed form solution \ref{eq:3} again. Let's rewrite it into the form:

$$F(n) = 2 (n - 2^a) + 1$$

$n - 2^a$ is the same as zeroing the most significant bit of $n$. Multiplying the result by $2$ is the same as shifting left one place, and adding $1$ is the same as setting the lowest bit to $1$. In other words, equation \ref{eq:3} is essentially a one-bit cyclic left shift. Let's try to write this out formally. Let $n = (b_ab_{a-1}..b_1b_0)_2$, then we have:

$$ F(n) = F((b_ab_{a-1}..b_1b_0)_2) = (b_{a-1}...b_1b_0b_a)_2 \text{ and } b_a = 1$$

For a more rigorous derivation of this cyclic shift property, please reference Concrete Mathematics: A Foundation for Computer Science.
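As a minimal sketch of the closed form above (not taken from the book), the function below finds $2^a$, zeroes the most significant bit, shifts left, and sets the lowest bit. The name `josephus2` and the choice of `uint64_t` are assumptions made for illustration; the function assumes $n \geq 1$.

```c
#include <stdint.h>

/* Survivor F(n) of the m = 2 Josephus problem with people numbered 1..n.
 * Hypothetical helper; assumes n >= 1.
 * Implements F(2^a + l) = 2l + 1. */
uint64_t josephus2(uint64_t n)
{
    uint64_t msb = 1;

    /* Find 2^a, the largest power of 2 not exceeding n. */
    while (msb <= n / 2)
    {
        msb *= 2;
    }

    uint64_t l = n - msb; /* zero the most significant bit */
    return 2 * l + 1;     /* shift left one place, set the lowest bit */
}
```

For the running example, `josephus2(10)` evaluates to `5`, since $10 = 2^3 + 2$ and $2 \cdot 2 + 1 = 5$.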

The way we solve the Josephus problem with $m = 2$ is unlikely to generalize to arbitrary $m$. Let's take the $n = 10$, $m = 2$ example again. The reason we can derive the nice recurrence equations \ref{eq:1} and \ref{eq:2} is our observation. Let's present our observation in a different way. $F(2k)$ denotes the old numbering before the first-round. $F(k)$ denotes the new numbering after the first-round.

      m = 2               m = 3
+-------+------+    +-------+------+
| F(2k) | F(k) |    | F(2k) | F(k) |
+-------+------+    +-------+------+
| 1     | 1    |    | 1     | 1    |
+-------+------+    +-------+------+
| 3     | 2    |    | 2     | 2    |
+-------+------+    +-------+------+
| 5     | 3    |    | 4     | 3    |
+-------+------+    +-------+------+
| 7     | 4    |    | 5     | 4    |
+-------+------+    +-------+------+
| 9     | 5    |    | 7     | 5    |
+-------+------+    +-------+------+
                    | 8     | 6    |
                    +-------+------+
                    | 10    | 7    |
                    +-------+------+

By looking at the table on the left, we can easily see that $F(2k) = 2F(k) - 1$. However, there is no nice clean linear relation that we can get between $F(2k)$ and $F(k)$ when $n = 10$, $m = 3$.

Note

Inside Concrete Mathematics: A Foundation for Computer Science, after talking about the solution to the Josephus problem, the authors shift their focus to solving a generalized recurrence of \ref{eq:1} and \ref{eq:2}, which is (1.11) in the book. This has nothing to do with the Josephus problem, and I'm guessing the reason why the authors want to talk about the solution to the generalized recurrence is to illustrate the dynamic programming philosophy.

General solution

The big picture here is that we need to find the position of the final survivor relative to the "first" person during each recursive call and then calculate the actual position for the actual $n$.

The general solution utilizes the dynamic programming paradigm by performing the first step and using the solution of the subproblem we create to solve the initial problem. In terms of the solution, there is a difference depending on whether we number the first person as $1$ or $0$.

Starting from 1

The key insight is the following: the result of $J(n,m)$ is best NOT thought of as the number that is the Josephus survivor, but rather as the index of the number that is the Josephus survivor.

Let's first take a look at an example when $n = 6$ and $m = 2$.

fig.1

  1 2      1 X      1 X      1 X      1 X      X X
 6   3 -> 6   3 -> 6   3 -> X   3 -> X   X -> X   X
  5 4      5 4      5 X      5 X      5 X      5 X

fig.2

| index | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|
| n = 6 | 1 | 2 | 3 | 4 | 5 | 6 | J(6,2) = 5
| n = 5 | 3 | 4 | 5 | 6 | 1 | 3 | J(5,2) = 3
| n = 4 | 5 | 6 | 1 | 3 | 5 | 6 | J(4,2) = 1
| n = 3 | 1 | 3 | 5 | 1 | 3 | 5 | J(3,2) = 3
| n = 2 | 5 | 1 | 5 | 1 | 5 | 1 | J(2,2) = 1
| n = 1 | 5 | 5 | 5 | 5 | 5 | 5 | J(1,2) = 1

fig.3

| index | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|
| n = 6 | 1 | X | 3 | 4 | 5 | 6 | J(6,2) = 5 = (2-1 + 3) mod 6 + 1
| n = 5 | 3 | X | 5 | 6 | 1 | 3 | J(5,2) = 3 = (2-1 + 1) mod 5 + 1
| n = 4 | 5 | X | 1 | 3 | 5 | X | J(4,2) = 1 = (2-1 + 3) mod 4 + 1
| n = 3 | 1 | X | 5 | 1 | X | 5 | J(3,2) = 3 = (2-1 + 1) mod 3 + 1
| n = 2 | 5 | X | 5 | X | 5 | X | J(2,2) = 1 = (2-1 + 1) mod 2 + 1
| n = 1 | 5 | 5 | 5 | 5 | 5 | 5 | J(1,2) = 1

By looking at fig.1, we know that $J(6,2) = 5$. Now, if we take a look at fig.2, the row with $n = 5$ shows that $J(5,2) = 3$. By the insight, $3$ here means the index, not the number. So, our final survivor is $5$, which is positioned at index $3$ in this row.

Let's generalize the example a little bit. Suppose we want to know $J(n,2)$. You can imagine we have $n$ people lined up like this:

1 2 3 4 5 ... n

The first thing that happens is that person $2$ gets eliminated, as shown here:

1 X 3 4 5 ... n

Now, we are left with a subproblem of the following form: there are $n - 1$ people remaining, every other person is going to be eliminated, and the first person who will start to pass potato is person $3$. In other words, the subproblem $J(n-1, 2)$ now looks like:

3 4 5 ... n 1

$J(n-1, 2)$ will be the index of who survives in a line of $n - 1$ of people. Given that we have the index of the person who will survive, and we also know who the starting person is, we can determine which person will be left. Here's how we'll do it.

The starting person in this line is the person who comes right after the person who was last executed. This will be person $3$. The 1-indexed position of the survivor in the ring of $n-1$ people is given by $J(n-1, 2)$. Since the starting person already sits at index $1$, we walk forward $J(n-1, 2) - 1$ positions, wrapping around the ring if necessary, to get our final position. In other words, the survivor is given by position

$$ \begin{equation} (3 + J(n-1, 2) - 1) \bmod n \label{eq:4} \end{equation} $$

Let's take a look at $n = 5$ in fig.2 again. The starting position is $3$; we walk forward by $J(5,2) - 1$ steps (i.e. $2$ steps) and we get the final survivor, which is $5$. The reason we take $\bmod n$ is that we want to keep the final survivor within the bounds of the circle.

However, there is a problem with our equation \ref{eq:4}. If we are indeed using one-indexing, what happens if the final survivor is at position $n$? In that case, we would accidentally get back position $0$ as our answer, but we really want position $n$. For example, suppose $J(5,2) = 4$. In other words, the final survivor is $6$, which is positioned at $4$ when $n = 5$. Then, to apply equation \ref{eq:4}, we get $0$, which is not $6$.

To fix this issue, we'll use a trick for using mod to wrap around with one-indexing: we'll take the inside quantity (the one-indexed position) and subtract one to get the zero-indexed position. We'll mod that quantity by $n$ to get the zero-indexed position wrapped around. Finally, we'll add back one to get the one-indexed position, wrapped around. That looks like:

$$(3 + J(n-1, 2) - 2) \bmod n + 1$$

In other words, $-2$ term comes from two independent $-1$'s: the first $-1$ is because $J(n-1, 2)$ returns a one-indexed index, so to step forward by the right number of positions we have to take $J(n-1,2) - 1$ steps forward. The second $-1$ comes from the fact that we're using one-indexing rather than zero-indexing.

Now, we're finally ready to generalize the solution to arbitrary $m$, not just $m = 2$. After person $m$ gets eliminated, we have an array like this:

1 2 3 ... m-1 X m+1 ... n

We now essentially need to solve a subproblem where person $m+1$ comes first:

m+1 m+2 ... n 1 2 ... m-1

So we compute $J(n-1, m)$ to get the one-indexed survivor of a ring of $n-1$ people, then shift forward by that many steps:

$$(m+1 + J(n-1, m) - 1)$$

We need to worry about the case where we wrap around, so we need to mod by $n$:

$$(m+1 + J(n-1, m) - 1) \bmod n$$

However, we're one-indexed, so we need to use the trick of subtracting $1$ from the inside quantity and then adding $1$ at the end:

$$(m+1 + J(n-1, m) - 2) \bmod n + 1$$

which simplifies to:

$$(m-1 + J(n-1, m)) \bmod n + 1$$

Notice that $J(1,m) = 1$, which indicates that we're one-indexed.
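The one-indexed recurrence can be sketched directly as a recursive C function. The name `josephus_one_indexed` is an assumption made for illustration, and the sketch assumes $n \geq 1$ and $m \geq 1$ with values small enough that `int` does not overflow.

```c
/* One-indexed survivor J(n, m): n people numbered 1..n, every m-th
 * person is eliminated. Hypothetical helper; assumes n >= 1, m >= 1. */
int josephus_one_indexed(int n, int m)
{
    if (n == 1)
    {
        return 1; /* base case J(1, m) = 1 */
    }

    /* (m - 1 + J(n-1, m)) mod n + 1, as derived above */
    return (m - 1 + josephus_one_indexed(n - 1, m)) % n + 1;
}
```

With this sketch, `josephus_one_indexed(6, 2)` evaluates to `5` and `josephus_one_indexed(5, 2)` to `3`, matching the rows of fig.3.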

Starting from 0

Now we are zero-indexed. Our $J(6,2)$ example looks like the following:

| index | 0 | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|---|
| n = 6 | 1 | 2 | 3 | 4 | 5 | 6 | J(6,2) = 4 = (2 + 2 ) mod 6
| n = 5 | 3 | 4 | 5 | 6 | 1 | 3 | J(5,2) = 2 = (0 + 2 ) mod 5
| n = 4 | 5 | 6 | 1 | 3 | 5 | 6 | J(4,2) = 0 = (2 + 2 ) mod 4
| n = 3 | 1 | 3 | 5 | 1 | 3 | 5 | J(3,2) = 2 = (0 + 2 ) mod 3
| n = 2 | 5 | 1 | 5 | 1 | 5 | 1 | J(2,2) = 0 = (0 + 2 ) mod 2 
| n = 1 | 5 | 5 | 5 | 5 | 5 | 5 | J(1,2) = 0

Let's apply the same logic from the one-indexed case. After person $m-1$ gets eliminated, we have an array like this:

0 1 2 ... m-2 X m ... n-1

We now essentially need to solve a subproblem where person $m$ comes first:

m m+1 ... n-1 0 1 2 ... m-2

So, we compute $J(n-1,m)$ to give us the zero-indexed survivor of a ring of $n-1$ people and we shift forward by that many steps:

$$(m + J(n-1,m))$$

We take care of wrapping around by mod $n$:

$$(m + J(n-1,m)) \bmod n$$

Since we are zero-indexed, we are done. If we want to transform our answer to one-indexed, we can do:

$$(m + J(n-1,m)) \bmod n + 1$$

Note that $J(1,m) = 0$ in this case, which indicates that we're zero-indexed.
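Because each $J(n,m)$ depends only on $J(n-1,m)$, the zero-indexed recurrence can also be unrolled bottom-up. The following is a minimal sketch under that reading; the name `josephus_zero_indexed` is hypothetical, and it assumes $n \geq 1$ and $m \geq 1$.

```c
/* Zero-indexed survivor J(n, m): n people numbered 0..n-1, every m-th
 * person is eliminated. Hypothetical helper; assumes n >= 1, m >= 1. */
int josephus_zero_indexed(int n, int m)
{
    int survivor = 0; /* base case J(1, m) = 0 */

    /* Build up from a ring of size 2 to a ring of size n using
     * J(size, m) = (m + J(size-1, m)) mod size. */
    for (int size = 2; size <= n; size++)
    {
        survivor = (m + survivor) % size;
    }

    return survivor;
}
```

Here `josephus_zero_indexed(6, 2)` evaluates to `4`, which is the one-indexed answer `5` minus one. The loop uses constant extra space, whereas a recursive version needs a call stack of depth $n$.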

What's left out

The Equivalence Class Solution is interesting to check out.
The rank tree as a data structure is worth checking out as a way to solve this problem. This paper gives a more detailed analysis.

A peek in code optimization

2016-12-28T13:21:00+08:00

Quite often, when I take a look at a solution to a programming question, I'm amazed by how succinct the provided solution is. However, getting an "optimized" solution often takes an iterative approach. This is something that I didn't realize until I started to work in the industry.

This post is mainly a reminder to myself about this point: We don't have to give a perfect solution right away. We can provide a solution and gradually make it better.

The example I show here is integerList add(integerList A, integerList B), which is part of the MAW 3.9 integer arithmetic package question.

integerList
add(integerList A, integerList B)
{
  PtrToNode dummyA = A->NextDigit;
  PtrToNode dummyB = B->NextDigit;
  integerList R = malloc(sizeof(struct Node));
  PtrToNode dummyR = R;
  int digitSum = 0;
  int carry = 0;

  while (dummyA != NULL && dummyB != NULL)
  {
    digitSum = dummyA->Digit + dummyB->Digit + carry;
    if (digitSum < 10)
    {
      addDigit(digitSum, dummyR);
      carry = 0;
    }
    else
    {
      carry = 1;
      addDigit(digitSum-10, dummyR);
    }
    dummyA = dummyA -> NextDigit;
    dummyB = dummyB -> NextDigit;
    dummyR = dummyR -> NextDigit;
  }

  // example case: 342 + 706
  if (carry == 1 && dummyA == NULL && dummyB == NULL)
  {
    addDigit(carry, dummyR);
    dummyR = dummyR->NextDigit;
  }

  while(dummyA != NULL)
  {
    addDigit(dummyA->Digit + carry, dummyR);
    carry = 0;
    dummyA = dummyA->NextDigit;
    dummyR = dummyR->NextDigit;
  }

  while(dummyB != NULL)
  {
    addDigit(dummyB->Digit + carry, dummyR);
    carry = 0;
    dummyB = dummyB->NextDigit;
    dummyR = dummyR->NextDigit;
  }

  return R;
}

The idea for this first iteration solution stems from MAW 3.5 List unionSortedLists(List L, List P):

Given two sorted lists, L1 and L2, write a procedure to compute L1 union L2 using only the basic list operations.

Since we put the least significant digit as the very first data node and the most significant digit as the last data node, we walk through the list starting from the unit digit. If you compare this routine with the unionSortedLists routine, you can easily see that both routines are composed of three while loops. This makes sense because union and add are extremely similar mathematically.

First we start by adding the unit digits. If both numbers have the same number of digits, then we are done after the first while loop. There is a special case where we still have a carry after we have processed all the digits. If the two numbers don't have the same number of digits, then we just move the extra digits to the result.

Let's see how we can optimize this code.

In the solution, we build the cases around the number of digits that the operands have. However, this is unnecessary: when two numbers have a different number of digits, we can add leading zeros to the number with fewer digits. This makes adding two numbers with a different number of digits the same as adding two numbers with the same number of digits. So, we eliminate the latter two while loops and only need to keep the first while loop from the original solution.

Here is the final result.

integerList
add(integerList A, integerList B)
{
  PtrToNode dummyA = A->NextDigit;
  PtrToNode dummyB = B->NextDigit;
  integerList R = makeEmpty();
  PtrToNode dummyR = R;
  int digitSum = 0;
  int carry = 0;
  int x, y;

  while (dummyA != NULL || dummyB != NULL)
  {
    (dummyA != NULL) ? (x = dummyA->Digit) : (x = 0);
    (dummyB != NULL) ? (y = dummyB->Digit) : (y = 0);

    digitSum = x + y + carry;
    carry = digitSum / 10;
    addDigit(digitSum % 10, dummyR);

    if (dummyA != NULL) dummyA = dummyA -> NextDigit;
    if (dummyB != NULL) dummyB = dummyB -> NextDigit;
    dummyR = dummyR -> NextDigit;
  }

  // example case: 342 + 706
  if (carry == 1)
  {
    addDigit(carry, dummyR);
    dummyR = dummyR->NextDigit;
  }

  return R;
}

Reflection on integer arithmetic package problem

2016-12-26T23:03:00+08:00

This weekend, I'm working on MAW 3.9. The single problem results in almost 500 lines of code. This is quite unexpected. The problem is stated as the following:

Write an arbitrary-precision integer arithmetic package. You should use a strategy similar to polynomial arithmetic. Compute the distribution of the digits $0$ to $9$ in $2^{4000}$.

This post is the reflection about this problem.

Which way to go?

Since the problem states "arbitrary-precision" and "use a strategy similar to polynomial arithmetic", I can conclude that a linked list is the best data structure for this problem. However, the question is how we can construct the linked list to best implement our integer arithmetic operations (i.e. addition, multiplication).

We essentially have two options:

We put the most significant digit as the very first data node and we put the least significant digit as the last data node. For example, for a number $123$, we will implement it like dummy->1->2->3.
This is the exact opposite of the first option. We put the least significant digit as the very first data node and we put the most significant digit as the last data node. Again, for $123$, we will implement it like dummy->3->2->1.

Let's evaluate these two options from two perspectives:

Can we easily construct a linked list to represent an arbitrary-precision integer?
Are the arithmetic operations easy to implement?

From the first perspective, for option one, each time we add a new digit to the most significant position, we insert a new node at the very beginning of the list (i.e. right after the header node). On the other hand, for option two, we append a new node at the very end of the list. Since we design our addDigit with an input of a pointer to node (i.e. to specify where to add node), these two options work equally well.

From the second perspective, things are different. Take addition as an example. When we try to add two numbers, for option one, we need to walk through the whole list to reach the last node because we want to start with the unit digit. This makes our routine complex because we need to use a while loop to walk through the list first. For the second option, the situation is easier because the number is stored in reverse order in the list. The very first data node is the unit digit and we can directly start the addition while we move towards the end of the list. If we need to add an additional node because of a carry (i.e. $999 + 1$ is no longer a 3-digit but a 4-digit number), we can naturally pass the pointer to the current node to the addDigit function.

So, we choose option two to implement our integer package.
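To make option two concrete, here is a minimal, self-contained sketch that stores a number least-significant digit first behind a header node. The `struct Node` layout mirrors the field names used in the snippets of these posts, but the helpers `digits_from_unsigned` and `free_digits` are hypothetical names introduced only for this illustration; they are not part of the package.

```c
#include <stdlib.h>

/* Assumed node layout: a header node followed by one node per digit. */
struct Node
{
    int Digit;
    struct Node *NextDigit;
};

/* Frees the header node and every digit node after it. */
void free_digits(struct Node *list)
{
    while (list != NULL)
    {
        struct Node *next = list->NextDigit;
        free(list);
        list = next;
    }
}

/* Builds header->d0->d1->... where d0 is the unit digit (option two).
 * Returns NULL if an allocation fails. */
struct Node *digits_from_unsigned(unsigned value)
{
    struct Node *header = malloc(sizeof *header);
    if (header == NULL)
    {
        return NULL;
    }
    header->Digit = 0;
    header->NextDigit = NULL;

    struct Node *tail = header;
    do
    {
        struct Node *node = malloc(sizeof *node);
        if (node == NULL)
        {
            free_digits(header);
            return NULL;
        }
        node->Digit = (int)(value % 10); /* peel off the unit digit */
        node->NextDigit = NULL;
        tail->NextDigit = node;          /* append at the end */
        tail = node;
        value /= 10;
    } while (value != 0);

    return header;
}
```

For $123$ this produces header->3->2->1, so an addition routine can start at the first data node and carry towards the end of the list.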

Memory leak

Memory leaks are a very important issue to pay attention to during the testing phase. We use valgrind to help us detect whether there is any leak in our code. You can reference their quick start guide and memory check user manual for the commands and troubleshooting.

Here are the two mistakes I made (You can check out my commit about memory leak debug):

Always free the chunk allocated by malloc whenever possible.

Take multiply function as an example:

 integerList
 multiply(integerList A, integerList B)
 {
   PtrToNode dummyA = A->NextDigit;
   PtrToNode dummyB = B->NextDigit;

   integerList tmpR = makeEmpty();
   PtrToNode dummyTmpR = tmpR;

   integerList R = makeEmpty();

   int product, carry = 0;
   int i, indent = 0;

   while (dummyA != NULL)
   {
     while (dummyB != NULL)
     {
       product = dummyA->Digit * dummyB->Digit + carry;
       carry = product / 10;
       addDigit(product % 10, dummyTmpR);
       dummyTmpR = dummyTmpR->NextDigit;
       dummyB = dummyB->NextDigit;
     }

     if (carry > 0)
     {
       addDigit(carry, dummyTmpR);
       dummyTmpR = dummyTmpR->NextDigit;
     }

     for(i = 0; i < indent; i++)
     {
       addDigit(0,tmpR);
     }

     integerList tmp = R; // prevent memory leak
     R = add(R, tmpR);
     deleteAll(tmp);

     indent ++;
     carry = 0;
     deleteIntegerList(tmpR);
     dummyTmpR = tmpR;
     dummyA = dummyA->NextDigit;
     dummyB = B->NextDigit;
   }

   deleteAll(tmpR);
   return R;
 }

We allocate tmpR through makeEmpty() in Line[7]. If we don't do anything about it inside the function, then the memory will be lost because we have no way to reference this chunk of memory outside the function. Local variable tmpR is the only reference to the memory allocated on the heap. However, once the function is done, the local variable is destroyed from the stack, and thus, we lose our only reference to the memory chunk. So, we need to free it before we exit the function (Line[49]).

Be careful with a function call inside a function call.

This type of leak is much more subtle than the first one. Originally instead of

integerList tmp = R;
R = add(R, tmpR);
deleteAll(tmp);

I only had R = add(R, tmpR). This causes the leak for the following reason: Originally, R points to a list of nodes. When we do add(R,tmpR), we create a new list of nodes, which holds our addition result. Then we let R point to this newly-created list. This makes us lose the list of nodes originally pointed to by R. That's why we introduce tmp.

makeEmpty ?

Originally, I don't have this makeEmpty function:

integerList
makeEmpty()
{
  integerList R = malloc(sizeof(struct Node));
  R->NextDigit = NULL; // super important step
  return R;
}

If you take a look at this function, it seems to be a wrapper around the malloc operation, which seems redundant (we could call malloc directly wherever makeEmpty appears). However, the key to this routine is R->NextDigit = NULL;. This step is easy to omit. However, without it, we don't have full control over what our newly-allocated empty list (i.e. a list with only a header node) looks like. In other words, without our key step the header node's R->NextDigit holds an indeterminate value. This can cause serious trouble when debugging the routines that follow. For example, R->NextDigit could hold an address that happens to have a node structure there with a value in it, for instance dummy->1. This can happen when a memory chunk you previously freed gets reused. For example, try the following experiment:

Replace makeEmpty on Line[7] & Line[10] in the multiply function with a direct malloc call.
multiply works fine when test_multiply() runs alone in the test program.
multiply won't work if we run test_intializeInteger() and test_add() before test_multiply(), because the integer we construct will no longer be 342 in the test case but something like 3425, where 5 is some value pointed to by R->NextDigit.

So, always clear out the pointer by setting it to NULL whenever we do initialization.

A small C trick I learned today

2016-12-24T23:11:00+08:00

Today I learned a C trick. Here is my original printList:

void
printList(List L)
{
  Pos dummy; // creates a dummy node to traverse the list

  dummy = L->Next;

  while (dummy != NULL)
  {
    printf("%d->", dummy->Element);
    dummy = dummy->Next;
  }
}

It works but there is a small caveat in this routine. This is part of print out for the linkedListTestMain:

TEST: printList
23->44->45->57->89->-1->

As you can see, there is a little -> at the end of linked list, which is not supposed to be there because there is no next element after -1.

I tried to solve this problem, but the solution was not succinct and I didn't want to do complicated stuff just to remove this ->. However, I finally got a solution today that cleanly eliminates -> without adding complexity to the routine.

In C, we know we can use the if-else shorthand like the following:

int x;
if (dummyA != NULL)
{
  x = dummyA->Digit;
}
else
{
  x = 0;
}

is equivalent to

int x;
(dummyA != NULL) ? (x = dummyA->Digit) : (x = 0);

We can use this shorthand inside our routine printf statement to solve our problem:

void
printList(List L)
{
  Pos dummy; // creates a dummy node to traverse the list

  dummy = L->Next;

  while (dummy != NULL)
  {
    printf("%d%s", dummy->Element, (dummy->Next) ? ("->") : (""));
    dummy = dummy->Next;
  }
}

As you can see, inside the printf statement, we check dummy->Next. If it is NULL, we are at the last element of the list and we don't append anything (i.e. ("")). Otherwise, we print ->.

Print singly linked list in reverse order

2016-12-23T00:05:00+08:00

Today, during the lunch break, I take a look at the following problem:

Print a singly linked list in reverse order.

This is actually one of the interview questions I got at SAP for ABAP developer position (luckily, they didn't offer me the position). I didn't get the correct answer at that time and I think the problem may help me to kill some time during the break.

The question itself is not hard if you're familiar with linked lists and the recursion philosophy:

static void
printListReverseHelper(List L)
{
  if (L == NULL)
  {
    return;
  }
  printListReverseHelper(L->Next);
  printf("%d->", L->Element);
}

void printListReverse(List L)
{
  Pos dummyL = L->Next;
  printListReverseHelper(dummyL);
}

Again, in our implementation of linked list, we use a header node. Given the simplicity of the problem, I think it is a good time to revisit some basic rules of recursion.

To be honest, recursion always gives me hard time because I always try to mentally expand all the call stack and then work backwards to see if the recursion function gives what I expect. This is super energy consuming and error-prone.

However, things start to get better since I start to read MAW. Here are the four basic rules of recursion he emphasizes:

Base cases. You must always have some base cases, which can be solved without recursion.

Making progress. For the cases that are to be solved recursively, the recursive call must always be to a case that makes progress toward a base case.

Design rule. Assume that all the recursive calls work.

Compound interest rule. Never duplicate work by solving the same instance of a problem in separate recursive calls.

Among the four rules, rule No.3 is easily my favorite. It is stated very simply but it has a huge impact on how you think about recursion.

Let's use first three rules to analyze this problem a little bit.

Base cases. This problem is quite simple. The base case is the case when the list is empty. In this case, we have nothing to do and simply return.
Making progress. This is reflected when we call printListReverseHelper(L->Next). Each time we make the recursive call, we pass in L->Next, which makes the list shorter. This eventually will make the whole list empty, which is the base case.
Design rule. I use this rule to design the whole recursion function. Just imagine a scenario like the following: Suppose you have a list of 1->2->3. Then, by the rule, we assume that the numbers 2 and 3 are already printed in reverse order. What is left to do is to print out 1 and then we are done. We follow this thought process closely when we actually write the recursion function. After we write out the base case, we first write printListReverseHelper(L->Next);. This says that the rest of the list (everything except the first node) is already printed in reverse order (i.e. 2 and 3 in our case). Then we write printf("%d->", L->Element);. This says, ok, since we are only left with the first node, let's print it out and the job is done (i.e. 1 in our case).

See how simple recursion can be if we get over the psychological urge to expand the call stack mentally and directly apply the four rules (especially the third rule) to design our function.

Environment variable substitution using Sed

2016-12-21T12:07:00+08:00

Suppose we have a text file config.ini that looks something like this:

[MSSQLSERVER]
Driver=INSTHOME/foo/foo.so

[SYBASE]
Driver=INSTHOME/bar/bar.so

...

We want to replace all the appearance of INSTHOME with the value we hold in $HOME. Here is what I do initially:

sed -i -e "s/INSTHOME/$HOME/g" config.ini

s is used to replace the found expression INSTHOME with $HOME
g stands for "global", which means to do this find & replace for the whole line. If you leave off the g and INSTHOME appears twice on the same line, only the first INSTHOME is changed to $HOME
-i is used to edit in place on filename
-e is to indicate the expression/command to run

Note

I use double quotes " so that any variable that appears inside them gets expanded. In this case, $HOME.

However, when I typed this in, I got the following error:

sed: -e expression #1, char 13: unknown option to `s'

Why did this error happen? That confused me for a while. Then, I try to simulate what the program will do for the above expression:

sed -i -e "s/INSTHOME//home/iidev20/g" config.ini

Ah! This expansion result doesn't make sense at all because the sed expression inside " needs to follow:

"s/[target_expression]/[replace_expression]/g"

So, the first thought comes to me is to escape all / in the expression:

sed -i -e "s/INSTHOME/\/home\/iidev20/g" config.ini

This can work but it has two severe drawbacks:

I'm hardcoding the value. If $HOME no longer holds /home/iidev20, then my command breaks again, and this hinders portability.
The readability of this code is too bad. Probably okay for Perl programmer but still, not quite friendly.

To address these two issues, I find the following about GNU sed:

%regexp%

(The % may be replaced by any other single character.)

This also matches the regular expression regexp, but allows one to use a different delimiter than /. This is particularly useful if the regexp itself contains a lot of slashes, since it avoids the tedious escaping of every /. If regexp itself includes any delimiter characters, each must be escaped by a backslash (\).

Essentially, we don't have to use / as our delimiter for the expression, especially when the pattern itself contains a lot of slashes (i.e. file path in my case).

So, I decided to use | as the delimiter:

sed -i "s|INSTHOME|$HOME|g" config.ini

Note

I can also use single quote ' but the command should be modified like the below by leaving out to-be-expanded variable name outside of single quotes.

sed -i 's|INSTHOME|'$HOME'|g' config.ini

Now, everything works nice and clean.

What's the difference between sourcing a script and executing a script?

2016-12-20T21:49:00+08:00

I ran across the question in the title when I took a break from work today. Then I did a little bit of googling, and the explanations were not quite satisfying to me. So, I decided to answer this question with a simplified example from my work.

For me, this question appears frequently when you try to install some software. Some software, like the product I'm working on, depends on a set of environment variables in order to set itself up properly. Usually, this may involve manual editing of the environment variables in order to make the product work. However, we can do much better. We can let a setup program edit the environment variables for the user and finish the whole product setup process automatically.

Suppose a software relies on an environment variable TEST_SOURCE and we don't have such an environment variable initially.

$ echo $TEST_SOURCE
$

If we create a test script test.sh like the following:

#!/bin/sh

export TEST_SOURCE=HELLO

Then we have two ways to execute this script: either by ./test.sh or by source test.sh, and the two have different outcomes:

$ ./test.sh
$ echo $TEST_SOURCE
$
$ source test.sh
$ echo $TEST_SOURCE
HELLO

So, the conclusion is that when we run the script with source, its commands run in the current shell. However, if we run it with ./, the script runs in a separate shell, and its effects (i.e. the modified environment variable) don't impact our current shell.
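The same isolation can be observed from C. The sketch below assumes a POSIX system; the function name `child_export_reaches_parent` is hypothetical. It plays the role of ./test.sh by setting TEST_SOURCE in a child process created with fork, and then checks whether the parent can see the variable. It is only an analogy for what the shell does, not a description of how any particular shell is implemented.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a variable set by a child process is visible in the
 * parent, 0 if it is not, and -1 if fork or waitpid fails. */
int child_export_reaches_parent(void)
{
    unsetenv("TEST_SOURCE"); /* start without the variable */

    pid_t pid = fork();
    if (pid < 0)
    {
        return -1;
    }

    if (pid == 0)
    {
        /* Child: modifies only its own copy of the environment. */
        setenv("TEST_SOURCE", "HELLO", 1);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
    {
        return -1;
    }

    /* Parent: its environment was never touched by the child. */
    return getenv("TEST_SOURCE") != NULL;
}
```

On a POSIX system this is expected to return `0`: the child gets its own copy of the environment, just like the separate shell started by ./test.sh.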

Polynomial Multiplication

2016-12-18T18:53:00+08:00

I finally got time to continue working through MAW. The problem 3.7 relates to polynomial multiplication.

Problem

Write a function to multiply two polynomials, using a linked list implementation. You must make sure that the output polynomial is sorted by exponent and has at most one term of any power.

Give an algorithm to solve this problem in $O(M^2N^2)$ time.

Write a program to perform the multiplication in $O(M^2N)$ time, where $M$ is the number of terms in the polynomial of fewer terms.

Write a program to perform the multiplication in $O(MN\log(MN))$ time.

Which time bound above is the best?

Solution

Question 1

The first question is quite straightforward. We keep the result in a linked list with exponent sorted in descending order. Each time a multiply is performed, we search through the result linkedlist for the term with the same exponent as ours. If so, we simply add coefficients together. If not, we add our product as a new term.

Polynomial
multiply1(Polynomial A, Polynomial B)
{
  Polynomial R = malloc(sizeof(struct Node));
  R->Next = NULL; // header node starts with an empty list
  PtrToNode dummyRPrev = R;
  PtrToNode dummyR = R;
  PtrToNode dummyA = A->Next;
  PtrToNode dummyB = B->Next;

  int tmpExponent, tmpCoefficient;

  while (dummyA != NULL)
  {
    while (dummyB != NULL)
    {
      tmpExponent = dummyA->exponent + dummyB->exponent;
      tmpCoefficient = dummyA->coefficient * dummyB->coefficient;

      // we go through the output polynomial to see if there is
      // a term with the same exponent as our tmpExponent.
      while (dummyR != NULL)
      {
        if (dummyR->exponent == tmpExponent)
        {
          dummyR->coefficient = dummyR->coefficient + tmpCoefficient;
          break;
        }
        else
        {
          dummyRPrev = dummyR;
          dummyR = dummyR->Next;
        }
      }

      // We couldn't find the term with the same exponent, so we create
      // a new term in our output polynomial.
      if (dummyR == NULL)
      {
        insert(tmpCoefficient, tmpExponent, dummyRPrev);
      }

      dummyR = R;
      dummyB = dummyB->Next;
    }
    dummyB = B->Next;
    dummyA = dummyA->Next;
  }

  return R;
}

The total running time is $O(M^2N^2)$. We start from the innermost loop, which goes through the result linked list to search for a term with the same exponent. Its running time depends on the length of the result linked list, which can have at most $M*N$ terms. The middle loop iterates $N$ times and the outermost loop iterates $M$ times. So, the total running time is $O(M*N*MN) = O(M^2N^2)$.

Question 2

We can certainly do better than $O(M^2N^2)$.

Polynomial
multiply2(Polynomial A, Polynomial B)
{
  int lenA = 0, lenB = 0;
  PtrToNode dummyA = A->Next;
  PtrToNode dummyB = B->Next;
  Polynomial R = malloc(sizeof(struct Node));
  PtrToNode dummyTmp, dummyShort, dummyLong, Long;
  Polynomial Tmp = malloc(sizeof(struct Node));

  // Both lists start out empty: only their header nodes exist.
  R->Next = NULL;
  Tmp->Next = NULL;

  while(dummyA != NULL)
  {
    lenA++;
    dummyA = dummyA->Next;
  }

  while(dummyB != NULL)
  {
    lenB++;
    dummyB = dummyB->Next;
  }

  if (lenA < lenB)
  {
    dummyShort = A->Next;
    dummyLong = B->Next;
    Long = B;
  }
  else
  {
    dummyShort = B->Next;
    dummyLong = A->Next;
    Long = A;
  }

  while(dummyShort != NULL)
  {
    dummyTmp = Tmp;
    while(dummyLong != NULL)
    {
      int coefficient = dummyShort->coefficient * dummyLong->coefficient;
      int exponent = dummyShort->exponent + dummyLong->exponent;
      insert(coefficient, exponent, dummyTmp);
      dummyTmp = dummyTmp->Next;
      dummyLong = dummyLong->Next;
    }
    R = add(R, Tmp);
    dummyLong = Long->Next;
    deletePolynomial(Tmp);
    dummyShort = dummyShort->Next;
  }

  return R;
}

Suppose polynomial $A$ has $M$ terms and polynomial $B$ has $N$ terms, with $M < N$. Instead of updating the result after each multiplication, we multiply one term from $A$ (the polynomial with fewer terms) by all the terms from $B$ (the polynomial with more terms). Then we add this partial product to the output linked list using the Polynomial add(...) function I implemented (it can be found in polynomial.c). The add function runs in $O(max(P,Q))$ time on polynomials with $P$ and $Q$ terms, and thus we can get the runtime for multiply2:

\begin{equation*} O(max(N,0)) + O(max(N,N)) + O(max(N,2N)) + ... + O(max(N, N(M-1))) = O(M^2N) \end{equation*}
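
To see where the $O(M^2N)$ comes from, note that before the $k$-th add the result holds at most $kN$ terms while Tmp holds $N$ terms. Under that worst-case assumption (every product has a distinct exponent), the sum above can be written out roughly as:

\begin{equation*} \sum_{k=0}^{M-1} O(max(N, kN)) = O(N) + O\left(N \sum_{k=1}^{M-1} k\right) = O(N) + O\left(N \cdot \frac{M(M-1)}{2}\right) = O(M^2 N) \end{equation*}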

Also, calculating the length of $A$ takes $O(M)$, calculating the length of $B$ takes $O(N)$, and the deletePolynomial calls inside the while loop take $O(MN)$ in total. So, the total runtime is:

\begin{equation*} O(M^2 N) + O(M) + O(N) + O(MN) = O(M^2 N) \end{equation*}
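
The linear bound for add comes from the fact that both inputs are sorted by exponent, so they can be merged in a single pass. The real add lives in polynomial.c and is not shown here; the following is only a minimal sketch of how such a merge could look, assuming header nodes and descending exponents. The names addSketch and appendTerm and the struct layout are my assumptions for illustration, and assert merely stands in for real error handling.

```c
#include <assert.h>
#include <stdlib.h>

struct Node
{
  int coefficient;
  int exponent;
  struct Node *Next;
};
typedef struct Node *PtrToNode;
typedef PtrToNode Polynomial;

// Hypothetical helper: append a new term after tail, return the new tail.
static PtrToNode
appendTerm(PtrToNode tail, int coefficient, int exponent)
{
  PtrToNode node = malloc(sizeof(struct Node));
  assert(node != NULL);
  node->coefficient = coefficient;
  node->exponent = exponent;
  node->Next = NULL;
  tail->Next = node;
  return node;
}

// Hypothetical stand-in for add(): merges two polynomials whose terms are
// sorted by descending exponent into a newly allocated polynomial.
Polynomial
addSketch(Polynomial A, Polynomial B)
{
  Polynomial R = malloc(sizeof(struct Node));
  PtrToNode tail = R, a = A->Next, b = B->Next;

  assert(R != NULL);
  R->Next = NULL;

  while (a != NULL && b != NULL)
  {
    if (a->exponent > b->exponent)
    {
      tail = appendTerm(tail, a->coefficient, a->exponent);
      a = a->Next;
    }
    else if (a->exponent < b->exponent)
    {
      tail = appendTerm(tail, b->coefficient, b->exponent);
      b = b->Next;
    }
    else
    {
      // Same exponent: combine the two terms into one.
      tail = appendTerm(tail, a->coefficient + b->coefficient, a->exponent);
      a = a->Next;
      b = b->Next;
    }
  }

  // At most one of the two lists still has terms left.
  for (; a != NULL; a = a->Next)
    tail = appendTerm(tail, a->coefficient, a->exponent);
  for (; b != NULL; b = b->Next)
    tail = appendTerm(tail, b->coefficient, b->exponent);

  return R;
}
```

Each loop iteration consumes at least one input term, so the number of steps is bounded by the total number of terms, which is within a factor of two of the longer input. Note that this sketch keeps a term even if its coefficients cancel to zero.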

Note

In this implementation, I kind of use an interface within the function. The logic that begins with while (dummyShort != NULL) is the same for both $M<N$ and $M>N$, so there is a risk of writing the same logic twice, once for each case. My solution is to provide an interface through the dummyLong and dummyShort variables.

Please note we need to multiply one term from the polynomials with fewer terms by all the terms from the polynomial with more terms. If we do the other way around, the runtime will be $O(MN^2)$.

Question 3 & 4

I haven't coded up Question 3 because I want to wait until I finish the sorting chapter. However, I can see how we can get $O(MNlog(MN))$. This solution is very similar to Question 1. We first multiply all the terms out in $O(MN)$. Then, we sort the resulting $MN$ terms by exponent. Then, we run through the linked list, merging (summing) any terms with the same exponent, which will be contiguous. The sort takes $O(MNlog(MN))$ time. The multiplications and the merging of duplicates can be performed in $O(MN)$ time. So, we have:

\begin{equation*} O(MN) + O(MNlog(MN)) + O(MN) = O(MNlog(MN)) \end{equation*}
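
The three steps (multiply out, sort, merge contiguous duplicates) can be sketched as follows. This is not the linked-list program the exercise asks for: to keep it short, it stores the terms in a plain array and borrows qsort from the standard library. The struct Term layout and the name multiply3Sketch are assumptions made for illustration. Also note that the C standard does not promise a complexity for qsort, so the $O(MNlog(MN))$ bound only holds if the library's sort is $O(n log n)$.

```c
#include <assert.h>
#include <stdlib.h>

struct Term
{
  int coefficient;
  int exponent;
};

// qsort comparator: terms with larger exponents come first.
static int
compareByExponentDesc(const void *x, const void *y)
{
  const struct Term *a = x;
  const struct Term *b = y;
  return (a->exponent < b->exponent) - (a->exponent > b->exponent);
}

// Multiplies A (m terms) by B (n terms) and writes the product into out,
// which must have room for m * n terms. Returns the number of terms written.
size_t
multiply3Sketch(const struct Term *A, size_t m,
                const struct Term *B, size_t n,
                struct Term *out)
{
  size_t count = 0, i, j, k;

  assert(out != NULL);

  // Step 1: multiply all the terms out, O(MN).
  for (i = 0; i < m; i++)
  {
    for (j = 0; j < n; j++)
    {
      out[count].coefficient = A[i].coefficient * B[j].coefficient;
      out[count].exponent = A[i].exponent + B[j].exponent;
      count++;
    }
  }
  if (count == 0)
  {
    return 0;
  }

  // Step 2: sort the MN products by exponent.
  qsort(out, count, sizeof(struct Term), compareByExponentDesc);

  // Step 3: equal exponents are now contiguous, so one pass merges them.
  k = 0;
  for (i = 1; i < count; i++)
  {
    if (out[i].exponent == out[k].exponent)
    {
      out[k].coefficient += out[i].coefficient;
    }
    else
    {
      out[++k] = out[i];
    }
  }
  return k + 1;
}
```

For example, multiplying $x + 1$ by itself produces the four products $x^2$, $x$, $x$, $1$, which the merge pass collapses into the three terms of $x^2 + 2x + 1$.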

When we compare the runtimes of the three solutions, we can see that the 1st one is the worst. For the 2nd and the 3rd, however, the result depends on the sizes of $M$ and $N$. If $M$ and $N$ are close in size, then $O(MNlog(MN))\approx O(MNlog(M^2))=O(MNlog(M))$, which is better than $O(M^2N)$. However, if $M$ is very small in comparison to $N$, then $M$ is less than $log(MN)$, and in this case the 2nd one is better than the 3rd.

Pelican Hack Day

2016-12-17T22:44:00+08:00

I have been using Sphinx since 2012, and I have spent quite an amount of time customizing my old Sphinx-based websites (This article revisits all my past website construction effort). However, most of the time I was tweaking the CSS and the content organization of the site. I never got my hands on serious template customization. The reason is quite simple: I have limited knowledge of how Sphinx interacts with the Jinja template engine, and the Jinja language itself just looks really bizarre to me.

Now that I have started a new blog, I have decided to give Jinja a chance and customize my archive page a little bit.

Here is what I want my archive page to look like:

Don't display post content. Only the title itself.

Display archives by year and archives by tags within the same page at the same time.

Display the number of posts for each year, and for each tag.

Show the time only in "month.day.year". I don't need the hours and minutes.

First Iteration

If you have read the Creating themes section in the Pelican doc, you will see that we have to work with archives.html. Pelican uses the layout specified in this file to generate our archive page.

For the first iteration, my archives.html looks something like this:

 {% extends "base.html" %}
 {% block content %}
 <section id="content" class="body">
 <h1>Archives for {{ SITENAME }}</h1>

 {# based on http://stackoverflow.com/questions/12764291/jinja2-group-by-month-year #}

 {% for year, year_group in dates|groupby('date.year')|reverse %}
 {% for month, month_group in year_group|groupby('date.month')|reverse %}
     <h4 class="date">{{ (month_group|first).date|strftime('%b %Y') }}</h4>
     <div class="post archives">
     <ul>
         {% for article in month_group %}
             <li><a href="{{ SITEURL }}/{{ article.url }}">{{ article.title }}</a></li>
         {% endfor %}
     </ul>
     </div>
 {% endfor %}
 {% endfor %}
 </section>
 {% endblock %}

Let's first take a look at what archive page we can get from this code.

Line[1] and Line[2] illustrate how template files usually get organized. Usually, we create a basic html file that specifies the layout of our site, which is base.html in my case. Then, we extend this basic html to tailor it to different needs. Inside base.html, we place a placeholder, which will be replaced by the content of each child html page:

{% block content %}
{% endblock %}

In my case, I extend base.html to make an archive page. The content enclosed between {% block content %} and {% endblock %} will replace the placeholder inside base.html.

Line[4] {{ SITENAME }} is very similar to shell expansion. We will expand the variable SITENAME with its content. SITENAME is the same variable we specify in pelicanconf.py and the expanded result will be the value we assign to SITENAME variable in config file. In my case, the expansion result will be "Tech Stuff".

Starting from Line[8], things get interesting:

{% for year, year_group in dates|groupby('date.year')|reverse %}
...
{% endfor %}

Jinja itself is based on Python. So, we can borrow some knowledge from our Python realm. As you can tell, {% for ... %} ... {% endfor %} is what a for loop looks like in Jinja.

dates itself is a list of articles ordered by date, where each element is an article object. Here is what dates looks like in my mind:

dates = [ article1, article2, article3, ... ]

and each article looks like:

article = [ title, summary, author, date, ... ]

Let's put the following code in our archives.html to better understand the structure of dates:

{% for year in dates %}
<h1>{{ year }}</h1>
{% endfor %}

The output looks like:

/Users/zeyuan/Documents/projects/linuxjedi.co.uk/content/blog/2016/12/17/pelican-hack.rst
/Users/zeyuan/Documents/projects/linuxjedi.co.uk/content/blog/2016/12/16/portability.rst
/Users/zeyuan/Documents/projects/linuxjedi.co.uk/content/blog/2016/12/03/maw-003.rst
/Users/zeyuan/Documents/projects/linuxjedi.co.uk/content/blog/2016/11/28/maw-002.rst
...

Note

I highly recommend reading through the Creating themes section of the Pelican doc, which describes those objects in words.

groupby is a Jinja filter that groups a sequence of objects by a common attribute. In our case, we want to group the articles by year. In other words, articles with the same year should be in the same group. Let's experiment with the following code:

{% for year, year_group in dates|groupby('date.year') %}
    <h1>{{ year }} {{ year_group }}</h1>
{% endfor %}

The output looks like:

2015 []
2016 [, , , , , , ]

Then, we apply the reverse filter to put 2016 on top of 2015. The rest of the code shouldn't be hard to decode.

Note

| is pipe, which is used to separate filters. It works like pipe in shell.

Count posts

This is what my current archive page layout looks like:

{% extends "base.html" %}
{% block content %}
<section id="content" class="body">
<h1>Archives for {{ SITENAME }}</h1>

<p>
<h2>Archives by year</h2>

{% for year, numposts in articles|groupby('date.year') %}
<li><a href="{{ SITEURL }}/archives/{{ year }}/period_archives.html">{{ year }} ({{ numposts|count }})</a></li>
{% endfor %}
</p>

<p>
<h2>Archives by tag</h2>

{% for tag, articles in tags %}
<li><a href="{{ SITEURL }}/tag/{{ tag }}.html">{{ tag }} ({{ articles|count }})</a></li>
{% endfor %}
</p>
</section>
{% endblock %}

If you understand the previous sections, this code chunk should pose no problem for you. I should point out that count is the filter we use to count the number of articles.

The rest

For "Archives by year", I use another template, "period_archives.html", to specify the layout. It looks pretty straightforward. However, there is one problem that took me a while to figure out:

When I click on a certain year, I jump to the archive page for that year. On that page, I want the heading to display "Archives for 2016", where "2016" is replaced by the year I actually clicked. This leads to a problem: how do I know which year the user clicked? In other words, how do I pass that information to "period_archives.html"?

I couldn't find a nice way to solve this problem. Here is what I do:

{% for year, null in dates|groupby('date.year') %}
    <h1>Archives for {{ year }}</h1>
{% endfor %}

Since all the articles under a certain year archive have the same year value, I only need to look at one of them to find out the year and put it in the heading. However, I don't have to do this trick for tags. I can somehow magically reference the value:

<h1>Archives by tag '{{ tag }}'</h1>

One last point: you can define your own Jinja filters in pelicanconf.py.

Lesson Learned: Portability

2016-12-16T23:20:00+08:00

Portability is the kind of issue that people always talk about in the software engineering field. I had never run into such a problem myself, probably because I never had to port my work to different platforms. However, this is no longer the case at work.

Recently, I revisited the first task I owned when I joined the team, which was to develop a lightweight configuration tool to improve product usability. Lightweight is the key to this task, as we originally had a Java-based GUI setup tool involving lots of point & click. That solution was fairly unpopular among our customers, mainly because the program itself takes up lots of space in the DB2 image and it doesn't fit well with its peers, which are all scripts that can be executed directly from the shell.

So, in my iteration, I decided to follow the format of the majority of utility tools in the DB2 image and use a scripting language. The language I chose was, unfortunately, Shell. The whole task went amazingly well. With the help of my tool, product configuration time was reduced by 75%. Everyone on my team loved it until someone decided to run it on AIX.

The environment in which I developed the tool is SUSE with ksh installed. The AIX machine that my colleague tested my tool on also has ksh configured, but there are some quirky behavior differences between the platforms.

For instance, when I try to split a string, say tmp, on the delimiter : into an array tmp2, the following code works great on SUSE:

saveIFS=$IFS
IFS=":"
local tmp2=($tmp) # split tmp with ":" and stored into tmp2 as array
IFS=$saveIFS

However, on AIX, only the following way will work:

#!/bin/sh
tmp=a:b:c:d
saveIFS=$IFS
IFS=":"
local tmp2
n=0
for i in $tmp; do tmp2[$n]=$i; ((n=n+1)); done
IFS=$saveIFS
echo ${tmp2[0]}
echo ${tmp2[1]}
echo ${tmp2[2]}
echo ${tmp2[3]}

As you can see, I need a for loop to split the string on AIX.

As another example, when I try to increment a counter inside a loop, on SUSE I can do ((n++)), but on AIX I need to do ((n=n+1)).

This made me realize why most of our development scripts (e.g. those that help build the source code) use Perl instead of shell. I had to rewrite the whole script in Perl.

This was a very important lesson for a fresh college graduate at that time.

Reverse Singly Linked List

2016-12-03T20:34:00+08:00

Problem

This problem is MAW 3.12:

Write a nonrecursive procedure to reverse a singly linked list in O(N) time.

Write a procedure to reverse a singly linked list in O(N) time using constant extra space.

Solution

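Essentially, this is just one problem: reverse a singly linked list under various constraints. There are a couple of ways to do so. The in-place version of Solution 1 and Solution 2 satisfy both 3.12.a and 3.12.b; the copying version of Solution 1 needs O(N) extra space, and Solution 3 is recursive, so it also uses O(N) stack space.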

Note

Solution 2 & 3 are probably what most people will expect, particularly during an interview.

Solution 1

 List
 reverseList(List L)
 {
   Pos dummyL = L->Next;
   List R = malloc(sizeof(struct Node));
   R->Next = NULL; // the new list starts out empty
   while (dummyL != NULL)
   {
     Pos tmpNode = malloc(sizeof(struct Node));
     tmpNode->Element = dummyL->Element;
     tmpNode->Next = R->Next;
     R->Next = tmpNode;
     dummyL = dummyL->Next;
   }
   return R;
 }

Solution 1 is pretty straightforward. We first create a new list. Then, we walk through the original list and insert a copy of each node we visit at the very beginning of the new list. Once we finish traversing the original list, we return the new list.

Note

You can use a stack to reverse the list. This will require O(N) extra space.

This solution shows one of the reasons why we use a header node or dummy node in our linked list implementation (instead of just using a pointer that points directly to the first element in the list):

Without the dummy node, there is no really obvious way to insert at the front of the list.

This can be seen from Line[12]. Also, this routine has a return type List instead of void.

Note

The struct definition is the same whether or not we use a dummy node. The implementation difference shows up in how the program constructs a list: in my case, initializeList.

However, this solution wastes a lot of memory and performs one malloc per node, which basically duplicates the data. This is where the algorithm can be improved.

 List
 reverseList(List L)
 {
   Pos dummyL = L->Next;
   List R = malloc(sizeof(struct Node));
   R->Next = NULL; // the new list starts out empty
   while (dummyL != NULL)
   {
     // Remove element from old list.
     Pos tmpNode = dummyL;
     dummyL = dummyL->Next;

     // Insert element in new list.
     tmpNode->Next = R->Next;
     R->Next = tmpNode;
   }
   return R;
 }

Note

This solution has two interesting points:

It's obvious that it's correct: there are no corner cases to worry about and both two-line operations are familiar to anyone who's manipulated a linked list.
It's pretty much identical to Solution 2 (same number of temporary variables, same assignments in slightly different order).

Solution 2

 void
 reverseListIterative(List L)
 {
   Pos dummyCurrent = L->Next,
       dummyPrev = NULL,
       dummyNext;

   while (dummyCurrent != NULL)
   {
     dummyNext = dummyCurrent->Next;
     dummyCurrent->Next = dummyPrev;
     dummyPrev = dummyCurrent;
     dummyCurrent = dummyNext;
   }
   L->Next = dummyPrev;
 }

The 2nd solution is an iterative approach. The logic itself is quite straightforward, but please remember that we assume the dummy node exists. You can see this in both Line[4] and Line[15].

Note

This is actually not the solution I came up with initially. My initial implementation works but is not as nice as this one. You can check it out in my linkedList.c.

Solution 3

 static List P;
 static void
 reverseListRecursiveHelper(List L)
 {
   if (L->Next == NULL)
   {
     P = L;
     return;
   }
   reverseListRecursiveHelper(L->Next);
   L->Next->Next = L;
   L->Next = NULL;
 }

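 void
 reverseListRecursive(List L)
 {
   // empty list: nothing to reverse, and the helper expects a real node
   if (L->Next == NULL)
   {
     return;
   }
   reverseListRecursiveHelper(L->Next);
   L->Next = P;
 }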

This solution is recursive. It took me a while to work out because we have a dummy node to take care of, which is why I use a private helper function. There are a couple of important points to notice here:

The static List variable P is necessary because we need to keep track of the first node after the reversal (i.e. the last node in the original list). Without P, we cannot reach that node: all the links are reversed, so we can no longer traverse the list from our dummy node.
Inside reverseListRecursiveHelper, I don't check whether L is NULL (you need this check in an implementation without a dummy node, where NULL is the empty list). In our implementation the dummy node always exists, even when the list is empty (check out the deleteList routine), so evaluating L->Next in reverseListRecursive is always valid. Note, though, that for an empty list L->Next is NULL, so the helper must only be called on a non-empty list.

We use a private function mainly because we have a dummy node in our implementation. This is a special case that cannot be handled inside the recursive call. That's also why the first data node of the original list is passed into the helper function.

List P;
void
reverseListRecursive(List L)
{
  // empty list base case
  if (L->Next == NULL)
  {
    return;
  }
  // only one node (tail node) base case
  if (L->Next->Next == NULL)
  {
    P = L->Next;
    return;
  }
  reverseListRecursive(L->Next->Next);
  L->Next->Next->Next = L->Next;
  L->Next->Next = NULL;
  L->Next = P;
 }

The above code is a perfect example of why the dummy node case cannot be handled in the recursive call. When we recurse, we always assume that a dummy node exists at the head of the sublist we pass in. However, that is not what our list actually looks like. You can see why the recursion assumes the dummy node exists by reading Line[6], Line[11] and Line[16].
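
One way to get rid of the static variable P in Solution 3 is to let the helper return the new first node instead of storing it on the side. The following is only a minimal sketch under the same header-node assumption; the names reverseNodes and reverseListNoStatic are made up for illustration and are not part of my linkedList.c.

```c
#include <assert.h>
#include <stddef.h>

struct Node
{
  int Element;
  struct Node *Next;
};
typedef struct Node *List;
typedef struct Node *Pos;

// Reverses the chain of data nodes starting at first (no header node)
// and returns the node that is first after the reversal.
static Pos
reverseNodes(Pos first)
{
  Pos newFirst;

  // An empty chain or a single node is already reversed.
  if (first == NULL || first->Next == NULL)
  {
    return first;
  }
  newFirst = reverseNodes(first->Next);
  first->Next->Next = first; // the old successor now points back at us
  first->Next = NULL;        // and we become the tail (for now)
  return newFirst;
}

// L is the header (dummy) node, as everywhere else in this post.
void
reverseListNoStatic(List L)
{
  assert(L != NULL); // the header node must always exist
  L->Next = reverseNodes(L->Next);
}
```

The header node is still handled outside the recursion, for the same reason as before, but the empty list no longer needs a separate check because the helper accepts NULL. Like Solution 3, this uses O(N) stack space.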

PrintLots

2016-11-28T18:20:00+08:00

Problem

Today, I finished problem 3.2. The question is as follows:

You are given a linked list, L, and another linked list, P, containing integers sorted in ascending order. The operation PrintLots(L,P) will print the elements in L that are in positions specified by P. For instance, if P = 1,3,4,6, the first, third, fourth, and sixth elements in L are printed. Write the procedure PrintLots(L,P). You should use the basic list operations. What is the running time of your procedure?

Solution

 void
 printLots(List L, List P)
 {
   Pos dummyP, dummyL; // cursors used to traverse P and L
   int i = 0, idx, outofelement = 0;

   dummyP = P->Next;
   dummyL = L->Next;

   while (dummyP != NULL)
   {
     idx = dummyP->Element;
     if (idx >= 0)
     {
       // if the idx is larger or equal to where the dummyL currently is
       // we don't want to reset the dummyL to the very beginning of
       // the list L again to redo the traverse.
       if (idx < i)
       {
         dummyL = L->Next;
         i = 0;
       }
       for(; i < idx; i++)
       {
         if (dummyL->Next != NULL)
         {
           dummyL = dummyL->Next;
         }
         else
         {
           outofelement = 1;
           break;
         }
       }
       if (outofelement == 1)
       {
         printf("No element in position %d, ", idx);
       }
       else{
         printf("%d, ", dummyL->Element);
       }
     }
     else
     {
       exit(EXIT_FAILURE);
     }
     outofelement = 0;
     dummyP = dummyP->Next;
   }
 }

The problem isn't hard to solve. However, to get things right, I needed to develop several test cases. Let's develop a solution that can handle a more general situation. In other words, the linked list P doesn't necessarily contain integers sorted in ascending order. Here are the test cases I developed:

L: 23, 44, 45, 57, 89, -1

P:  1, 3, 4, 5          <--- normal case
    1, 3, 4, 6          <--- there is no sixth element in L
    1, 3, 4, 6, 7       <--- there is no sixth, seventh element in L
    6, 7, 3, 1          <--- there is no sixth, seventh element in L, but have third, first element
    6, 2, 7, 1          <--- a no element (6th) followed by a existing element (2nd)
   -9, 1, 3, 4          <--- negative integer from P appears at the beginning
    1, 2, 4, -10        <--- negative integer from P appears at the end

The code presented above handles all these situations. In addition, if the integers in P are actually in ascending order, we want to take advantage of this piece of information. That's why we check if (idx < i). We don't want to reset the traversal pointer (i.e. dummyL) every single time. In other words, if the numbers in P are actually ascending, we want to move the traversal pointer from its current position instead of resetting it.

Automatically publish Tinkerer bld output to GitHub with Travis CI

2016-11-27T22:00:00+08:00

Preface

I saw a comment on the web that talks about auto deployment with Travis CI:

As an aside, you can also use GitHub Pages for hosting, which is free, and then integrate it with Travis-CI to automatically publish the blog (basically run pelican to generate the output and push the changes back online) in order to decouple the actual writing of blog posts from the publishing part.

The above also has the advantage of enabling a history of changes done (both for the articles themselves and the output), as well as simplifying things if you want to have guest posts and so on.

That's the place where I start to explore Travis CI.

Travis CI

Travis CI part isn't hard to figure out. I referenced the following articles to get me started with this great tool, particularly with Sphinx-doc:

learn-travis

Sphinx-doc repo .travis.yml

Have Travis-CI test your Sphinx docs

The basic idea of Travis CI is quite simple. Once you commit something, it triggers Travis CI to clone your repository and run the commands you specified in .travis.yml, and then it tells you the result for this commit (i.e. whether you pass all the tests specified in .travis.yml).

Work with Tinkerer

Note

Tinkerer is built upon Sphinx-doc. Any Sphinx-doc-ish tool should have a similar setup when working with Travis CI.

The setup for me is that I don't use gh-pages. Instead, I directly use master branch as the source for my github page. The reason is that Tinkerer will generate index.html directly inside root directory of the repo, which will redirect the visit to index.html under blog. blog is the default output directory.

Here are the tutorials I referenced. However, all of them talk about working with gh-pages:

Auto-deploying built products to gh-pages with Travis

Automatically Publish Javadoc to GitHub Pages with Travis CI

The first link above offers a framework of how you should get everything working and the second link's bottom script offers some intuition.

I'm not going to redo the work. I just want to point out a few things you need to be careful about:

DO NOT use a personal token. As mentioned in the first link, a GitHub personal access token grants full access to all your git repos. That's a very high risk.
Be careful with public/private keys. You need to use the Travis client to encrypt the private ssh key and upload the corresponding public ssh key to your repository.
Don't put a passphrase on your ssh key. If you do, Travis CI will ask for the passphrase during the automation process, which will make the build hang. If this happens, regenerate the ssh key.
Be careful to upload only your .enc file. Don't upload your ssh private key to your repo.

Decode the script

.travis.yml

This is my .travis.yml:

language: python
python:
  - "2.7"

install:
  - pip install tinkerer
  - pip install sphinxjp.themes.tinkerturquoise

script:
  - tinker -b

env:
  global:
  - ENCRYPTION_LABEL: "8c1ec1f6b778"
  - COMMIT_AUTHOR_EMAIL: "ferrishu3886@gmail.com"

after_success:
  - bash ./deploy.sh

notifications:
  email:
    recipients:
      - ferrishu3886@gmail.com
    on_success: change # option [always|never|change]
    on_failure: always

install section asks Travis CI to install the necessary packages to build our doc.
script section contains our doc build command.
env section contains environment variables required for our deploy.sh. They are used to authorize a user on Travis CI to make git clone, git push, etc.
after_success tells Travis CI what to do once the script section is done successfully.
notifications customize the email notification.

deploy.sh

deploy.sh is easy to understand if you take a look at the Travis CI log for a build.

Travis CI first performs the basic environment setup. Then, it clones the git repository. Next, it builds our doc. If the build succeeds, it executes our deploy.sh.

Inside deploy.sh, the main idea is to first clone the same repo (i.e. travis-dup) and copy the bld output pages (under /xxks-kkk/blog/blog) to the bld directory of the repo we just cloned (i.e. travis-dup/blog). If nothing has changed in the bld output pages, we exit. Otherwise, we commit the changes, use the authentication we just added (i.e. ssh-add travis) and push the change to the repo.

To keep it simple, you can imagine that Travis CI is a remote server on which you can do anything you want. Thus, we can have the bld result pushed to our repo by asking a user (i.e. travis) on the remote server to do so.

Generate a Linked List from a given array

2016-11-27T19:38:00+08:00

Preface

A couple of months ago, I started working through Data Structures and Algorithm Analysis in C (2nd edition) (referred to as MAW in the following posts) to serve several purposes:

to get enough familiarity with the C programming language
to keep my computer science foundation knowledge fresh
to prepare for system-level programming, which I'm interested in and for which mastering C and C++ is a must

I work on the DB2 codebase but I don't get to play with the material mentioned above a lot. Things can get rusty pretty quickly. So, I need a way to keep fresh.

Important

All the source code related to this book can be found in my git repo.

Solution

For completeness and readability, here is my basic node declaration and definition.

typedef int ET; // ET is short for "ElementType"

// we always assume there is a dummy node at the very beginning
// of the list.
#ifndef _LINKED_LIST_H
#define _LINKED_LIST_H

struct Node;
typedef struct Node *PtrToNode;
typedef PtrToNode List;
typedef PtrToNode Pos;

#endif

// placed in the implementation file
struct Node
{
  ET Element;
  Pos Next;
};

When I work through the linked list questions in Chapter 3, the first thing I need is a way to verify my solution, which means I need a way to quickly generate a test linked list. That's what List initializeList(ET A[], int arrayLen); is for.

static List
initializeNoHeaderList(ET A[], int arrayLen)
{
  int i = 0;
  Pos tmpNode;

  if (arrayLen == 0)
  {
    return NULL;
  }
  tmpNode = malloc(sizeof(struct Node));
  if (tmpNode == NULL)
  {
    exit(EXIT_FAILURE);
  }
  tmpNode->Element = A[i];
  tmpNode->Next = initializeNoHeaderList(A+1, arrayLen-1);
  return tmpNode;
}

List
initializeList(ET A[], int arrayLen)
{
  Pos header;

  header = malloc(sizeof(struct Node));
  if (header == NULL)
  {
    exit(EXIT_FAILURE);
  }
  header->Next = initializeNoHeaderList(A, arrayLen);
  return header;
}

initializeList adds a dummy node and invokes initializeNoHeaderList to actually generate linked list from a given array. Inside initializeNoHeaderList, we use recursion to generate the list from array.

Note

If we change tmpNode->Next = initializeNoHeaderList(A+1, arrayLen-1); to tmpNode->Next = initializeList(A+1, arrayLen-1);, we get a list whose nodes alternate between actual data nodes and dummy nodes (e.g. with ET test_arr[] = {23, 44, 45, 57, 89, -1}; the generated linked list will be 23->0->44->0->45->0->57->0->89->0->-1->0->). The 0s are the Element fields of the dummy nodes, which are never initialized and merely happened to be 0 here.
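
The recursive version uses one stack frame per array element. If that ever becomes a concern for long test arrays, the same list can be built with a loop that keeps a tail pointer. Below is a minimal sketch of that alternative; the name initializeListIterative is hypothetical (it is not in my repo), the type definitions are repeated only to make the snippet self-contained, and assert stands in for the exit(EXIT_FAILURE) checks used above.

```c
#include <assert.h>
#include <stdlib.h>

typedef int ET;

struct Node
{
  ET Element;
  struct Node *Next;
};
typedef struct Node *PtrToNode;
typedef PtrToNode List;
typedef PtrToNode Pos;

// Builds a list with a dummy header node from the first arrayLen
// elements of A, appending each new node at the tail.
List
initializeListIterative(const ET A[], int arrayLen)
{
  List header = malloc(sizeof(struct Node));
  Pos tail = header;
  int i;

  assert(header != NULL);
  header->Element = 0; // unused, but initialized so it never holds garbage
  header->Next = NULL;

  for (i = 0; i < arrayLen; i++)
  {
    Pos node = malloc(sizeof(struct Node));
    assert(node != NULL);
    node->Element = A[i];
    node->Next = NULL;
    tail->Next = node; // link the new node behind the current tail
    tail = node;
  }
  return header;
}
```

Calling it with the test array {23, 44, 45, 57, 89, -1} and a length of 6 yields the header node followed by the six data nodes in array order; a length of 0 yields just the header.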

Hello World

2016-11-23T23:00:00+08:00

This blog will focus entirely on posts that involve either source code or mathematical expression. If you are looking for my posts about life reflection, book review, and many other non-technical posts, please check out my blog on wordpress.

Minimal Emacs Tutorial

2015-10-18T16:18:00+08:00

Learn about Emacs

Here I will cover some basic manipulation with text files using emacs. It should be enough to get started working with emacs.

Terms in Emacs

Region: the highlighted area
Kill: same as "cut"
Yank: same as "paste"

Emacs Key Notation

Prefix	Meaning
C-	(press and hold) the Control key
M-	the Meta key (the Alt key, on most keyboards)
S-	the Shift key (ie. `S-TAB` means Shift Tab)
DEL	the Backspace key (not the Delete key).
RET	the Return or Enter key
SPC	the Space bar key
ESC	the Escape key
TAB	the TAB key
ARR	the arrow keys

Common Usage

System operation

C-g keyboard-quit; cancels anything Emacs is executing. If you press any key sequence wrongly, C-g to cancel that incorrectly pressed key sequence and start again.
C-x C-c close emacs
C-x b Open a prompt to enter a buffer name
C-h f Describe a function (i.e., C-h f electric-indent-mode, C-h f fboundp)
C-x ARR quickly switch between buffers

File Editing

Note

You need to set the mark before you can use region operations. To learn more, see The Mark and Region.
To move or copy a region of text in emacs, you must first "mark" it, then kill or copy the marked text, move the cursor to the desired location, and restore the killed or copied text. A region of text is defined by marking one end of it, then moving the cursor to the other end.

C-@ Set the mark here
C-SPC Set the mark where point is
C-x h Select the whole text
C-w kill the region
M-d kill forward to the end of the next word (kill-word)
C-y yank the region
M-w copy the region
C-k kill from the cursor to the end of the line (to kill the whole line, put the cursor at the very beginning of the line)

Note

To copy text, kill it, yank it back immediately (so it's as if you haven't killed it, except it's now in the kill ring), move elsewhere and yank it back again.

C-x C-s save file
C-x C-v RET reload a file (alternative way is M-x revert-buffer)
C-/ (C-x u) undo
C-r invoke backward search (type search word thereafter. Use C-r to repeatedly travel through the matches backward)
C-s similar to C-r but search forward
C-x r t insert words to multiple lines highlighted (the same thing you typed will be entered on all the lines you've selected)
M-x clipboard-yank paste the clipboard text to emacs (useful when using emacs GUI)
M-x clipboard-kill-region cut the region from emacs to the clipboard

Cursor Movement

ESC-< go to the beginning of the file
ESC-a go to beginning of the sentence
ESC-e go to end of the sentence
C-a go to beginning of the line
C-e go to the end of the line
M-x goto-line go to the line specified
C-e RET simulate o in vi
C-a RET simulate O in vi
C-Up go to the cursor location before a chunk of pasted text
C-v page down
M-v page up

Searching and Replacing

ESC-% (query-replace) - ask before replacing each OLD STRING with NEW STRING.
- Type y to replace this one and go to the next one
- Type n to skip to next without replacing
- Type ! to replace this one and remaining replacements without asking
- See more options in GNU manual
ESC-x replace-string replace all occurrences of OLD STRING with NEW STRING.
ESC-x list-matching-lines lists all the lines matching your pattern in a separate buffer, along with their numbers. Use "ESC-x goto-line" to go to the occurrence you're interested in.

Manage Split Windows

C-x 2 split-window-below
C-x 3 split-window-right
C-x 1 delete-other-windows (unsplit all)
C-x 0 delete-window (remove current pane)
C-x o other-window (cycles among the open windows)

File Management (dired mode)

M-x dired start viewing a directory
^ go to parent dir
g refresh dir listing
q Quit dired mode (buffer still exists)
RET Open the file or directory (this will open with another buffer). If you want to stick with one buffer, use a.
o Open file in another window (move cursor to that window as well)
C-o Open file in another window but stay on dired buffer
+ create new dir
C-x C-f Create a new file (yes, the command is the same as opening a new file in non-dired mode)

Other

M-x whitespace-mode allows you to explicitly see white-space, tab, newline. Especially useful when working with Python.
M-x sort-lines allows you to sort the marked region alphabetically. Especially useful when working with lots of Java import or C #include lines.
C-x l count the number of lines in the file; give the current line number; list how many lines are left.

HowTos

Parent shell

When running Emacs in a terminal, you can press C-z to suspend Emacs, type your shell commands, and then resume Emacs with fg.

How can I get Emacs to reload all my definitions that I have updated in .emacs without restarting Emacs?

You can use the command load-file (M-x load-file, then press return twice to accept the default filename, which is the current file being edited).

You can also just move the point to the end of any sexp and press C-x C-e to execute just that sexp. Usually it's not necessary to reload the whole file if you're just changing a line or two.

M-x eval-buffer immediately evaluates all code in the buffer. It's the quickest method, if your .emacs is idempotent.

You can usually just re-evaluate the changed region. Mark the region of ~/.emacs that you've changed, and then use M-x eval-region RET. This is often safer than re-evaluating the entire file since it's easy to write a .emacs file that doesn't work quite right after being loaded twice.
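If you reload often, the load-file step above can be wrapped in a small command. This is only a minimal sketch: it assumes your configuration is the file Emacs records in the variable user-init-file, and the command name my/reload-init-file is made up for illustration.

```elisp
;; Hypothetical helper (the name is not from any package): reload the init file.
(defun my/reload-init-file ()
  "Reload the file named by `user-init-file'."
  (interactive)
  (if user-init-file
      (load-file user-init-file)
    ;; user-init-file is nil when Emacs was started without an init file.
    (message "No init file was loaded (e.g. emacs -q)")))
```

After evaluating it once, M-x my/reload-init-file reloads everything. The same caveat applies as with eval-buffer: this is only safe if the file behaves correctly when loaded twice.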

Shift multiple lines with TAB

Select multiple lines, then type C-u 8 C-x TAB to indent the region by 8 spaces. C-u -4 C-x TAB will un-indent by 4 spaces.

Switch between windows when one window runs term

If you have two windows open and one of them runs a term (i.e., M-x term), you may find that "C-x o" no longer works when you want to switch back to the other window. In this case, use C-c o to switch to the next window from term.
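C-c o works because, in term's char mode, C-c takes over the role of the C-x prefix. If you would rather have a dedicated key, one option is to add a binding to term-raw-map, the keymap term uses in char mode. A sketch, where the choice of M-o is my assumption and not anything term prescribes:

```elisp
;; Assumption: M-o is free to rebind; pick any key you do not need inside the terminal.
(with-eval-after-load 'term
  (define-key term-raw-map (kbd "M-o") #'other-window))
```

The trade-off is that M-o is then no longer sent to the program running inside the terminal.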

Comment out a region

To comment out multiple lines, highlight the region and then run M-x comment-region. To undo the comment, run M-x uncomment-region.

Error during download request: Not Found

This happens when you try to install a package (M-x package-install) and the local package list is out of date. M-x package-refresh-contents to the rescue.

Editing multiple lines at the same time

Suppose I have the following chunk of code that I want to edit:

printf "%s=%s\n" "Database" "bool_db"
printf "%s=%s\n" "Username"  "admin"
printf "%s=%s\n" "Password"  "password"
printf "%s=%s\n" "ReadOnly"  "false"
printf "%s=%s\n" "ShowSystemTables" "false"
printf "%s=%s\n" "LegacySQLTables" "false"
printf "%s=%s\n" "LoginTimeout" "0"

and I want to remove all printf "%s=%s\n" in each line. I can do the following:

Mark the beginning of the region, invoke M-x rectangle-mark-mode (or C-x SPC), and select all the printf "%s=%s\n"
Delete them with M-x kill-region (C-w), or with C-x r k (kill-rectangle)

Note

Instead of deleting, you can use C-x r t string RET to replace the rectangle contents with string on each line.
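The same rectangle edit can also be done from Lisp, which helps when you want to repeat it. The sketch below runs on two sample lines in a temporary buffer rather than on your file, and it uses delete-rectangle (the function behind C-x r d), which removes the text without saving it for a later yank.

```elisp
(require 'rect)

;; Sketch: strip a fixed-width prefix from two sample lines in a scratch buffer.
(with-temp-buffer
  (insert "printf \"%s=%s\\n\" \"Database\" \"bool_db\"\n"
          "printf \"%s=%s\\n\" \"Username\"  \"admin\"\n")
  (let ((width (length "printf \"%s=%s\\n\" "))  ; prefix width, incl. trailing space
        (start (point-min)))                      ; top-left corner of the rectangle
    (goto-char start)
    (forward-line 1)           ; move to the last line of the rectangle
    (move-to-column width)     ; right edge of the rectangle (exclusive)
    (delete-rectangle start (point))
    (buffer-string)))          ; the buffer text with the prefix removed
```

To replace the prefix instead of deleting it, call (string-rectangle start (point) "some text") in place of delete-rectangle; that is what C-x r t runs.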

Turn on the line number on the left hand side

I find this is particularly useful when I work with gdb in emacs. It can be done with M-x linum-mode.
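On newer Emacs versions, note that display-line-numbers-mode has been built in since Emacs 26, and linum-mode is marked obsolete as of Emacs 29. To turn line numbers on automatically, here is a hedged sketch for your init file; hooking into prog-mode-hook (programming-language buffers only) is my assumption about where you want them.

```elisp
;; Prefer display-line-numbers-mode when it exists; otherwise fall back to linum-mode.
(add-hook 'prog-mode-hook
          (if (fboundp 'display-line-numbers-mode)
              'display-line-numbers-mode
            'linum-mode))
```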

Resources

I personally reference them a lot, but there are tons more online through Google.

Emacs Configuration

This is my personal emacs configuration.

algorithm	worst-case time
quick-find	\(MN\)
quick-union	\(MN\)
smart union	\(N + M\log N\)
quick-union + path compression	\(N + M\log N\)
smart union + path compression	\(N + M\log^*N\)

Type	Form	Operand Value	Name
Immediate	\($Imm\)	\(Imm\)	Immediate
Register	\(E_a\)	\(R[E_a]\)	Register addressing
Memory	\(Imm\)	\(M[Imm]\)	Direct addressing
Memory	\((E_a)\)	\(M[R[E_a]]\)	Indirect addressing
Memory	\(Imm(E_b, E_i, s)\)	\(M[Imm+R[E_b]+(R[E_i]\cdot s)]\)	Scaled indexed addressing ¹