what is the philo problem ?

WHY ?

The Philosophers Problem, introduced by Edsger Dijkstra in 1965, is a theoretical framework used to illustrate the challenges of resource allocation, synchronization, and deadlock in multi-threaded systems. This problem is commonly used in computer science education to help students and engineers understand how to design systems that handle concurrency safely and efficiently. However, it is not used for benchmarking massively parallel computations or high-performance computing (HPC) systems, as its primary focus is on ensuring correct synchronization and avoiding deadlock, rather than on evaluating computational performance.

The obvious response might be, “That’s just stupid! We don’t need that many philosophers.” Sure, cranking up the number of philosophers could make verifying the exercise absurdly complex. But hey, that’s not a pragmatic solution, is it? WE ARE PROGRAMMERS. WE NEED CONCRETE ANSWERS! So, buckle up because I’m about to dive deep into why this is a terrible idea—both memory-wise and kernel-wise. I’ll spend the rest of my evening hammering out an article on this, so you don’t have to waste your time going down that rabbit hole.

1) The Memory Problem

To understand this issue, we need to dive into how pthread works under the hood. When a thread is created using pthread_create, this function internally calls the clone system call. The clone syscall creates a new thread as a child of the current process, similar to fork, but with more granular control over what is shared between the parent and the child process. Unlike fork, which duplicates the entire process (including memory space), clone allows the new thread to share resources like memory, file descriptors, and more with the parent process.

In the context of pthread, clone is used to create a thread that runs within the same memory space as the parent, meaning all threads in a process can access the same global memory. However, each thread needs its own stack space, which is where pthread_attr_t comes in. This structure allows you to specify attributes like the size of the stack for the new thread.

If you pass NULL to pthread_create for the pthread_attr_t parameter (as many people do, especially in simple projects), the thread will use default attributes, including a default stack size. you can see this with the function pthread_attr_getstacksize

#include <stdio.h>
#include <pthread.h>
 
int main() 
{
    pthread_attr_t attr;
    size_t stack_size;
 
    pthread_attr_init(&attr);                     // Initialize thread attributes
    pthread_attr_getstacksize(&attr, &stack_size);// Get default stack size
 
    printf("Default stack size: %zu bytes\n", stack_size);
    return 0;
}

> ./check_stack_size
Default stack size: 8388608 bytes

The default stack size is 8 MB. Now, let’s do some quick math. For 200 threads, the total memory used just for stack space would be:

t o t a l m e m or r y s i z e = 200 \times 8 M o = 1600 M o = 1, 6 G o

For a typical PC with 8 GB of RAM, with a few browser tabs open and your code editor running, you’re probably already using 4 to 5 GB of RAM. Adding another 1.6 GB just for the threads in this exercise isn’t trivial—it’s a big chunk of your available memory. That’s memory you could be using for other things, like actually running your programs, compiling code, or, you know, keeping your computer from crashing.

Now, imagine if you bump that number up to 400 or 1000 philosophers or MAX_INT because why not ? 🙃 It’s not just a bad idea, it’s a recipe for disaster. Your system could start swapping memory to disk, or worse, grind to a halt under the pressure. You’d be spending more time watching your system struggle than actually solving the problem. And that’s why we put a sensible limit on the number of philosophers so you can focus on the logic and synchronization issues without bringing your entire system to its knees.

However, because of the lazy allocation strategy, the actual physical memory used will depend on how much of the stack each thread actually utilizes. The memory is virtually allocated initially, which means it doesn’t immediately occupy physical RAM. In practice, this often results in much lower memory usage than the theoretical maximum. For example, if the threads only use 10% of their stack space on average, the actual physical memory consumption could be closer to 160 MB instead of 1.6 GB.

This lazy allocation mechanism helps mitigate the impact on memory usage, but it’s still crucial to be mindful of the potential memory requirements when creating a large number of threads especially in resource-constrained environments or when threads are expected to use significant stack space.

code snippet pthread_create

2) Thread limits

By default, the maximum number of threads you can have simultaneously on Linux is 4096. You can check the current number of threads running on your system with the following command:

ps -eo nlwp | awk '{ num_threads += $1 } END { print num_threads }'

On my machine, I’ve already got 1505 threads running. Now, I’m no math genius, but I can assure you that 1505 + MAX_INT is not under 4096 .

I KNOW ! you could increase this limit using ulimit -u, but let’s be real here: if someone’s testing more than 200 philosophers despite it being clearly marked on the correction criteria they aren’t exactly “the CPU with the highest clock speed.” They’re probably the kind of person who thinks more cores automatically means faster code, even if they don’t know what a context switch is.

3) Overhead scheduling

When you ramp up the number of threads, you’re not just increasing the complexity of your program you’re also ramping up the overhead of context switching. What’s context switching, you ask? It’s when the CPU switches from one thread to another, and it’s not free. Each switch involves the CPU saving the state of the current thread and loading the state of the next one. The more threads you have, the more often this happens, and the more time your CPU spends just shuffling between threads instead of actually doing useful work.

Cache Misses: A Hidden Cost of Context Switching

Earlier, we discussed how each thread in a program requires its own stack space, consuming significant memory, especially when many threads are created. But the memory issue doesn’t stop there. When a thread is context-switched out, the data it was working on is likely still stored in the CPU cache. However, when a new thread is switched in, this new thread might need different data that isn’t in the cache, leading to a cache miss.

A cache miss occurs when the CPU has to fetch data from the slower main memory instead of the faster cache. This can significantly slow down the execution of the new thread. The more threads you have, each with their own stack and data, the more likely it is that the data the CPU needs will not be in the cache when it switches to another thread. This increased likelihood of cache misses is a direct consequence of having a large number of threads, each with its own distinct memory footprint. I might explain it in more detail here

CPU Memory

Quantifying the Overhead

To understand just how costly context switching can be, we can calculate the overhead incurred per switch with a simple formula:

Overhead per switch = \frac{Total CPU time in kernel mode (seconds)}{Number of context switches}

with the perf tool we can have data on context switch for the process

> sudo perf stat -e sched:sched_switch ./philo 20 410 100 100 10000

perf stat: Collects performance statistics.
-e sched:sched_switch: Tracks the number of context switches (when the CPU switches between threads).

 Performance counter stats for './philo 20 410 100 100 10000':
 
     38,501,336      sched:sched_switch
 
     139.977705937 seconds time elapsed
 
     42.839705000 seconds user
     127.390391000 seconds sys

Data for 20 Philosophers:

Number of context switches: 38,501,336
CPU time in kernel mode: 127.3904 seconds
Overhead per context switch:

O v er h e a d p er s w i t c h = \frac{127.3904 seco n d s}{38 , 501 , 336} \approx 3.31 mi cr oseco n d s

Percentage of CPU time spent on context switches:

O v er h e a d p er ce n t a g e = \frac{127.3904 seco n d s}{170.2301 seco n d s} \times 100 \approx 74.85%

Data for 200 Philosophers:

Number of context switches: 73,699,698
CPU time in kernel mode: 175.8109 seconds
Overhead per context switch:

O v er h e a d p er s w i t c h = \frac{175.8109 seco n d s}{73 , 699 , 698} \approx 2.38 mi cr oseco n d s

Percentage of CPU time spent on context switches:

O v er h e a d p er ce n t a g e = \frac{175.8109 seco n d s}{227.3031 seco n d s} \times 100 \approx 77.33%

What these BIG numbers tell us is that while context switching is unavoidable, increasing the number of threads can significantly increase this overhead. With 200 philosophers, the CPU is almost entirely consumed by the overhead, leaving very little time for actual computation. This is why there’s a practical limit to the number of threads you should be running. Pushing beyond that limit doesn’t just make your program more complex it actually makes it slower, as the CPU spends more time managing threads than doing the work those threads are supposed to accomplish.

Additionally, more threads can lead to increased cache misses. When too many threads compete for limited cache space, data has to be fetched from slower memory, further slowing down your program. So, adding more threads might seem like a good idea, but beyond a certain point, it results in more time spent on management and memory access rather than actual work.

So, next time you’re tempted to throw more threads at a problem, remember: there’s a point where you’re just making the CPU’s job harder, not faster.

Conclusion

next time be kind with your CPU ☺️

if you have other idea for stupid but important article you can contact me on ̶t̶w̶i̶t̶t̶e̶r̶ x

My X

bonsthie (@barnabait) on X

My Github

bonsthie - Overview

Quartz 4

Explorer

42 -- Why the 200 philo limit