Multi-Core processors with shared memory are now commonplace. A mid range PC standard workstation is outfitted with a quad core processor, allowing it to run four independent streams of computation, simultaneously. It is then plausable that the use of parallel programming would increace the performance of an algorithm with many mutually exclusive parts, in proportion to the number of processor streams available. The reality is that on a micro scale, the communication between independent computational elements has a cost. That cost is time. The architecture of current multi-core processors was not designed to support instantaneous signal transmission between cores. It is therefore a requirement that high level algorithms lean towards the parallel computation of relatively large batches of work. Batch work and streamlined thread coordination, are thus the logical choice for an algorithm designed to use the full computational resourses of a common multi-core personal computer. Standard C using POSIX threads has been chosen for the implementation presented below. C is a high level programming language capable of supporting objects, pseudo-classes, and aggressive low-level optimization. The "Lawrence Livermore National Laboratory" tutorial was found to be extremely helpful as an introduction to POSIX Threads Programming. It is important to note that elimination of logical errors that may occur when coordinating many independent computational streams is non-trivial and requires more than an introduction tutorial to the pthread functions.
This page was designed to layout an optimal framework for the the use of POSIX threads in a Linux environment to search for a global maximal value among an astronomical number of elements. The C code below will serve as a template for the use of multi-core CPUs to solve problems where performance is a concern. The prototype application of this scheme is to isolate beyond reasonable doubt, the ten most point dense 5x5 unique Boggle boards, out of a possible 26^25 = 236,773,830,007,967,588,876,795,164,938,469,376 boards. It is unrealistic to evaluate every board, but a highly successful search algorithm will evaluate boards very fast, and in large batches.
Batch processing requires coordination. A main thread will take care of organizing the results that come in from a specified number of worker threads. The nature of the relationship between the main thread and the worker threads is asymmetrical. There can be more than one worker thread, and they will almost certainly require different amounts of time to complete their batch work, even when they begin their work simultaneously. In order to minimize the idle time of CPU cores, the results produced by the worker threads should be processed by the main thread one by one, as soon as each worker thread sends its "work complete" signal.
POSIX threads use two objects, that when used together, allow for different threads to communicate with each other based on conditions related to protected global variables. These objects are "mutexes", and "condition variables". A "mutex" locks and unlocks protection of user specified global variables and the signals relating to them, sent using a "condition variable".
The batch processing program will use two sets of thread communication objects. The first set will be used by the main thread to signal worker threads to start work on a specified round. The second set will be used by worker threads, one at a time, to signal the main thread that they are done. This setup may at first glance appear simple and straightforward, but there is one major complication attached to this back and forth signaling. Before a signal is sent out, it is required that there is a thread out there waiting to receive the signal. To eliminate the logical error attached to signal transmission between the main and worker threads the following objects and signal coordination has been used.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Two sets of inter-thread communication variables are defined below. Each "condition variable" requires an associated "mutex". // It is the responsibility of the programmer to ensure that the global variables connected to a "mutex" are only modified under a lock condition. // Set one. pthread_mutex_t StartWorkMutex; pthread_cond_t StartWorkCondition; // The "StartWorkCondition" sends a boolean flag to request the worker threads to "WorkOn" or terminate, and a value indicating the current "Round". Bool WorkOn = FALSE; unsigned int Round = 0; // Set two. pthread_mutex_t CompleteMutex; pthread_cond_t CompleteCondition; // The worker threads need to let the main thread know which one has just sent the "CompleteCondition" signal, so it can coordinate the correct data set. // The main thread needs to let the worker threads know when it is waiting for a signal and this is communicated using "MainThreadWaiting". Bool MainThreadWaiting = FALSE; unsigned int TheCompletedBatch = BOGUS; ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// |
A lock must be placed on the "CompleteMutex" before the main thread "broadcasts" the "StartWorkCondition" so that it can enter a "waiting" state for the "CompleteCondition" signal that will be sent out as soon as a worker thread finishes.
A lock must be placed on the "StartWorkMutex" before a worker thread "signals" the "CompleteCondition" so that it can enter a "waiting" state for the "StartWorkCondition" broadcast that will be sent out as soon as the main thread coordinates the next batch of work.
A method is required to ensure that every "CompleteCondition" signal reaches the main thread before another worker thread sends out the next "CompleteCondition" signal. In short, a lock on the "CompleteMutex" to modify "TheCompletedBatch" must be acquired only when the main thread has confirmed that it is in a waiting state. If this condition is not enforced, signals will get lost because several of them will be sent out before the main thread is waiting for a signal. The boolean variable "MainThreadWaiting" does just that.
A flow diagram of the thread communication framework is presented below:
I am posting the file "WorkerThreads.c" as a template for pthread batch processing. The current setup simulates work by using the floating point sqrt() function many times. Also, remember that I converted the C file into an HTML format, so download the untampered with file here... WorkerThreads.c.
Use the following command to compile the program in a Linux terminal...
"-lm" Includes the "math.h" library, "-O3" is a compiler optimization flag, and "-pthread" includes the multi-thread functions in "pthread.h". It is helpful to run "htop" in another terminal to monitor the usage of the CPU cores running the program.
// This program provides a layout for batch processing and coordiantion using the C's pthread constructs. // The asymmetry of this high level program arcitecture stems from all worker threads starting at the same time, but requiring different amounts of time to complete. #include <stdlib.h> #include <stdio.h> #include <math.h> #include <string.h> #include <pthread.h> typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; // To introduce a high level of unpredictability, use many more worker threads than the number of cores inside of the CPU that you intend to use. #define NUMBER_OF_WORKER_THREADS 30 #define NUMBER_OF_WORKING_ROUNDS 4 #define BOGUS 99 ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Two sets of inter-thread communication variables are defined below. Each "condition variable" requires an associated "mutex". // It is the responsibility of the programmer to ensure that the global variables connected to a "mutex" are only modified under a lock condition. // Set one. pthread_mutex_t StartWorkMutex; pthread_cond_t StartWorkCondition; // The "StartWorkCondition" sends a boolean flag to request the worker threads to "WorkOn" or terminate, and a value indicating the current "Round". Bool WorkOn = FALSE; unsigned int Round = 0; // Set two. pthread_mutex_t CompleteMutex; pthread_cond_t CompleteCondition; // The worker threads need to let the main thread know which one has just sent the "CompleteCondition" signal, so it can coordinate the correct data set. // The main thread needs to let the worker threads know when it is waiting for a signal and this is communicated using "MainThreadWaiting". Bool MainThreadWaiting = FALSE; unsigned int TheCompletedBatch = BOGUS; ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // This function represents an independent stream of computation that works on a batch of data. // For now the work is simply busy-work using the "sqrt()" function on "double" floating point numbers. void *WorkerThread(void *ThreadArgument){ unsigned int X; double BusyWork; double BusyWorkTwo; long unsigned int ThisThreadNumber = (long unsigned int)ThreadArgument; // Enter a waiting state for the "StartWorkCondition". pthread_mutex_lock(&StartWorkMutex); pthread_cond_wait(&StartWorkCondition, &StartWorkMutex); while ( TRUE ) { // It is necessary to unlock the "StartWorkMutex" before thread termination, so that the other worker threads can also terminate. if ( WorkOn ) { printf("Thread-%lu: Begin Work On Batch %d\n", ThisThreadNumber, Round); pthread_mutex_unlock(&StartWorkMutex); } else { pthread_mutex_unlock(&StartWorkMutex); pthread_exit(NULL); } // This is the location where batch processing work will be carried out. Right now it is busy-work. for ( X = 0; X < 45000000; X++ ) { BusyWork = (double)X + 50.75; BusyWork = sqrt(BusyWork); BusyWorkTwo = sqrt(BusyWork + (double)X); } // Get a lock on "CompleteMutex" and make sure that the main thread is waiting, then set "TheCompletedBatch" to "ThisThreadNumber". Set "MainThreadWaiting" to "FALSE". // If the main thread is not waiting, continue trying to get a lock on "CompleteMutex" unitl "MainThreadWaiting" is "TRUE". while ( TRUE ) { pthread_mutex_lock(&CompleteMutex); if ( MainThreadWaiting == TRUE ) { // While this thread still has a lock on the "CompleteMutex", set "MainThreadWaiting" to "FALSE", so that the next thread to maintain a lock will be the main thread. MainThreadWaiting = FALSE; break; } pthread_mutex_unlock(&CompleteMutex); } TheCompletedBatch = ThisThreadNumber; // Lock the "StartWorkMutex" before we send out the "CompleteCondition" signal. // This way, we can enter a waiting state for the next round before the main thread broadcasts the "StartWorkCondition". pthread_mutex_lock(&StartWorkMutex); printf("Thread-%lu: Completed Batch %d\n", ThisThreadNumber, Round); pthread_cond_signal(&CompleteCondition); pthread_mutex_unlock(&CompleteMutex); // Wait for the Main thread to send us the next "StartWorkCondition" broadcast. // Be sure to unlock the corresponding mutex immediately so that the other worker threads can exit their waiting state as well. pthread_cond_wait(&StartWorkCondition, &StartWorkMutex); } } int main(){ long unsigned int Identity; unsigned int X; unsigned int Y; // Initialize all of the thread related objects. pthread_t Threads[NUMBER_OF_WORKER_THREADS]; pthread_attr_t attr; pthread_mutex_init(&CompleteMutex, NULL); pthread_cond_init (&CompleteCondition, NULL); pthread_mutex_init(&StartWorkMutex, NULL); pthread_cond_init (&StartWorkCondition, NULL); /* For portability, explicitly create threads in a joinable state */ pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); // Create all of the worker threads here. for ( Identity = 0; Identity < NUMBER_OF_WORKER_THREADS; Identity++ ) pthread_create(&Threads[Identity], &attr, WorkerThread, (void *)Identity); // Allow time for the worker threads to start up and wait for the signal that is going to be sent out soon. printf("Wait for 1 second for the %d worker threads to enter a waiting state.\n\n", NUMBER_OF_WORKER_THREADS); sleep(1); // Broadcast the start work signal to all of the waiting worker threads, and recieve the complete signals one at a time. for ( X = 0; X < NUMBER_OF_WORKING_ROUNDS; X++ ) { // Lock the "StartWorkMutex" to set the global work coordination variables. pthread_mutex_lock(&StartWorkMutex); // Not a whole lot to coordinate in this framework demo. WorkOn = TRUE; Round += 1; printf("Main: Broadcast Signal To Start Batch |%d|\n", Round); // Lock the "CompleteMutex" so we can start waiting for completion before any of the worker threads finish their batch. pthread_mutex_lock(&CompleteMutex); pthread_cond_broadcast(&StartWorkCondition); pthread_mutex_unlock(&StartWorkMutex); for ( Y = 0; Y < NUMBER_OF_WORKER_THREADS; Y++ ) { // Before entering a waiting state, set "MainThreadWaiting" to "TRUE" while we still have a lock on the "CompleteMutex". // Worker threads will be waiting for this condition to be met before sending "CompleteCondition" signals. MainThreadWaiting = TRUE; pthread_cond_wait(&CompleteCondition, &CompleteMutex); printf("Main: Complete Signal Recieved From Thread-%d\n", TheCompletedBatch); // This is where partial work on the batch data coordination will happen. All of the worker threads will have to finish before we can start the next batch. } pthread_mutex_unlock(&CompleteMutex); } pthread_mutex_lock(&StartWorkMutex); // Set the GAME OVER condition. WorkOn = FALSE; printf("Main: Broadcast The Termination Signal\n"); pthread_cond_broadcast(&StartWorkCondition); pthread_mutex_unlock(&StartWorkMutex); // Wait for all threads to complete, and then join with them. for ( X = 0; X < NUMBER_OF_WORKER_THREADS; X++ ) { pthread_join(Threads[X], NULL); printf("Main: Thread[%d] Has Been Joined And Terminated.\n", X); } // Clean up and exit. pthread_attr_destroy(&attr); pthread_mutex_destroy(&CompleteMutex); pthread_cond_destroy(&CompleteCondition); pthread_mutex_destroy(&StartWorkMutex); pthread_cond_destroy(&StartWorkCondition); pthread_exit (NULL); } |