Parallel processing 3

Status
Not open for further replies.

Chris Miller

Programmer
Oct 28, 2020
Within thread184-1820648 a lot of ideas were posted about the need Griff has to process data to very many PDFs. In the end, it has become a bit too convoluted for me, and I don't hide the fact I'm embarrassed by the confusion I had and caused already, so I opted out.

But I wanted to pull together some thoughts about performance and responsiveness improvements by parallel processing. It doesn't really fit into the Tek-Tips categories of question or tip, and it's not really just news either, but I have to pick something for a post that is neither a question nor a tip.

Parallel processing in either form, multi-threading or multiple processes, is something that can always be used on today's computers, since multiple CPU cores have become the norm, while VFP is still just single-threaded.

The idea of using multiple computers to produce the many PDFs necessary was questioned in thread184-1820648 because of the overhead that arises from organizing this, such as splitting up the data. I'm not saying that's wrong. Yes, there is usually organizational overhead that keeps the acceleration factor below the number of computers used or the number of CPUs or CPU cores. But usually, you get a factor of much more than 1, and even if you only end up with N-1, it pays to make that effort.
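To put rough numbers on that argument, here is a minimal sketch in Python (not VFP, just so the arithmetic can be checked anywhere). It assumes a very simple model that is not from the thread: a fixed serial overhead for splitting and merging, plus work that divides evenly over the workers. The function name `speedup` and the sample figures are made up for illustration.

```python
def speedup(total_work: float, serial_overhead: float, workers: int) -> float:
    """Speedup of a run split over `workers`, assuming a fixed serial overhead.

    total_work      -- time the job takes in one process without any splitting
    serial_overhead -- assumed extra time for splitting data and collecting results
    """
    parallel_time = serial_overhead + total_work / workers
    return total_work / parallel_time


# Hypothetical figures: 100 time units of work, 10 units of organizational overhead.
for n in (1, 2, 4, 8):
    print(n, round(speedup(100.0, 10.0, n), 2))
```

In this model the factor always stays below the number of workers, but with several workers it is still well above 1, which is the point made above.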

I stated, having made the first experiments by setting process affinities manually with the task manager, that this worked out as I expected it. Then I looked into doing the same thing programmatically and posted this code:
Code:
Declare INTEGER GetCurrentProcess In Kernel32

Declare INTEGER GetProcessAffinityMask In Kernel32 ;
   INTEGER hProcess, STRING @lpProcessAffinityMask, STRING @lpSystemAffinityMask

Declare INTEGER SetProcessAffinityMask In Kernel32 ;
  INTEGER hProcess, INTEGER dwProcessAffinityMask
  
Declare Integer GetLastError in WIN32API
     
Local lnProcessHandle, lcPA, lcSA, lnPA, lnSA, lnCPU
lnProcessHandle = GetCurrentProcess()
lcPA = Space(4)
lcSA = Space(4)

Clear
If GetProcessAffinityMask(lnProcessHandle,@lcPA, @lcSA) = 1
   ? 'Affinity Masks (Process, System):'
   ? CreateBinary(lcPA), CreateBinary(lcSA)
   
   * Translate System affinity mask string to number
   * (DWORD_PTR is 4 bytes in a 32-bit process such as VFP9, hence 4-byte buffers and "4RS").
   * It will have bits set for all available cores, i.e. for 4 cores it will be 0h0F000000 = 15
   lnSA = CToBin(lcSA,"4RS")
   lnCPU = 2 && 0..3 for 4 cores
   lnPA = Bitand(Bitset(0,lnCPU),lnSA) && Bitand with lnSA ensures no unavailable CPU core bit is set
   * Could also set multiple CPUs by setting multiple bits, i.e. Affinity Mask=15 would mean no specific CPU core affinity.
   
   If lnPA>0 and SetProcessAffinityMask(lnProcessHandle,lnPA) = 1
      ? 'OK: Process affinity set to CPU '+Alltrim(Str(lnCPU))
   Else
      If lnPA=0
         ? 'Error: CPU number not available according to System Affinity Mask.'
      Else 
         ? 'Error:',GetLastError()
      Endif
   EndIf
 
   If GetProcessAffinityMask(lnProcessHandle,@lcPA, @lcSA) = 1
      ? 'Affinity Masks (Process, System):'
      ? CreateBinary(lcPA), CreateBinary(lcSA)
   EndIf
EndIf

And this works out okay, too.
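The mask arithmetic in that code is independent of VFP, so here is a minimal sketch of it in Python. It only mirrors the `Bitand(Bitset(0,lnCPU),lnSA)` step and makes no Windows API call; the helper name `single_core_mask` and the 4-core system mask are assumptions for illustration.

```python
def single_core_mask(core: int, system_mask: int) -> int:
    """Return a mask with only bit `core` set, or 0 if that core is not available.

    Mirrors Bitand(Bitset(0, lnCPU), lnSA) from the VFP code above.
    """
    return (1 << core) & system_mask


system_mask = 0b1111  # assumed: 4 logical cores, bits 0..3 set

print(bin(single_core_mask(2, system_mask)))  # core 2 exists, so only bit 2 is set
print(single_core_mask(5, system_mask))       # core 5 does not exist, result is 0
```

A result of 0 corresponds to the "CPU number not available" branch in the VFP code, which is why the mask is checked before calling SetProcessAffinityMask.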

So I think this is the next logical step on top of using multiple processes and relying on the OS to balance the load of the single cores. This ability to set CPU core affinities also can be merged into what I posted about multiprocessing in thread184-1820019.

I once more looked into ParallelFox, a library far more advanced in features and more mature on that topic. I don't find any usage of processor affinity in the ParallelFox source code. That's what you could file under "news".

Maybe Joel Leach, the project manager and only contributor I see on GitHub, already tested this and found that it doesn't improve the effectiveness of the worker concept. The only aspect related to the CPU core number I find in the source code is that ParallelFox takes the number of CPU cores as the default number of workers. It's sensible to think that the multiple processes will make balanced use of the cores by themselves without manipulating CPU affinities. By default, any process, including VFP executable processes, has no specific affinity. This means its affinity mask has all CPU core bits set, so it is not limited in which core it uses. That can also be interpreted to mean that any other affinity will just lower the ability of a process to make use of all CPU cores available at any time.
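To make the "all bits set" default concrete, here is a small Python sketch. It is only bit arithmetic, with no OS call, and the helper names `full_mask` and `allowed_cores` are made up for illustration.

```python
def full_mask(core_count: int) -> int:
    """Mask with one bit set per core, i.e. no specific affinity."""
    return (1 << core_count) - 1


def allowed_cores(mask: int) -> list:
    """List the core numbers a process with this mask may run on."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


default_mask = full_mask(4)           # assumed 4-core machine
print(allowed_cores(default_mask))    # every core is allowed
print(allowed_cores(0b0100))          # restricted to a single core
```

Any mask other than the full one shortens the list of allowed cores, which is the trade-off described above.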

I expect setting affinity to one specific core to actually improve the parallelism of processing. Besides lowering the overhead of switching cores, it would simply ensure that N processes use N different cores and therefore run in parallel by definition, while the OS might otherwise assign some cores to more than one worker, so you don't get the full parallelism of all cores for all workers.

I already opted out of making multiprocessing with COM a new project in itself; this time I might take on the effort of modifying ParallelFox.

Chriss
 
Done, but not tested in comparison without affinity.

And before I make a fork and a pull request to Joel Leach on GitHub, below I post the changes you can apply yourself, if you're a ParallelFox user willing to test this. Some remarks, first, though:

1. I've seen ParallelFox actually does start a COM Server "ParallelFox.Application" - or in the IDE/debug mode it uses VisualFoxPro.Application.9 (or your current version), so it's not far from what I proposed in using COM for multiprocessing.

There is a certain elegance in ParallelFox's COM Server implementation: It's actually implementing nothing at all. It mainly starts a new process and makes its _VFP system variable available to a worker, which then is used to execute in that COM server instance process.

It's done based on the idea there never is a change in the ParallelFox.Application type library. And that makes it easy to update it without registry changes and even use multiple versions. At first, I was tempted to change exactly that ParallelFox.Application class to set CPU affinity in its Init() code. But then this only covers the use case of ParallelFox.Application, not of starting VisualFoxPro.Application.9 in the IDE or in debug mode.

2. I found a good place to change, the ParPoolMgr class. That's the parallel pool manager, which starts the workers and maintains a queue of commands the workers will be processing. I would have subclassed it, but then this would need further changes, as the main class which programmers use from ParallelFox - the "parallel" class - has hard-coded the usage of the ParPoolMgr class in its Init(). So I decided to make my change in the ParPoolMgr class instead of subclassing it.

In my change, all workers the ParPoolMgr class creates are given a script to set their own CPU affinity as their first queued command. That's, in very short, the only enhancement I made.

3. Looking into the opinions of experts on CPU affinity, it checks out that VFP processes, being single-threaded by nature, are a good use case for setting CPU affinity. Otherwise, Windows experts say, the Windows OS dispatcher will assign threads to the same CPU anyway, as keeping a thread on a given processor retains that thread's status information in the processor cache, improving the performance of that thread.

There is a mode of ParallelFox allowing you to start a new thread by using an MTDLL mode. That can make the best use of CPUs if a process isn't limited to one core only. I haven't yet experimented with that scenario. It means the CPU affinity should be made an optional mode.

In fact, as I said initially, I haven't yet compared parallel processing with and without CPU affinity assignment. I have to come up with a good case in which this will matter and likely differ. That should be something doing heavy number crunching, not parallel work on files, where file access would be the bottleneck no matter whether CPU affinity is used or not.

But I'm as far as having stable code that I verified to set the affinities and now share that with you here, first:
In ParallelFox.pjx in the classes tab expand the parallelfox library and modify the parpoolmgr class. In the code editor pick the method "StartWorkers", which originally will be this:

Code:
* Start worker processes
* Same EXE is used for all workers
Lparameters lcProcedureFile, lcDirectory, llDebugMode
Local lnWorker, loWorkerProxy as WorkerProxy of ParallelFox.vcx

Debugout Time(0), Program(), lcProcedureFile, lcDirectory, llDebugMode

If This.Workers.Count > 0
	Assert .f. Message "Workers already started."
	Return
EndIf 

This.lDebugMode = llDebugMode

For lnWorker = 1 to This.nWorkerCount
	loWorkerProxy = NewObject("WorkerProxy", "ParallelFox.vcx", "", lcProcedureFile, lcDirectory, llDebugMode, lnWorker, This.lMTDLL)
	ComArray(loWorkerProxy, 11)
	This.Workers.Add(loWorkerProxy)
EndFor

Append the following code after the end:

Code:
*--------------------------------------------------------------------------------------------
* Enhancement for CPU affinity by Chriss Miller 
*--------------------------------------------------------------------------------------------

* After workers are created, let all of them set their own CPU affinity.

*--------------------------------------------------------------------------------------------
* Step 1: Determine CPU core (group) bit count
*--------------------------------------------------------------------------------------------

* For the process affinity mask we can't assume there are as many bits as nWorkerCount.
* That defaults to GetEnv("NUMBER_OF_PROCESSORS"), but could also be set higher. The number 
* of mask bits is also limited to 64, even with CPUs with more than 64 logical cores.
*
* We, therefore, need to determine the most significant bit of the system CPU affinity mask.
* Each bit will be a CPU core in the normal case, but a CPU core group, if processors have
* more than 64 logical cores.
*
* Determine CPU bit count by System CPU affinity mask bits:
Declare Integer GetCurrentProcess In Kernel32
Declare Integer GetProcessAffinityMask In Kernel32 ;
   Integer hProcess, String @lpProcessAffinityMask, String @lpSystemAffinityMask

Local lnProcessHandle, lcPA, lcSA, lnPA, lnSA, lnMSBCPU, lnCPUbitCount
lnProcessHandle = GetCurrentProcess()

lcPA = Space(4)
lcSA = Space(4)

If GetProcessAffinityMask(m.lnProcessHandle,@m.lcPA, @m.lcSA) = 1
   lnSA = CToBin(lcSA,"4RS")
Else
   lnSA = 1
Endif

For lnMSBCPU = 31 To 0 Step -1
   If Bittest(m.lnSA, m.lnMSBCPU)
      Exit
   Endif
Endfor
lnCPUbitCount = Max(m.lnMSBCPU+1, 1)
* By the way: If no bit in the system affinity mask is set, the loop ends without Exit and
* lnMSBCPU is -1, so Max() keeps the result at 1 CPU bit and the modulo in the script stays valid.
* Also when GetProcessAffinityMask fails, the fallback of lnSA=1 is used.


*--------------------------------------------------------------------------------------------
* Step 2: Prepare a script making each worker set its own affinity.
*--------------------------------------------------------------------------------------------

* Uses a "superglobal variable", a property of _screen of this process, the main process...
_Screen.AddProperty("CPUbitCounter",0)
* This counter will just increase with each worker setting its own CPU affinity.
* Also in case workercount is higher than the number of cores or core groups, of course.
*
* It will ensure only valid affinity mask bits are used, as it will take count%systemmaskbitcount
* and thus always be between 0 and the most significant available bit for the process 
* affinity mask.
*
* In the corner case, if more than 64 CPU cores are represented as core groups and
* more than 64 workers are created, the affinity of worker 1 and worker 65 will both be 
* CPU group 0 and so on, so each CPU group bit then is used as many times as the number 
* of cores it represents. 
*
* In short: The CPU core usage will be balanced in such cases, too.

Local lcScript
Text To lcScript Noshow
   Lparameters tnCPUbitCount

   Declare Integer GetLastError In Kernel32
   Declare Integer CreateMutex In Kernel32 Integer lnAttributes, Integer lnOwner, String @lcAppName
   Declare Integer CloseHandle In Kernel32 Integer lnMutexHandle
   Declare Integer GetCurrentProcess In Kernel32
   Declare Integer SetProcessAffinityMask In Kernel32 Integer hProcess, Integer dwProcessAffinityMask

   Local lnMutexHandle, loMainProcessScreen, lnCPUbit
   Do While .T. && Assertion: This will always have an end by Exit below.
      * Critical section: Only one process at a time should access the main process _Screen.CPUbitCounter
      * Ensured by using a Mutex
      lnMutexHandle = GetMutexHandle("ParallelFoxMainProcessCPUbitCounter")
      If lnMutexHandle<>0
         * work on the main process _screen exclusively
         loMainProcessScreen = _Screen.oMainprocess.oMainVFP.Eval('_Screen')
         lnCPUbit = loMainProcessScreen.CPUbitCounter % m.tnCPUbitCount
         loMainProcessScreen.CPUbitCounter = loMainProcessScreen.CPUbitCounter + 1
         
         && We're done. Allow other processes access to the main process CPUbitCounter.
         CloseHandle(lnMutexHandle) 
         Exit
      Endif
   EndDo

   Local lnProcessHandle
   lnProcessHandle = GetCurrentProcess()
   =SetProcessAffinityMask(m.lnProcessHandle,Bitset(0,m.lnCPUbit))
   
   #Define ERROR_ALREADY_EXISTS 183
   Procedure GetMutexHandle(tcLockname as String)
      Local lnMutexHandle
      lnMutexHandle = CreateMutex(0, 1, @m.tcLockname)
      If GetLastError() = ERROR_ALREADY_EXISTS && Mutex already exists
         CloseHandle(m.lnMutexHandle)
         lnMutexHandle = 0
      Endif

      Return m.lnMutexHandle
   Endproc
EndText
loParameters = This.CreateParameterObject(1,m.lnCPUbitCount)

* Script will be queued for all workers as the first thing they each do
#DEFINE ALL_WORKERS .T.
This.QueueCommand("ExecScript", lcScript,,,,loParameters, ALL_WORKERS, Newobject("Events", "ParallelFox.vcx"))
* The implementation of QueueCommand creates an entry for each worker assigned with its worker number
* So each worker executes this once and it's guaranteed all workers get their affinity set.

Now recompile the ParallelFox.exe and your workers will be assigned to the different CPU cores.
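For anyone who wants to check the assignment logic without rebuilding ParallelFox, here is a minimal Python sketch of the two ideas in the enhancement: deriving the bit count from the system mask, and handing out bits round-robin from a shared counter guarded by a lock. It uses threads and `threading.Lock` as a stand-in for the worker processes and the named mutex, which is a simplification; all names in it are hypothetical and nothing here calls ParallelFox or the Windows API.

```python
import threading


def cpu_bit_count(system_mask: int) -> int:
    """Position of the most significant set bit plus one, but at least 1."""
    return max(system_mask.bit_length(), 1)


class TicketCounter:
    """Shared counter; the lock plays the role of the mutex in the VFP script."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = 0

    def take(self) -> int:
        with self._lock:          # critical section: read and increment together
            ticket = self._next
            self._next += 1
            return ticket


def assign_masks(worker_count: int, system_mask: int) -> list:
    """Let each simulated worker pick its own single-bit affinity mask."""
    bits = cpu_bit_count(system_mask)
    counter = TicketCounter()
    masks = []
    masks_lock = threading.Lock()

    def worker() -> None:
        mask = 1 << (counter.take() % bits)   # same idea as Bitset(0, lnCPUbit)
        with masks_lock:
            masks.append(mask)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(masks)


# 6 workers on an assumed 4-core system: bits 0 and 1 get used twice.
print(assign_masks(6, 0b1111))  # [1, 1, 2, 2, 4, 8]
```

Which worker ends up with which bit depends on scheduling, just as in the real script, so the sketch only compares the sorted list of masks.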

Chriss
 
I realize an error sneaked in as I only test bits 0-31 with BitTest(). That's because BitTest() itself only allows testing the 32 bits 0 to 31 and at first I didn't bother as I only have 4 cores. But the affinity masks are 64bit = 8 Bytes, so this needs a little fix also in determining the PA and SA. Finally converting the 8 byte masks into two 4 byte values and going through them is all it takes.

Edit:
It's a false alarm. For 32bit processes like VFP9 and VFP9 executables the DWORD_PTR is just a 32bit value. That's because a DWORD_PTR has 32bit in a 32bit process (double word with word meaning 16bit), in a 64bit process it has indeed 64bit. Using a SPACE(8) buffer the right 4 bytes of it remain at 4 spaces, only the lower 32bit are set. So the code works fine as is.
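The buffer behaviour described here can be simulated outside VFP. The following Python sketch assumes what was observed above: a 32-bit call writes only the first 4 bytes (little-endian) of an 8-byte buffer filled with spaces. The write is simulated with `struct.pack`; no Windows API is called, and the helper name `read_mask_32` is made up.

```python
import struct


def read_mask_32(buffer: bytes) -> int:
    """Interpret the first 4 bytes as a little-endian unsigned 32-bit mask."""
    return struct.unpack_from("<I", buffer, 0)[0]


buffer = bytearray(b" " * 8)                 # like SPACE(8) in VFP
buffer[0:4] = struct.pack("<I", 0b1111)      # simulated 32-bit write for 4 cores

print(read_mask_32(bytes(buffer)))           # 15
print(bytes(buffer[4:]))                     # the last 4 bytes are still spaces

# Reading all 8 bytes as one 64-bit value would mix the untouched spaces into the mask.
as_64_bit = struct.unpack("<Q", bytes(buffer))[0]
print(as_64_bit == 0b1111)                   # False
```

So in a 32-bit process a 4-byte buffer and a 4-byte conversion are the matching pair, which is what the posted code uses.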

As the documentation talks of a semantic change of the affinity masks when a system has more than 64 CPU cores, I automatically deduced that a system with 64 cores still makes normal use of the mask bits and thus that the masks are 64-bit values. They are only 64 bits long in the 64-bit version of the function. See the description of the data types of Windows API functions here:
Windows Data Types said:
DWORD_PTR
An unsigned long type for pointer precision. Use when casting a pointer to a long type to perform pointer arithmetic. (Also commonly used for general 32-bit parameters that have been extended to 64 bits in 64-bit Windows.)

There is a related issue with the dotnet Environment.ProcessorCount property. Its description, besides covering that problem, also makes clear that the length of the affinity masks differs between 32-bit and 64-bit processes.

Issues said:
I2. For 32-bit processes GetProcessAffinityMask caps the returned process affinity mask to 32 bits...

So this would only need adjustment if using a 64-bit VFP9 version. The only unresolved question is what happens on a system with more than 32 cores.

Chriss
 