Discussion:
UI display appears sluggish in multi-threaded app
Woody
2011-06-19 07:48:54 UTC
I have an MFC app which does very compute-intensive calculations. It
is multi-threaded, with the UI in the main thread, and several worker
threads which do the calculating. The UI thread does nothing except
report on the status of the workers. It gets this status by accessing
shared memory.

The UI thread is supposed to update some numbers on the display every
few seconds, so there is a timer to redraw the screen. With all the
threads at default priorities, what happens is that the screen update
only occurs after a long delay, 30 sec or more.

How should I arrange things so the UI thread displays when the timer
tells it to (every 3-5 sec), yet does not use any CPU cycles at other
times? Keyboard/mouse response is secondary, although if it doesn't
"cost" anything, it would be a plus.
ScottMcP [MVP]
2011-06-19 13:52:10 UTC
The GUI thread does not normally use any CPU cycles except while it is
processing a message. You should not have to do anything special to
get this behavior. Your message handlers are supposed to return to MFC
quickly. Are you polling something? One suspects you are executing
some kind of long loop in the GUI thread. You would have to show how
you handle the timer and the painting for further diagnosis.
Woody
2011-06-19 17:46:52 UTC
Post by ScottMcP [MVP]
Are you polling something? One suspects you are executing
some kind of long loop in the GUI thread. You would have to show how
you handle the timer and the painting for further diagnosis.
UI thread does SetTimer(1,3000,NULL) to start the timer. OnTimer gets
the timer interrupts and does some simple drawing and a few
calculations.

AFAIK, there is no long loop, and no polling in the UI thread.
However, there are 8 worker threads calculating at top speed. Could
the shared memory be an issue, if the workers are constantly updating
a counter, and the UI thread must read it?
Joseph M. Newcomer
2011-06-19 20:28:00 UTC
Depends how you are handling the shared memory. As I indicated earlier, I find any such
solution deeply suspect. So in this case, you would have to show how you are handling the
shared memory. Note also that if you have eight threads, you are going to saturate any
machine with fewer than 9 cores.
joe
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
Joseph M. Newcomer
2011-06-19 23:39:51 UTC
Note that I am assuming that at no point in this whole execution do you execute a
WaitFor... API call in the main GUI thread. That would be Very Bad. It would be
particularly bad if you were using a mutex or a CRITICAL_SECTION to control access to the
"shared memory" (EnterCriticalSection == WaitFor... in terms of overall badness)

joe
Woody
2011-06-20 17:37:20 UTC
In answer to Joe's points:

1) The shared memory is a few integer values. Since these are updated
very frequently by the worker threads, it would be expensive to use
messages. It also seems unnecessary.
2) There are indeed more threads (9) than CPUs (8). When designing
this, I assumed the GUI thread would only run when a screen update was
needed.
3) There are no WaitFor calls or critical sections in the GUI thread.
In the worker threads, synchronization is done with interlocked
operations (e.g., add, exchange), wherever possible, and otherwise,
with critical sections. Most of the critical sections in the worker
threads are synchronizing access to shared files.
4) I have not used any priority changes, other than specifying normal
priority when starting the worker threads.
5) Since the only messages being used are the timer and paint
messages, the low priority of WM_PAINT that you diagnosed is probably
the cause of what I'm seeing. I will try adding UpdateWindow and see if
things improve.
6) I don't know how to deal with the possibility of the timer messages
not being received in a reasonable time. There is no need for real-
time, but the delay shouldn't be more than a few seconds.
Joseph M. Newcomer
2011-06-20 18:09:28 UTC
See below...
Post by Woody
1) The shared memory is a few integer values. Since these are updated
very frequently by the worker threads, it would be expensive to use
messages. It also seems unnecessary.
****
Define "expensive". For example, you are about to execute a few million instructions to
update the display. A few hundred instructions for a SendMessage? Not even measurable.

The presumption of "expensive" means you actually have a cost basis you have determined; a
vague "feeling" that it is "expensive" is not good engineering practice.
****
Post by Woody
2) There are indeed more threads (9) than CPUs (8). When designing
this, I assumed the GUI thread would only run when a screen update was
needed.
****
Yes, but it has to wait for an available CPU to run on. You might be the victim of the
scheduler here, but it is hard to tell.
****
Post by Woody
3) There are no WaitFor calls or critical sections in the GUI thread.
In the worker threads, synchronization is done with interlocked
operations (e.g., add, exchange), wherever possible, and otherwise,
with critical sections. Most of the critical sections in the worker
threads are synchronizing access to shared files.
****
OK, that's very good. It does suggest, however, that you are a victim of the scheduler.
****
Post by Woody
4) I have not used any priority changes, other than specifying normal
priority when starting the worker threads.
****
That's good. Playing with priorities is sometimes necessary, but more often it is
destructive.
****
Post by Woody
5) Since the only messages being used are the timer and paint
messages, your dx of the WM_PAINT at low priority is probably the
cause of what I'm seeing. I will try adding UpdateWindow and see if
things improve.
****
Note that it is not clear what the scheduler does when a timer message is triggered; you
might be the victim of the scheduler thinking there is nothing to do. PostMessage may
force different behavior.
****
Post by Woody
6) I don't know how to deal with the possibility of the timer messages
not being received in a reasonable time. There is no need for real-
time, but the delay shouldn't be more than a few seconds.
****
This is a constant problem, losing timer messages. Some solutions include using a
multimedia (MM) timer (timeSetEvent). Note that the timer notification is handled in a
separate thread and therefore must notify the GUI via PostMessage.

PostMessage cost is generally undetectable. The real risk of PostMessage is message queue
saturation which can induce long delays in getting things to happen, and can block timer
messages and paint messages indefinitely. Sending more than a couple messages per second
raises the risk of queue saturation, although I had to generate a couple thousand messages
per second to see seriously bad effects.

I'd suggest recording the current time in your OnTimer handler, and see if you are getting
the notifications in a, well, timely fashion. If you see huge gaps in the times, that
might suggest where the problem is.
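
A minimal sketch of that measurement, written in portable C++ rather than against the actual MFC handler. The class name TickGapRecorder and the use of std::chrono::steady_clock are assumptions for illustration; the idea is that OnTimer would call OnTick with the current time, and the largest gap between consecutive calls is kept for later inspection.

```cpp
#include <chrono>

// Hypothetical helper: records the longest gap between consecutive timer ticks.
class TickGapRecorder {
public:
    using Clock = std::chrono::steady_clock;

    // Call once per timer notification, e.g. OnTick(Clock::now()) from OnTimer.
    void OnTick(Clock::time_point now) {
        if (hasLast_) {
            const Clock::duration gap = now - last_;
            if (gap > maxGap_) maxGap_ = gap;
        }
        last_ = now;
        hasLast_ = true;
        ++ticks_;
    }

    Clock::duration MaxGap() const { return maxGap_; }
    unsigned Ticks() const { return ticks_; }

private:
    Clock::time_point last_{};
    bool hasLast_ = false;
    Clock::duration maxGap_{Clock::duration::zero()};
    unsigned ticks_ = 0;
};
```

With a 3000 ms timer, a MaxGap() far above three seconds would point at the timer notifications themselves arriving late, rather than at the painting.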
joe
****
Joseph M. Newcomer
2011-06-23 00:44:22 UTC
Some more thoughts below....
Post by Woody
1) The shared memory is a few integer values. Since these are updated
very frequently by the worker threads, it would be expensive to use
messages. It also seems unnecessary.
*****
It is not clear why you would need to send a message on every update; for example, in
computing progress bars, I send one message every 1/100 of the computation, so I only send
100 update messages.
*****
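
One way the "one message every 1/100 of the computation" idea might look, as a sketch only. RunWithProgress, doStep and report are hypothetical names; in an MFC app, report would presumably wrap a PostMessage to the GUI window. The callback fires only when the integer percentage changes, so it is called at most 100 times however many steps there are.

```cpp
#include <cstdint>
#include <functional>

// Runs `total` steps and calls `report(percent)` only when the integer
// percentage changes, so at most 100 notifications are issued.
// Assumes total is small enough that (total * 100) does not overflow 64 bits.
inline void RunWithProgress(std::uint64_t total,
                            const std::function<void(std::uint64_t)>& doStep,
                            const std::function<void(unsigned)>& report) {
    unsigned lastPercent = 0;
    for (std::uint64_t i = 0; i < total; ++i) {
        doStep(i);  // one unit of the real computation
        const unsigned percent = static_cast<unsigned>((i + 1) * 100 / total);
        if (percent != lastPercent) {
            lastPercent = percent;
            report(percent);  // hypothetical stand-in for a PostMessage
        }
    }
}
```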
Post by Woody
2) There are indeed more threads (9) than CPUs (8). When designing
this, I assumed the GUI thread would only run when a screen update was
needed.
3) There are no WaitFor calls or critical sections in the GUI thread.
In the worker threads, synchronization is done with interlocked
operations (e.g., add, exchange), wherever possible, and otherwise,
with critical sections. Most of the critical sections in the worker
threads are synchronizing access to shared files.
*****
Interlocked operations are surprisingly expensive. For example, if all eight threads are
trying to update the same variable, they will each be forced to wait for the update of
other threads to complete. That is, the InterlockedAdd, executed by eight threads, will
force a lockstep at the hardware level. This will force all the threads to wait for all
the other threads.

Then, such updates are expensive because the x86 is cache-coherent, that is, if there is a
value in the L2 cache of CPU0 and CPU1 requires that value, CPU0 is forced to write it
back to memory, CPU1 has to wait for this memory cycle to complete, and then CPU1 has to
treat it as an L2 cache miss and fetch it from memory. This is 20-200 times slower than
access to unshared memory (if you were using critical sections it would be far worse, and
mutexes would be so slow as to be completely useless). But don't think those interlocked
operations are "free"; they aren't. If there is as much updating as your point (1)
suggests, this is a potential bottleneck in performance. Another reason not to use
shared memory (if each CPU is updating data that is unique to it but shared with the GUI
thread, this would be incredibly better, but I still don't like using shared memory that
has concurrent access).
*****
Post by Woody
4) I have not used any priority changes, other than specifying normal
priority when starting the worker threads.
5) Since the only messages being used are the timer and paint
messages, your dx of the WM_PAINT at low priority is probably the
cause of what I'm seeing. I will try adding UpdateWindow and see if
things improve.
****
Note that if there is no other message traffic, WM_PAINT will be immediate.
joe
****
Post by Woody
6) I don't know how to deal with the possibility of the timer messages
not being received in a reasonable time. There is no need for real-
time, but the delay shouldn't be more than a few seconds.
Woody
2011-06-27 07:53:16 UTC
Post by Joseph M. Newcomer
It is not clear why you would need to send a message on every update; for example, in
computing progress bars, I send one message every 1/100 of the computation, so I only send
100 update messages
I am not sending any messages; I'm incrementing a few integer
counters. If I were to do something once every N steps, I would have
to have a few instructions to count to N (i.e., determine when we have
reached 1% of the computation). The counters themselves are mostly
written by multiple threads, so they need to be protected by
interlocking operations. You have a point that there may be
competition for these, causing cache writethrough, and I suppose I
could give each thread its own counter, and sum these in the GUI.

The GUI reads the counters every few seconds, so the overhead is
trivial.
Post by Joseph M. Newcomer
Interlocked operations are surprisingly expensive.  For example, if all eight threads are
trying to update the same variable, they will each be forced to wait for the update of
other threads to complete.  That is the InterlockedAdd, executed by eight threads, will
force a lockstep at the hardware level.  This will force all the threads to wait for all
the other threads.
I don't think this is what InterlockedAdd or -Increment does. It just
uses the hw to guarantee that the read-add-write is atomic. There is
no requirement that the instruction be executed in order, so this by
itself would not block other threads.
Post by Joseph M. Newcomer
Note that if there is no other message traffic, WM_PAINT will be immediate.
But only when the GUI thread runs, and if that is not occurring
regularly due to the other issues you raised, its execution could be
delayed.
Joseph M. Newcomer
2011-06-27 19:28:12 UTC
I'm not sure the cache IS a performance bottleneck, but it is a hypothesis to consider.
If the threads are compute-bound, in tight loops, the interlocking and cache update could
cost you a massive number of cycles. For example, ignoring conflict resolution, you might
lose 200 clock cycles per Interlocked operation. Eight threads contending for the same
counter at the same time could then add 1600 clock cycles overall to each loop iteration
in the worst case, since each time through the loop you can assume maximum contention.

Note that the overhead of a modulo-N counter (particularly if N is a
compile-time constant) is VASTLY lower! (The division by a constant will most
likely be done by a multiplication, which takes only a few cycles. See, for example,
http://www.flounder.com/multiplicative_inverse.htm ). So if you think the modulo-N
computation is expensive, you need to reconsider. Generally, a programmer's first guess
about performance is usually wrong unless you are a serious student of compiler technology
and an expert on the underlying architecture (if, for example, you don't know how a
pipelined superscalar with instruction prefetch, branch-prediction, speculative execution,
and hardware-managed cache coherency really works, you can't really say much about why one
code sequence might be better or worse than another). A multiply-compare-conditional
branch is probably on the order of 3-5 CPU cycles (that is, < 2ns, perhaps typically 1ns)
but an InterlockedIncrement is going to be somewhere between 100 and 200 CPU cycles
(30-60ns), so is going to be an order of magnitude (or two!) slower. (And if you say "I
looked at the code and it is really doing a DIV/IDIV instruction" you had better be
looking at the RELEASE code, not the DEBUG code; only then will I admit that the modulo-N
might be comparable to the interlocked overhead).
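
To make the trade-off concrete, here is a hedged sketch of doing the shared update only once every N steps. It is portable C++, with std::atomic::fetch_add standing in for the Win32 interlocked add; the names Worker, RunWorkers and kBatch (and its value of 1024) are illustrative assumptions, not anything from the original program. Each worker counts in a local variable and touches the shared counter only when the local count reaches kBatch, using a count-to-N compare where a modulo test would serve equally well.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Arbitrary illustrative batch size: one shared update per kBatch steps.
constexpr std::uint32_t kBatch = 1024;

inline void Worker(std::atomic<std::uint64_t>& shared, std::uint64_t iterations) {
    std::uint32_t local = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // ... one step of the real computation would go here ...
        if (++local == kBatch) {  // cheap compare on a thread-private value
            shared.fetch_add(local, std::memory_order_relaxed);
            local = 0;
        }
    }
    if (local != 0)  // flush whatever is left over
        shared.fetch_add(local, std::memory_order_relaxed);
}

inline std::uint64_t RunWorkers(unsigned threads, std::uint64_t iterationsEach) {
    std::atomic<std::uint64_t> shared{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(Worker, std::ref(shared), iterationsEach);
    for (auto& th : pool) th.join();
    return shared.load();
}
```

The displayed value can lag the true count by up to kBatch steps per thread, which should be harmless for a display refreshed every few seconds.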

Your idea of having one counter per thread (not interlocked) and having the GUI thread
read the counters and sum them would be a good alternative.

Note that you don't need locking at all, because once every few seconds the GUI thread
wakes up, and if it gets a "stale" value the count is off by at most 1, which probably won't
matter in the Great Scheme Of Things (I presume the counters, because of the frequency of
update, are going to have relatively large values in them). Since each counter is only
*modified* by at most one thread, no locking is necessary at all.
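
A sketch of the one-counter-per-thread arrangement, under some assumptions. In portable C++ a plain integer read by one thread while another writes it is formally a data race, so this uses std::atomic with relaxed loads and stores instead of a bare DWORD; on x86 these should compile to ordinary moves with no bus lock. The 64-byte cache line size and the names PaddedCounter, Bump, Sum and RunCounters are illustrative.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per worker, each on its own cache line (64 bytes is assumed)
// so that workers do not invalidate each other's lines.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Single-writer increment: a relaxed load followed by a relaxed store, not an
// atomic read-modify-write. Only valid with exactly one writer per counter.
inline void Bump(PaddedCounter& c) {
    c.value.store(c.value.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);
}

// What the GUI thread would do on each timer tick: read and sum all counters.
inline std::uint64_t Sum(const std::vector<PaddedCounter>& counters) {
    std::uint64_t total = 0;
    for (const auto& c : counters)
        total += c.value.load(std::memory_order_relaxed);
    return total;
}

inline std::uint64_t RunCounters(unsigned threads, std::uint64_t stepsEach) {
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&counters, t, stepsEach] {
            for (std::uint64_t i = 0; i < stepsEach; ++i) Bump(counters[t]);
        });
    for (auto& th : pool) th.join();
    return Sum(counters);
}
```

A sum taken while the workers are still running may be slightly stale, which matches the "off by at most 1" reasoning above; only after the workers finish is it exact.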

Note that you want the counters DWORD-aligned, so the following would be a Really Bad
Structure:

#pragma pack(1) // byte packing, NOT the default!
class whatever {
    char ch;
    DWORD counter; // ends up at offset 1, so it is misaligned
    // ... stuff
};

Interlocked operations will not work correctly unless the values are
sizeof(*target)-aligned.
joe
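
For illustration, the alignment point can be checked at compile time. This sketch uses std::uint32_t as a stand-in for DWORD, and the struct names PackedStats and NaturalStats are made up for the example. It contrasts a byte-packed layout with the default layout, and adds a static_assert of the kind one might place next to any counter used with interlocked operations.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)        // byte packing: no padding is inserted
struct PackedStats {
    char ch;
    std::uint32_t counter;   // lands at offset 1: misaligned
};
#pragma pack(pop)

struct NaturalStats {        // default packing: the compiler pads after ch
    char ch;
    std::uint32_t counter;   // lands at offset 4: naturally aligned
};

// Compile-time guard: the counter must sit on a multiple of its own size.
static_assert(offsetof(NaturalStats, counter) % sizeof(std::uint32_t) == 0,
              "counter must be naturally aligned");
```

The same static_assert applied to PackedStats would fail to compile, which is the intent of such a guard.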

Joseph M. Newcomer
2011-06-19 20:26:25 UTC
There are several potential problems here.

First, I always get suspicious when someone says "by accessing shared memory" since I
consider that an extremely dangerous practice. I'd suggest figuring out how to do it
without using shared memory.

Second, if you have more threads than CPUs, your GUI thread might be starved for cycles.

Third, timer messages are problematic; they only happen if nothing else is going on.

Fourth, if you are using PostMessage to send messages to the GUI thread, you can end up in
a nasty situation I call "PostMessage queue saturation" in which the messages sent by the
threads have flooded the main GUI thread queue and delay everything.

Fifth, if you have done any SetThreadPriority calls, you are treading on very dangerous
ground, and are pretty much going to mess up your response time.

If I want to report progress, I tend to do it by using PostMessage to send messages to the
main GUI thread to provide the necessary updates. I have to be careful to avoid thread
queue saturation, but that is usually fairly straightforward.
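
One common way to keep the queue from saturating, sketched here as an assumption rather than as the method used above, is to allow only one update notification to be outstanding at a time. CoalescingNotifier is a hypothetical class, and its post callback stands in for something like a PostMessage to the main window. Workers call Notify as often as they like; a message is actually posted only if the previous one has been acknowledged by the GUI thread.

```cpp
#include <atomic>
#include <functional>
#include <utility>

// Allows at most one update notification to be outstanding at a time.
class CoalescingNotifier {
public:
    // `post` is a hypothetical stand-in for posting a message to the GUI.
    explicit CoalescingNotifier(std::function<void()> post)
        : post_(std::move(post)) {}

    // Called by worker threads. Posts only if no notification is pending.
    void Notify() {
        if (!pending_.exchange(true, std::memory_order_acq_rel))
            post_();
    }

    // Called by the GUI thread's handler before it reads the latest state,
    // so that any later change triggers a fresh notification.
    void Acknowledge() { pending_.store(false, std::memory_order_release); }

private:
    std::function<void()> post_;
    std::atomic<bool> pending_{false};
};
```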

If you are not suffering from thread queue saturation or timer message delays, then you
are simply suffering from thread starvation. Are you playing with thread priorities at
all? (This is dangerous, and one of the primary causes of thread starvation).

Note also that even lower than the WM_TIMER messages are the WM_PAINT messages that force
actual repainting; they happen only if there is truly nothing else going on, including
other timer messages. Again, thread queue saturation will be a big culprit here.
UpdateWindow calls will force the WM_PAINT messages to be handled for a specific window.
joe