I'm not sure the cache IS a performance bottleneck, but it is an hypothesis to consider.
If the threads are compute-bound, and tight loops, the interlocking and cache update could
cost you a massive number of cycles. For example, ignoring conflict resolution, you might
lose 200 clock cycles per Interlocked operation. Eight threads contending for the same
counter at the same time could then add 1600 clock cycles overall. To each loop, in the
worst case, since each time through the loop you can assume maximum contention.
Note that the overhead of doing a modulo-N counter (particular if N is a
compile-time-constant) is VASTLY less overhead! (The division by a constant will most
likely be done by a multiplication, which is one cycle. See, for example,
http://www.flounder.com/multiplicative_inverse.htm ). So if you think the modulo-N
computation is expensive, you need to reconsider. Generally, a programmer's first guess
and performance is usually wrong unless you are a serious student of compiler technology
and an expert on the underlying architecture (if, for example, you don't know how a
pipelined superscalar with instruction prefetch, branch-prediction, speculative execution,
and hardware-managed cache coherency really works, you can't really say much about why one
code sequence might be better or worse than another). A multiply-compare-conditional
branch is probably one the order of 3-5 CPU cycles (that is, < 2ns, perhaps typically 1ns)
but an InterlockedIncrement is going to be somewhere between 100 and 200 CPU cycles
(30-60ns), so is going to be an order of magnitude (or two!) slower. (And if you say "I
looked at the code and it is really doing a DIV/IDIV instruction" you have better be
looking at the RELEASE code, not the DEBUG code; only then will I admit that the modulo-N
might be comparable to the interlocked overhead).
Your idea of having one counter per thread (not interlocked) and having the GUI thread
read the counters and sum them would be a good alternative.
Note that you don't need locking at all, because once every few seconds the GUI thread
wakes up, and if gets a "stale" value the count is off by at most 1, which probably won't
matter in the Great Scheme Of Things (I presume the counters, because of the frequency of
update, are going to have relatively large values in them). Since each counter is only
*modified* by at most one thread, no locking is necessary at all.
Note that you want the counters DWORD-aligned, so the following would be a Really Bad
Structure:
#pragma pack(1) // the default!
class whatever {
char ch;
DWORD counter;
... stuff
};
Interlocked operations will not work correctly unless the values are
sizeof(*target)-aligned.
joe
Post by WoodyPost by Joseph M. NewcomerPost by Joseph M. NewcomerIt is not clear why you would need to send a message on every update; for example, in
computing progress bars, I send one message every 1/100 of the computation, so I only send
100 update messages
I am not sending any messages; I'm incrementing a few integer
counters. If I were to do something once every N steps, I would have
to have a few instructions to count to N (i e, determine when we have
reached 1% of the computation). The counters themselves are mostly
written by multiple threads, so they need to be protected by
interlocking operations. You have a point that there may be
competition for these, causing cache writethrough, and I suppose I
could give each thread its own counter, and sum these in the GUI.
The GUI reads the counters every few seconds, so the overhead is
trivial.
Post by Joseph M. NewcomerInterlocked operations are surprisingly expensive. For example, if all eight threads are
trying to update the same variable, they will each be forced to wait for the update of
other threads to complete. That is the InterlockedAdd, executed by eight threads, will
force a lockstep at the hardware level. This will force all the threads to wait for all
the other threads.
I don't think this is what InterlockedAdd or -Increment does. It just
uses the hw to guarantee that the read-add-write is atomic. There is
no requirement that the instruction be executed in order, so this by
itself would not block other threads.
Post by Joseph M. NewcomerNote that if there is no other message traffic, WM_PAINT will be immediate.
But only when the GUI thread runs, and if that is not occurring
regularly due to the other issues you raised, its execution could be
delayed.
Joseph M. Newcomer [MVP]
email: ***@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm