This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Cygwin multithreading performance


Corinna Vinschen wrote:
On Dec  8 16:34, Corinna Vinschen wrote:
On Dec  8 02:51, Mark Geisert wrote:
   caller 0x018014CC77, count      1, L, /oss/src/winsup/cygwin/thread.cc:475
   caller 0x018014CD00, count      1, U, /oss/src/winsup/cygwin/thread.cc:496
   caller 0x018014CDAF, count    432, L, /oss/src/winsup/cygwin/thread.cc:971
   caller 0x018014CDE6, count    432, U, /oss/src/winsup/cygwin/thread.cc:982
   caller 0x018014D07E, count      1, L, /oss/src/winsup/cygwin/thread.cc:1946
   caller 0x018014D090, count      1, U, /oss/src/winsup/cygwin/thread.cc:1951
   caller 0x018014D7E6, count      1, L, /oss/src/winsup/cygwin/thread.cc:525
   caller 0x018014D7FF, count      1, U, /oss/src/winsup/cygwin/thread.cc:533
   caller 0x018014EDD7, count      1, U, /oss/src/winsup/cygwin/thread.cc:2400
   caller 0x018014EE97, count      1, L, /oss/src/winsup/cygwin/thread.cc:2389

This is interesting.  I'm not sure whether anything in the rest of the
output shows how much time is wasted on the above two calls, though.

thread.cc:971 and thread.cc:982 are pthread_setcancelstate, and it's
called pretty often as part of stdio functions.  Every stdio function
which has to lock the FILE structure also calls pthread_setcancelstate
to disable and re-enable cancellation before and after locking.  That's
almost every stdio function.

This may be one of the problems which lower performance, but there's no
easy or quick way around that, AFAICS.

There's also the fact that pthread_setcancelstate is still called
unconditionally, even for tools using __fsetlocking to disable stdio
locking.  The question here is whether that's wrong and
pthread_setcancelstate should be skipped if the application sets
FSETLOCKING_BYCALLER.
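For reference, this is how an application opts out of internal stdio
locking (a minimal sketch):

#include <stdio.h>
#include <stdio_ext.h>

void
use_caller_locking (FILE *fp)
{
  /* After this, serializing access to fp is the caller's job; the
     stdio internals no longer take the FILE lock, but per the above
     they still toggle the cancellation state. */
  __fsetlocking (fp, FSETLOCKING_BYCALLER);
}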

For a start, I simply removed the mutex lock/unlock in calls to
pthread_setcancelstate and pthread_setcanceltype.  These locks are
completely unnecessary.  These functions are only called for the current
thread anyway.
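The change boils down to something like this sketch -- simplified, not
the actual thread.cc code; the struct and the thread-local here stand
in for Cygwin's per-thread pthread object:

#include <errno.h>
#include <pthread.h>

struct sketch_thread { int cancelstate; };
static __thread struct sketch_thread self_tcb = { PTHREAD_CANCEL_ENABLE };

int
sketch_setcancelstate (int state, int *oldstate)
{
  struct sketch_thread *self = &self_tcb;

  if (state != PTHREAD_CANCEL_ENABLE && state != PTHREAD_CANCEL_DISABLE)
    return EINVAL;

  /* No mutex lock/unlock here: nothing but the calling thread ever
     touches its own cancelstate, so serializing it bought nothing. */
  if (oldstate)
    *oldstate = self->cancelstate;
  self->cancelstate = state;
  return 0;
}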

I'm just creating a developer snapshot which I'll upload to
https://cygwin.com/snapshots/ in half an hour at the latest.  Please
check whether your testcase behaves better now.

I'll give the snapshot a try, thanks, but it's hard for me to see :) how eliminating a few hundred locks and unlocks, 75% of which complete in under 1 usec anyway, will help much. I may have unduly influenced anybody following this by showing detailed stats only for the last two locks, which I picked at random.

What I'm more focused on are the locks that git uses heavily and that the four git threads in the OP's testcase contend for heavily. Here are the caller stats for this set of locks...

lock latency buckets:                            <=1   <=10  <=100 <=1000  >1000
mtx  4908/06002A00F0 lks 155066, ulks 155066,  66359   4395  78895   4742    675
  caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
  caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
  caller 0x010045557F, count     15, U, /usr/src/git/builtin/pack-objects.c:1760
  caller 0x010045588D, count     15, L, /usr/src/git/builtin/pack-objects.c:1944
  caller 0x01004558AC, count     15, U, /usr/src/git/builtin/pack-objects.c:1947
  caller 0x01004598AC, count     15, L, /usr/src/git/builtin/pack-objects.c:2039
  caller 0x01004599B9, count     11, U, /usr/src/git/builtin/pack-objects.c:2078
  caller 0x0100459A7B, count      4, U, /usr/src/git/builtin/pack-objects.c:2078
  caller 0x018014DB4F, count    244, U, /oss/src/winsup/cygwin/thread.cc:1254
  caller 0x018014DB93, count     15, L, /oss/src/winsup/cygwin/thread.cc:1286
  caller 0x018014DBD3, count    229, L, /oss/src/winsup/cygwin/thread.cc:1286

mtx  4908/0600FA6860 lks 154745, ulks 154745,  56092   3217  64883  25435   5118
  caller 0x0100455659, count   1011, L, /usr/src/git/builtin/pack-objects.c:1640
  caller 0x0100455687, count   1011, U, /usr/src/git/builtin/pack-objects.c:1642
  caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
  caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630

mtx  4908/0603498600 lks 316825, ulks 316825, 146254   4986 155345   9686    554
  caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
  caller 0x01004552C4, count  15148, U, /usr/src/git/builtin/pack-objects.c:1705
  caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
  caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
  caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837

If you look at the ">1000" column, you see that these 3 locks account for 675+5118+554 = 6347 acquisitions that each took at least 1000 usec, i.e. at least 6.347 sec spent waiting for these locks, best case, from just this one latency bucket. The next column to the left (a faster bucket, for latencies from 101 usec up to 1000 usec) has 4742+25435+9686 = 39863 acquisitions, times a best case of 101 usec each, giving another 4.026 sec spent waiting for these locks.

I probably need to modify my stats-gathering program to display the total time spent waiting, to make it clearer how things add up for each mutex across the whole testcase. Again, these stats are not timing how long locks are held, but rather how long it takes just to acquire them.
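Something like the following sketch is what I have in mind for that
total -- the bucket floors and the struct layout are assumptions for
illustration, not what the tool currently records:

#include <stdio.h>

struct mtx_stats {
  const char   *name;
  unsigned long bucket[5];   /* <=1, <=10, <=100, <=1000, >1000 usec */
};

/* Lower bound on total acquisition time: count each event at the
   lower edge of its latency bucket. */
static unsigned long long
min_wait_usec (const struct mtx_stats *m)
{
  static const unsigned long floor_usec[5] = { 0, 2, 11, 101, 1001 };
  unsigned long long total = 0;
  for (int i = 0; i < 5; i++)
    total += (unsigned long long) m->bucket[i] * floor_usec[i];
  return total;
}

int
main (void)
{
  /* the three contended pack-objects mutexes shown above */
  struct mtx_stats locks[] = {
    { "06002A00F0", {  66359, 4395,  78895,  4742,  675 } },
    { "0600FA6860", {  56092, 3217,  64883, 25435, 5118 } },
    { "0603498600", { 146254, 4986, 155345,  9686,  554 } },
  };
  for (int i = 0; i < 3; i++)
    printf ("mtx %s: waited at least %.3f sec acquiring\n",
            locks[i].name, min_wait_usec (&locks[i]) / 1e6);
  return 0;
}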

It may be, when all is said and done, that this is the price one pays for POSIX-compliant pthread mutexes. It just kind of bugs me that the testcase works fine under MinGW using MS CriticalSection ops, but we don't seem to have anything similarly light-weight in our pthread_mutex implementation. Or I'm missing it :).
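For comparison, this is the kind of primitive the MinGW build ends up
on: a CRITICAL_SECTION spins briefly in user space and only falls back
to a kernel wait under real contention. Just a sketch -- the spin count
is an example value, not what git actually configures:

#include <windows.h>

static CRITICAL_SECTION cache_lock;

void
cache_lock_init (void)
{
  /* spin ~4000 iterations before sleeping in the kernel */
  InitializeCriticalSectionAndSpinCount (&cache_lock, 4000);
}

void
cache_update (void)
{
  EnterCriticalSection (&cache_lock);
  /* ... touch the shared data ... */
  LeaveCriticalSection (&cache_lock);
}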
Thanks for reading,

..mark

