Squid3-trunk crashed with the message:

pure virtual method called
terminate called without an active exception
Aborted (core dumped)

Backtrace:

(gdb) backtrace
#0  0xb7fa7430 in __kernel_vsyscall ()
#1  0xb79a3880 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xb79a5248 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xb7bc86f8 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4  0xb7bc65d5 in ?? () from /usr/lib/libstdc++.so.6
#5  0xb7bc6612 in std::terminate () from /usr/lib/libstdc++.so.6
#6  0xb7bc72b5 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7  0x081f2a05 in JobDialer (this=0xbf9a8270, aJob=0x8df64d8) at ICAP/AsyncJob.cc:172
#8  0x08240f27 in NullaryMemFunT (this=0xbf9a8270, anObject=0x8df64b0, aMethod=&virtual table offset 8) at ../../src/AsyncJobCalls.h:33
#9  0x08240f88 in MemFun<Adaptation::Initiate> (object=0x8df64b0, method=&virtual table offset 8) at ../../src/AsyncJobCalls.h:76
#10 0x08240d2b in Adaptation::Initiator::announceInitiatorAbort (this=0x8dde4f8, x=@0x8dde85c) at Initiator.cc:28
#11 0x0810f5af in ~ClientHttpRequest (this=0x8dde4f8) at client_side_request.cc:263
#12 0x080faad7 in ~ClientSocketContext (this=0xb6faf0c0) at client_side.cc:256
#13 0x080fb088 in ConnStateData::freeAllContexts (this=0x8dd5e04) at ../include/RefCount.h:100
#14 0x080fd1ec in ConnStateData::swanSong (this=0x8dd5e04) at client_side.cc:617
#15 0x081f3df1 in AsyncJob::callEnd (this=0x8dd5ee4) at ICAP/AsyncJob.cc:135
#16 0x081f30c9 in JobDialer::dial (this=0x8beced4, call=@0x8beceb8) at ICAP/AsyncJob.cc:222
#17 0x08106958 in AsyncCallT<CommCbMemFunT<ConnStateData, CommCloseCbParams> >::fire (this=0x8beceb8) at AsyncCall.h:129
#18 0x080e2e26 in AsyncCall::make (this=0x8beceb8) at AsyncCall.cc:34
#19 0x080e2176 in AsyncCallQueue::fireNext (this=0x8b9ee50) at AsyncCallQueue.cc:53
#20 0x080e2320 in AsyncCallQueue::fire (this=0x8b9ee50) at AsyncCallQueue.cc:39
#21 0x0814bb47 in EventLoop::runOnce (this=0xbf9a8514) at EventLoop.cc:131
#22 0x0814bc10 in EventLoop::run (this=0xbf9a8514) at EventLoop.cc:95
#23 0x081971a3 in main (argc=5, argv=0xbf9a8604) at main.cc:1350
Some more info:

(gdb) frame 7
#7  0x081f2a05 in JobDialer (this=0xbf9a8270, aJob=0x8df64d8) at ICAP/AsyncJob.cc:172
172         lock = cbdataReference(aJob->toCbdata());
(gdb) print *aJob
$1 = {_vptr.AsyncJob = 0x82cee28, stopReason = 0x0,
     typeName = 0x8288052 "ICAPModXactLauncher", inCall = {p_ = 0x0},
     id = 22958, static TheLastId = 22959}
Created attachment 1898 [details]
a patch solving the problem

Inside the Adaptation::Initiator::announceInitiatorAbort method, the x == NULL check is not enough. We must also check whether the x variable (of type Initiate) is cbdata-valid.
Applied to trunk
Applied to Squid 3.1
I think the patch masks the problem but does not solve it. We should not be testing whether x is cbdata-valid here unless we want to test it everywhere (and we do not want that!).

Adaptation::Initiator::announceInitiatorAbort does not dereference x and, hence, should not care whether x is valid.

Can you find which function or macro is dereferencing x and fix _that_ code, please?

Thank you.
(In reply to comment #5)
> I think the patch masks the problem but does not solve it. We should not be
> testing whether x is cbdata-valid here unless we want to test it everywhere
> (and we do not want that!).

OK.

> Adaptation::Initiator::announceInitiatorAbort does not dereference x and,
> hence, should not care whether x is valid.

OK, I understand.

> Can you find which function or macro is dereferencing x and fix _that_ code,
> please?

I will try. I need to find the test case again.
(In reply to comment #5)
> Adaptation::Initiator::announceInitiatorAbort does not dereference x and,
> hence, should not care whether x is valid.

However, this is not entirely true. It calls Adaptation::Initiator::clearAdaptation, which does a cbdataReferenceDone(x).
(In reply to comment #7)
> However, this is not entirely true. It calls
> Adaptation::Initiator::clearAdaptation, which does a cbdataReferenceDone(x).

Again, we should not check what we do not use. Adaptation::Initiator::announceInitiatorAbort does not use *x and, hence, should not care whether the object pointed to by x is valid.

Now, you are right that Adaptation::Initiator::clearAdaptation calls cbdataReferenceDone(x). But clearAdaptation does not dereference x either!

I am trying to prevent two distinct things from happening here:

1) Somebody looks at the patched code and decides that they always need to check whether a pointer to the job is valid before sending a message to that job. A year later, somebody else looks at all those checks and decides that when the message is being sent, the job must be valid, and writes code with that incorrect assumption in mind. Two years after that, the AsyncCalls would need to be replaced because the old code has become unusable due to a mix of wrong assumptions.

2) There is clearly some buggy code that dereferences a job pointer without checking its validity. The proposed fix masks that bug. The bug will remain in the code and cause more problems down the road. In fact, there could be a bigger design problem lurking here (see my rant below).

My unverified suspicion is that the problem is with the AsyncJob::toCbdata() hack. We should not access the virtual x->toCbdata() if x is invalid, but we need that access to check whether x is invalid. I think this has been discussed somewhere. I need to research it, but do not let that block you!

BTW, in the committed fix, how do you know that "(void*)x" is a cbdata pointer? Could it point to some other part of the cbdata-protected object? This was the reason we added toCbdata(), IIRC.
(In reply to comment #8)
> My unverified suspicion is that the problem is with the AsyncJob::toCbdata()
> hack. We should not access the virtual x->toCbdata() if x is invalid, but we
> need that access to check whether x is invalid. I think this has been
> discussed somewhere. I need to research it, but do not let that block you!

I think this was the first thing I checked when I looked into this bug, but I did not find anything wrong with cbdata.

> BTW, in the committed fix, how do you know that "(void*)x" is a cbdata
> pointer? Could it point to some other part of the cbdata-protected object?
> This was the reason we added toCbdata(), IIRC.

I cannot answer this question right now, but as far as I remember, I checked the related cases carefully. I remembered the problems we had with cbdata and the reason we added toCbdata(), and this was my first thought about the problem (I panicked). The cbdata pointer passed was the correct pointer; just the data was corrupted. But maybe I am missing something. And even if the problem is just "cbdata that is no longer valid", it is good to know the exact reason.
If the error was caused by cbdataReference(aJob->toCbdata()) and the fix is checking, essentially, cbdataReferenceValid(aJob), then the fix may not just mask a bug but be a bug in itself, because aJob may not be a cbdata pointer. Another possibility is that x is not a job pointer, of course. Either way, the combination of the above two calls is invalid.

Can you reproduce this bug on demand?
(In reply to comment #10)
> Can you reproduce this bug on demand?

Nope. It kept happening again and again while browsing one site. I cannot reproduce it any more by browsing the same site. Maybe it was related to ICAP server problems. Hmm...
I cannot reproduce this bug. I think it is still here, because I hit an assertion I put in Adaptation::Initiator::announceInitiatorAbort, but only once, and the core file was bad, with no debug info. I did not keep the debug info when I discovered this bug; I believed it would be easy to reproduce (@@#$&^%#@?!)...

What should we do now? Is it OK to add an assertion check in ClientHttpRequest::~ClientHttpRequest() just before we call the announceInitiatorAbort method? The "pure virtual method called" message is bad...
(In reply to comment #12)
> What should we do now? Is it OK to add an assertion check in
> ClientHttpRequest::~ClientHttpRequest() just before we call the
> announceInitiatorAbort method?
> The "pure virtual method called" message is bad...

Adding a correct assertion is a good idea, of course. Just do not add it to the function that does not dereference x; add it to the code that does. Perhaps it will even be triggered sooner there.

It would be terrific if you could verify that we always:

1) use foo->toCbdata() when converting an AsyncJob to a cbdata-protected void* pointer, and

2) use static_cast<AsyncJob*>(cbdata) when converting a cbdata-protected void* pointer to a job.

It is probably a good idea to add a static AsyncJob::FromCbdata(void*) method for this purpose, BTW (see the sketch below). The same static method could check the void* pointer's validity before the conversion and throw or return NULL if the pointer is invalid.

If the above two rules are always followed, we should not run into the bug reported here. As you can guess from the above discussion, I am worried that in the current code,

a) x is not always x->toCbdata(), or
b) a void* pointer is converted to an AsyncJob child without going through AsyncJob, or
c) a void* pointer is converted to an AsyncJob child when the pointer is not cbdata-valid.

HTH.
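For illustration, here is a minimal sketch of what such a FromCbdata() helper could look like, following rules 1) and 2) above. This is a proposal sketch, not existing Squid code; the cbdataReferenceValid() prototype is declared here only to keep the sketch self-contained.

#include <cstddef>

int cbdataReferenceValid(const void *p); // provided by Squid's cbdata API

class AsyncJob
{
public:
    virtual ~AsyncJob() {}

    // rule 1: job -> void* conversions go through this virtual method
    virtual void *toCbdata() = 0;

    // rule 2: void* -> job conversions go through the base class,
    // validating the pointer before the cast
    static AsyncJob *FromCbdata(void *cbdata)
    {
        if (!cbdata || !cbdataReferenceValid(cbdata))
            return NULL; // or throw, as suggested above
        return static_cast<AsyncJob*>(cbdata);
    }
};

Centralizing both conversions in one place would at least make violations of rules a)-c) above easy to grep for.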
Created attachment 1925 [details]
patch moving the cbdata validation from Adaptation::Initiator::announceInitiatorAbort to the caller

This patch removes the cbdata validation from the Adaptation::Initiator::announceInitiatorAbort method and adds a related check inside the ClientHttpRequest::~ClientHttpRequest destructor, just before it calls the Adaptation::Initiator::announceInitiatorAbort method.
(In reply to comment #13)
> Adding a correct assertion is a good idea, of course. Just do not add it to
> the function that does not dereference x; add it to the code that does.
> Perhaps it will even be triggered sooner there.

OK, the patch in attachment (id=1925) does that.

> [...] As you can guess from the above discussion, I am worried that in the
> current code,
>
> a) x is not always x->toCbdata(), or

I do not think that this was the problem. I am sure I tested it, and x was x->toCbdata()...

> b) a void* pointer is converted to an AsyncJob child without going through
> AsyncJob, or
> c) a void* pointer is converted to an AsyncJob child when the pointer is not
> cbdata-valid.

Looks possible. But it is very difficult for me to check without reproducing the bug. This bug disappeared and I have not hit it again. I am not planning to work more on this bug, at least until I can reproduce it.
The patch in attachment (id=1925) applied to trunk
(In reply to comment #16)
> The patch in attachment (id=1925) applied to trunk

I [still] disagree with this patch because it asserts something that does not have to be true. The virginHeadSource job pointer may become cbdata-invalid if that job ends before we have a chance to call it back. Why are we asserting that it is valid?

It is possible that JobDialer needs to be fixed to avoid calling cbdataReference in its constructor if aJob is cbdata-invalid:

JobDialer::JobDialer(AsyncJob *aJob): job(NULL), lock(NULL)
{
    if (aJob) {
        lock = cbdataReference(aJob->toCbdata());
        job = aJob;
    }
}

However, this brings us back to my original concern: if a job pointer becomes cbdata-invalid, we cannot call aJob->toCbdata(). On the other hand, we must call aJob->toCbdata() to find out whether the job pointer is cbdata-valid.

Why did you apply the patch?

Thank you,
Alex.
(In reply to comment #17)
> I [still] disagree with this patch because it asserts something that does
> not have to be true. The virginHeadSource job pointer may become
> cbdata-invalid if that job ends before we have a chance to call it back.
> Why are we asserting that it is valid?
> [...]
> Why did you apply the patch?

The first patch, which was not correct, was never removed from trunk. So I removed it, and I just added the assertion to get a better crash than the "pure virtual method called" message. OK, I will remove the assertion too.
Reverted from 3.1
I believe Christos has inadvertently confirmed my fears while working on an ICAP project. Here is the summary:

AsyncJob::callEnd deletes the job if the job is completed. If there are no cbdataReference "locks" on the job pointer, the job state (including cbdata!) is destroyed. After that, calling the job's methods or even dereferencing the job pointer becomes unsafe.

We should always hold a lock on the job (via cbdataReference) if we want to communicate with the job in the future. We cannot just store a job pointer. In fact, pointers to jobs in general code should be outlawed. We cannot safely call job->cbdata() if the job has been freed and, hence, we cannot know the job's cbdata pointer! We have to store both the cbdata pointer and the job pointer to be able to communicate with the job. This is very similar to what JobDialer already does, but we cannot create a JobDialer when the transaction may have already been destroyed! We must create it (or a similar class) _before_ that happens.

The right general solution here may be a smart JobPointer class that solves this problem by storing an AsyncJob pointer and a cbdata pointer, just like JobDialer does. Regular job pointers must be replaced with JobPointers (see the sketch below).
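A minimal sketch of the JobPointer idea, for illustration only. The cbdata functions below are hypothetical stand-ins for Squid's cbdataReference()/cbdataReferenceDone() macros and cbdataReferenceValid(); the class name and details are likewise illustrative, not the committed fix.

#include <cstddef>

// Hypothetical stand-ins for Squid's cbdata locking API:
void *cbdataLock(void *p);               // like cbdataReference()
void cbdataUnlock(void *p);              // like cbdataReferenceDone()
int cbdataReferenceValid(const void *p);

template <class Job>
class JobPointer
{
public:
    explicit JobPointer(Job *aJob = NULL): job(NULL), lock(NULL)
    {
        if (aJob) {
            // grab the cbdata lock while the job is still alive,
            // just like JobDialer does in its constructor
            lock = cbdataLock(aJob->toCbdata());
            job = aJob;
        }
    }

    ~JobPointer()
    {
        if (lock)
            cbdataUnlock(lock);
    }

    // the raw job pointer may be used only while the lock is still valid
    Job *valid() const { return (lock && cbdataReferenceValid(lock)) ? job : NULL; }

private:
    JobPointer(const JobPointer &);            // not copyable in this sketch
    JobPointer &operator=(const JobPointer &);

    Job *job;   // may dangle once the job is deleted
    void *lock; // keeps the cbdata shell alive so validity can still be checked
};

The key point is that the lock is taken while the job is known to be alive, so later validity checks never have to call the virtual toCbdata() on a possibly-freed object.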
> We cannot safely call job->cbdata() if the job has been freed
> and, hence, we cannot know the job's cbdata pointer! We have to store
> both the cbdata pointer and the job pointer to be able to communicate
> with the job.

If cbdata() is a non-virtual method, it can be called safely at any time; it would simply return the value of the pointer, and validation can be done as needed.
(In reply to comment #21)
> If cbdata() is a non-virtual method, it can be called safely at any time;
> it would simply return the value of the pointer, and validation can be done
> as needed.

cbdata() is virtual.
(In reply to comment #22)
> cbdata() is virtual.

Just wanted to clarify, in case somebody tries to use this bug report as a piece of documentation :-)

AsyncJob::toCbdata() is virtual. It returns the pointer to the object's cbdata. That cbdata pointer is not always the same as the "this" pointer, due to multiple inheritance.

AsyncJob is not a CBDATA_CLASS, but its kids are. Thus, AsyncJob kids get a virtual toCbdata() even though the CBDATA_CLASS macros do not explicitly declare toCbdata() as virtual. Any specific job class must be a CBDATA_CLASS so that the right cbdata-specific new/delete methods are called, but I am not sure that is enforced.

All this ugliness is meant to work around the incompatibilities between a C hack (cbdata) and C++ multiple inheritance. Cbdata-compatible improvement suggestions are very welcome. Hopefully, we will improve or replace the cbdata API in Squid v3.3 to avoid this mess.
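A self-contained illustration of the "cbdata pointer is not always the same as this" point, using purely hypothetical classes (not the real Squid hierarchy): with multiple inheritance, a pointer to a base subobject generally differs from the address of the full object, so only a virtual call into the most-derived class can report the full-object (cbdata) address.

#include <cstdio>

struct OtherBase {
    virtual ~OtherBase() {}
    int member;
};

struct JobBase { // plays the role of AsyncJob
    virtual ~JobBase() {}
    virtual void *toCbdata() = 0;
};

struct JobChild: public OtherBase, public JobBase { // role of a CBDATA_CLASS kid
    virtual void *toCbdata() { return this; } // "this" is the full object here
};

int main()
{
    JobChild child;
    JobBase *job = &child;
    // the base-subobject pointer and the full-object pointer differ:
    std::printf("job      = %p\n", static_cast<void*>(job));
    std::printf("toCbdata = %p\n", job->toCbdata());
    return 0;
}

This is also why treating a plain "(void*)x" as the cbdata pointer is dangerous, as noted in comment #8: the two addresses are not interchangeable.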
Is there any reason we can't use a template rather than subclassing to get the specialized cbdata information?
(In reply to comment #24)
> Is there any reason we can't use a template rather than subclassing to get
> the specialized cbdata information?

Sorry, I am not sure what you mean. Which subclasses do you want to replace, and with what template?
Looks like this bug is still in v3.1: I saw a few recent crashes with similar symptoms. I am working on a fix. I tried to assign the bug to myself but it looks like the new bugzilla makes that impossible.
(In reply to comment #26)
> Looks like this bug is still in v3.1: I saw a few recent crashes with
> similar symptoms. I am working on a fix.

We are doing performance testing of Squid and our ICAP server, and this problem occurs from time to time. We use LoadRunner to send the same request to Squid at up to 2500 tps. The bug can be reproduced within hours. Looking forward to the fix.
(In reply to comment #27)
> The bug can be reproduced within hours.
> Looking forward to the fix.

Would you be interested in and able to test preliminary patches for this bug? I will post them here if you will test. Thank you.
(In reply to comment #28)
> Would you be interested in and able to test preliminary patches for this
> bug? I will post them here if you will test. Thank you.

It would be a pleasure to help test.
Created attachment 2052 [details]
take5

The attached patch is our current diff against Squid3 trunk. It might apply to Squid v3.1. You should bootstrap because we added one header file. These changes have not been tested yet, so the patch may crash your Squid immediately.

Please keep me posted on your testing progress.
(In reply to comment #30)
> The attached patch is our current diff against Squid3 trunk. It might apply
> to Squid v3.1. You should bootstrap because we added one header file. These
> changes have not been tested yet, so the patch may crash your Squid
> immediately.
> Please keep me posted on your testing progress.

OK. I'll apply it to 3.1.0.13 and post the test results here.
(In reply to comment #31)
> OK. I'll apply it to 3.1.0.13 and post the test results here.

It crashed immediately after receiving a request:

Segmentation fault

Here is the last part of cache.log:

AcceptFD::acceptOne accepted: FD 13 newfd: 9 from: 150.236.80.251:5945 handler: SomeCommAcceptHandler(FD -1, data=0x9e5f0d8)
2009/09/23 12:03:01.756| comm.cc(2324) will call SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9) [call5]
2009/09/23 12:03:01.756| entering SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9)
2009/09/23 12:03:01.756| AsyncCall.cc(32) make: make call SomeCommAcceptHandler [call5]
2009/09/23 12:03:01.756| comm_accept: FD 13 handler: 0x80dcd00
2009/09/23 12:03:01.756| The AsyncCall SomeCommAcceptHandler constructed, this=0xa117d40 [call20]
2009/09/23 12:03:01.756| comm.cc(2289) acceptOne: try later: FD 13 handler: SomeCommAcceptHandler(FD -1, data=0x9e5f0d8)
2009/09/23 12:03:01.756| httpAccept: FD 9: accepted
2009/09/23 12:03:01.756| AsyncJob of type ConnStateData constructed, this=0xa117f6c [async1]
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::connStateClosed constructed, this=0xa11be50 [call21]
2009/09/23 12:03:01.756| comm_add_close_handler: FD 9, AsyncCall=0xa11be50*1
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::requestTimeout constructed, this=0xa11bea0 [call22]
2009/09/23 12:03:01.756| comm.cc(1185) commSetTimeout: FD 9 timeout 900
2009/09/23 12:03:01.756| clientReadSomeData: FD 9: reading request...
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::clientReadRequest constructed, this=0xa11bef0 [call23]
2009/09/23 12:03:01.756| comm_read, queueing read for FD 9; asynCall 0xa11bef0*1
2009/09/23 12:03:01.756| leaving SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9)
2009/09/23 12:03:01.757| comm_read_try: FD 9, size 4095, retval 574, errno 0
2009/09/23 12:03:01.757| commio_finish_callback: called for FD 9 (0, 0)
2009/09/23 12:03:01.757| comm.cc(163) will call ConnStateData::clientReadRequest(FD 9, data=0xa117e58, size=574, buf=0xb7e0d008) [call23]
2009/09/23 12:03:01.757| entering ConnStateData::clientReadRequest(FD 9, data=0xa117e58, size=574, buf=0xb7e0d008)
2009/09/23 12:03:01.757| AsyncCall.cc(32) make: make call ConnStateData::clientReadRequest [call23
Thank you for testing! There will be a better version of the patch available within 24 hours. Did the patch apply cleanly to v3.1.0.13?
(In reply to comment #33)
> Thank you for testing! There will be a better version of the patch available
> within 24 hours.
> Did the patch apply cleanly to v3.1.0.13?

Yes. I applied it to squid-3.1.0.13-20090917.
Created attachment 2053 [details]
take6

This patch against trunk r9997 survived some mild abuse, but more testing is needed. Bootstrapping is still required. Please test.
(In reply to comment #35)
> This patch against trunk r9997 survived some mild abuse, but more testing is
> needed. Bootstrapping is still required. Please test.

Ok.
(In reply to comment #36)

I applied the new patch to squid 3.1.0.13 (the release candidate released on 4th August) and did some basic testing. So far the results are good.

We plan to do performance testing on Sunday, since our new test environment is not ready until then. Hopefully we'll get performance results next week, and we'll keep you updated. Thanks.
(In reply to comment #37)
> I applied the new patch to squid 3.1.0.13 (the release candidate released on
> 4th August) and did some basic testing. So far the results are good.
> We plan to do performance testing on Sunday, since our new test environment
> is not ready until then.

I will be in charge of the performance testing and will update you with the performance results.
(In reply to comment #35)
> Created attachment 2053 [details]
> take6

We have applied this patch to squid-3.1.0.13 and tested it. It did not crash, but we see some ICAP !broken() exceptions in the log during performance tests at up to 400 tps. The log is as follows:

2009/09/30 12:23:55.845| entering AsyncJob::noteStart()
2009/09/30 12:23:55.845| AsyncCall.cc(32) make: make call AsyncJob::noteStart [call832772]
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact status in: [G/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(1541) estimateVirginBody: does not expect virgin body
2009/09/30 12:23:55.845| ModXact.cc(98) waitForService: will wait for the ICAP service [G/R icapx211440]
2009/09/30 12:23:55.845| The AsyncCall Adaptation::Icap::ModXact::noteServiceReady constructed, this=0xe09900 [call832967]
2009/09/30 12:23:55.845| ServiceRep.cc(201) callWhenReady: Adaptation::Icap::Service is asked to call Adaptation::Icap::ModXact::noteServiceReady() when ready [down,susp,stale,fail11]
2009/09/30 12:23:55.845| ../../src/base/AsyncJob.h(151) dial: AsyncJob::noteStart threw exception: !broken()
2009/09/30 12:23:55.845| Xaction.cc(438) setOutcome: ICAP_ERR_OTHER
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact will stop, reason: exception
2009/09/30 12:23:55.845| AsyncJob::noteStart() ends job [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(1095) swanSong: swan sings [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(443) stopWriting: will no longer write [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(550) stopSending: Enter stop sending
2009/09/30 12:23:55.845| ModXact.cc(553) stopSending: Proceed with stop sending
2009/09/30 12:23:55.845| ModXact.cc(565) stopSending: will not start sending [UG/Rw icapx211440]
2009/09/30 12:23:55.845| Initiate.cc(58) swanSong: swan sings [UG/RwS icapx211440]
2009/09/30 12:23:55.845| Initiate.cc(65) swanSong: swan sang [UG/RwS icapx211440]
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact destructed, this=0xf4a638 [icapx211440]
2009/09/30 12:23:55.845| AsyncJob.cc(139) callEnd: AsyncJob::noteStart() ended 0xf4abf0
2009/09/30 12:23:55.845| leaving AsyncJob::noteStart()

We ran the same performance test against the original squid-3.1.0.13 without the patch. It crashed, and we do not find these exception entries in its log. Is this exception a bug in this patch?

For now we are testing in a temporary test environment, and the tps cannot be very high. We will start testing in the normal test environment on Oct. 10th.
> Adaptation::Icap::Service is asked to call
> Adaptation::Icap::ModXact::noteServiceReady() when ready
> [down,susp,stale,fail11]
> 2009/09/30 12:23:55.845| ../../src/base/AsyncJob.h(151) dial:
> AsyncJob::noteStart threw exception: !broken()

The above smells like a regular exception associated with a down, suspended ICAP service. The service failed 11 times and got suspended because of those failures. ICAP transactions do not wait for suspended services (because there is usually no point in waiting: the service will remain unusable for some time).

I cannot be sure, but this problem does not look directly related to the patch, unless there were/are some other bugs that the fix in take6 exposed. Any signs of trouble on the ICAP server side? Do you have a longer log showing the ICAP OPTIONS exchange or other ICAP failures?

Thank you.
Created attachment 2059 [details]
log for squid-3.1.0.13 with the patch

In the attached log file, we can see 11 "cannot connect to the ICAP service" exceptions, after which the !broken() exception is thrown. So I think maybe it is a regular exception. We do not see these log entries with the original squid-3.1.0.13 because it dumps core before the ICAP service goes down. We will do more tests to watch for it. Thanks.
The ICAP exceptions in your log look correct to me. Your Squid is running out of file descriptors and, hence, has trouble opening new connections for ICAP requests:

comm_open: socket failure: (24) Too many open files
Xaction.cc(218) dieOnConnectionFailure: Adaptation::Icap::ModXact failed to connect to icap://150.236.238.164:1345/request

This goes on until the ICAP service is marked as down after 10 failures:

ServiceRep.cc(74) noteFailure: failure 11 out of 10 allowed [up,fail11]
2009/09/30 12:22:31.867| suspending ICAP service for too many failures

After that, new ICAP transactions throw !broken() exceptions, which is correct behavior:

dial: AsyncJob::noteStart threw exception: !broken()
Xaction.cc(438) setOutcome: ICAP_ERR_OTHER
Adaptation::Icap::ModXact will stop, reason: exception

This should stop after Squid tries to reach the ICAP service again and has enough descriptors to do it.

I would not be surprised if the "running out of descriptors" condition was the trigger for the bug we are trying to fix here, because it leads to ICAP transaction destruction during transaction startup. I bet the unpatched code cannot handle that reliably.
(In reply to comment #42)
> I would not be surprised if the "running out of descriptors" condition was
> the trigger for the bug we are trying to fix here, because it leads to ICAP
> transaction destruction during transaction startup. I bet the unpatched code
> cannot handle that reliably.

We have not hit the "pure virtual method called" crash after many performance tests. Thank you very much.
Created attachment 2065 [details]
fix for uncaught exception "l"

The attached patch should fix the following bug in patch take6:

15:42:40| dying from an unhandled exception: l
(In reply to comment #44)
> The attached patch should fix the following bug in patch take6:
>
> 15:42:40| dying from an unhandled exception: l

OK, I will apply it to our Squid.
I can still reproduce these crashes with 3.1.4. I have tried the patches (they can be made to apply with some tweaking) and the crashes disappear, but squid seems to be leaking memory rather badly with them.
Created attachment 2180 [details]
updated patch for 3.1.4

This was the patch I used.
Created attachment 2181 [details]
memory leak report from valgrind

This is the biggest memory leak reported by valgrind (and the only significant one reported as "definitely lost").
(In reply to comment #48)
> Created attachment 2181 [details]
> memory leak report from valgrind
>
> This is the biggest memory leak reported by valgrind (and the only
> significant one reported as "definitely lost").

Thank you for testing and updating the patch. Did you build Squid with --with-valgrind-debug to avoid false positives in valgrind reports?
(In reply to comment #49)
> Thank you for testing and updating the patch. Did you build Squid with
> --with-valgrind-debug to avoid false positives in valgrind reports?

Yes, I used --with-valgrind-debug.
AFAICT, your valgrind report says that the HttpRequest allocated inside Adaptation::Icap::ModXact::encapsulateHead() is leaking. That method calls both new and delete for the HttpRequest. The only way a leak is possible in between is if an exception is thrown, but that should not happen (yes, we should use auto_ptr here). Similar code is running in production without [significant] leaks, with ICAP enabled, but it is possible that it does not exercise the exact same code paths, of course.

Can you enable full debugging (debug_options ALL,9) and run Squid for a while so that the leak becomes obvious, posting both cache.log and the valgrind log (all records, not just the summary)? Thank you.
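To illustrate the auto_ptr remark above with a hypothetical, simplified example (this is not the actual encapsulateHead() code): the owned object is freed on every path out of the function, including an exception thrown between the new and the would-be delete.

#include <memory>

struct Head {
    void pack() { /* the real code can throw in a step like this */ }
};

void encapsulateHeadLike()
{
    std::auto_ptr<Head> head(new Head); // owns the allocation from here on
    head->pack();                       // a throw here no longer leaks *head
}   // scope exit, normal or via exception, replaces the explicit delete

(std::auto_ptr was the standard tool at the time; today std::unique_ptr would be used instead.)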
(In reply to comment #47)
> Created attachment 2180 [details]
> updated patch for 3.1.4
>
> This was the patch I used.

If you have a chance, please remove the irrelevant hunks from the updated patch. For example, there are a few large hunks comparing two .orig files.
Created attachment 2182 [details]
patch for 3.1.4 with the .orig files removed

> If you have a chance, please remove the irrelevant hunks from the updated
> patch. For example, there are a few large hunks comparing two .orig files.

Sorry, fixed.
Thanks for your quick response.

(In reply to comment #51)
> AFAICT, your valgrind report says that the HttpRequest allocated inside
> Adaptation::Icap::ModXact::encapsulateHead() is leaking. That method calls
> both new and delete for the HttpRequest. The only way a leak is possible in
> between is if an exception is thrown, but that should not happen (yes, we
> should use auto_ptr here).

Actually, I was fuzzing squid with requests that are aborted at various stages or that may be invalid. If I use only ab for the requests, squid does not leak. Also, the ICAP server was not running during the tests.

I will try to provide more verbose logging later.
Comment on attachment 2181 [details]
memory leak report from valgrind

It seems valgrind support was not active in squid while this report was created, because I didn't have valgrind installed at configure/compile time.
I tested the provided patch (the fixed version from sf) as follows:

Squid Cache: Version 3.1.4
configure options: '--prefix=/usr/local/squid3.1' '--enable-icap-client' '--with-default-user=proxy' '--with-large-files' '--enable-follow-x-forwarded-for' '--without-pthreads' '--enable-ltdl-convenience' '--enable-storeio=ufs,aufs' '--with-squid=/home/user/squid-3.1.4' --with-squid=/home/user/squid-3.1.4

The config file is mostly standard; ICAP is _not_ enabled. I have a test script that triggers the same requests as sf's (attached). I managed to trigger the 'pure virtual method' crash, and with squid 3.1.3 I managed to trigger bug #2896. If I run that script right now, I get (reliably) a premature exit with exit code 1.
Created attachment 2183 [details]
small script to trigger problems

A small, hacky script to trigger bug 2583. It sends a number of (possibly incomplete) requests to a webserver; any webserver should be fine. In my case it's an Apache 1.3 with a mod_perl extension that accepts any URL. You have to change the values of $host and $proxy in the script.
FWIW, I was unable to reproduce the problem using your script and the 3p1-rock lp branch that the original fixes came from. After many requests, the script quits with "now reading remote header" and "Alarm clock" messages on the console. Squid keeps running. Perhaps my Squid configuration, code, or environment differs too much.

Should any origin server (without any special configuration) work for your test cases? Does it need to have some PUT/POST-accepting CGI?

If I have the time, I will try stock v3.1.4. Meanwhile, you may want to post more details about the Squid exit in your environment (error messages, backtrace, etc.).
I could not reproduce the big leak from comment 48 so far. But the logs at http://www.sfritsch.de/~stf/squid/run2/ show some other leaks and some invalid memory accesses.

During the run, I stopped the ICAP server while squid was under load. This led to a significant increase in squid's VIRT and RSS memory sizes. Maybe the leaks shown by valgrind are related to the ICAP server being stopped.

Squid (or valgrind?) continued to use 100% CPU after the load stopped. It still had 114 connections to the origin server open, despite the client script never having created more than 75 connections in parallel. After some minutes, I stopped the origin server too. After 1-2 more minutes, squid's CPU usage went down to 0% again.

The first leak report in the valgrind output was produced by calling mgr:mem after squid had settled to 0% CPU. The second one was created by squid's shutdown.

I will try to create more logs.
(In reply to comment #58)
> Should any origin server (without any special configuration) work for your
> test cases? Does it need to have some PUT/POST-accepting CGI?

For me, squid leaks without the origin server having any special configuration/content; it returns 404 for all requests. I have another valgrind trace and cache.log at http://www.sfritsch.de/~stf/squid/run4/ . All interesting "definitely lost" records have HttpReply::operator new or HttpRequest::operator new in the trace.

> If I have the time, I will try stock v3.1.4. Meanwhile, you may want to post
> more details about the Squid exit in your environment (error messages,
> backtrace, etc.).

I think this happens when squid gets ENOSPC when writing the log files.
(In reply to comment #60)
> > Meanwhile, you may want to post more details about the Squid exit in your
> > environment (error messages, backtrace, etc.).
>
> I think this happens when squid gets ENOSPC when writing the log files.

Running out of space is unrelated to this bug.

> All interesting "definitely lost" records have HttpReply::operator new or
> HttpRequest::operator new in the trace.

I will try to reproduce the memory leakage using your "run4" logs and the "stopped ICAP server" condition.
(In reply to comment #61)
> I will try to reproduce the memory leakage using your "run4" logs and the
> "stopped ICAP server" condition.

I was able to reproduce most (possibly all) of the "definitely lost" leaks you observed and fixed those related to ICAP. None of the fixed leaks are caused by the patch proposed for this bug, though; the leaks were there before the patch. Thus, I opened a new bug report to accumulate those leak fixes. Please see bug #2964 for the current state.

Let's keep this bug dedicated to the "pure virtual method called" termination.
(In reply to comment #58)
> Meanwhile, you may want to post more details about the Squid exit in your
> environment (error messages, backtrace, etc.).

I tried to reproduce these errors today myself and failed, so I assume that the bugs are gone now. Unfortunately, I have installed some more patches in the meantime, so I can't tell whether it's due to the patches or to some error in my test environment. Anyway, thanks for taking the time.
Created attachment 2217 [details]
Updated patch for 3.1.4

The last patch for 3.1.4 was missing three asyncCall -> CommCallBack replacements in client_side.cc. Therefore it didn't compile under OpenBSD with GCC 3.x (I don't know why it compiled under Linux with GCC 4.x, though).
Is Bug #2774 a dup of this, or are they two different pure virtual methods?
(In reply to comment #65)
> Is Bug #2774 a dup of this, or are they two different pure virtual methods?

We do not know, but it probably is a dupe. I asked for more info on bug #2774 but did not get a response.
A fix, based on the patches posted here, is now committed to trunk as r10779. Stefan, do you mind if I add you to the CONTRIBUTORS file? Do you want your email there?
Alex, his details will already be making their way in from earlier bug patch contributions. That will happen when the automatic http://www.squid-cache.org/Versions/v3/3.HEAD/changesets/AUTHORS file next gets synced with CONTRIBUTORS.
(In reply to comment #68)
> Alex, his details will already be making their way in from earlier bug patch
> contributions. That will happen when the automatic
> http://www.squid-cache.org/Versions/v3/3.HEAD/changesets/AUTHORS file next
> gets synced with CONTRIBUTORS.

Sounds good. Since Stefan is not the author of the code committed for this bug, I did not realize his name would be automatically listed.
Bah. Wrong Stefan. Sorry for the noise.
(In reply to comment #67)
> A fix, based on the patches posted here, is now committed to trunk as r10779.

Thanks. Please add a note here when it gets backported to 3.1.x.

> Stefan, do you mind if I add you to the CONTRIBUTORS file? Do you want your
> email there?

I certainly don't mind being listed there, including my mail address. FWIW, my debugging activities for squid are sponsored by GeNUA mbH.
Applied to 3.2 and 3.1.

NP: if someone can confirm that the next 3.1 snapshot works properly with it in, there will be an early 3.1.8.
*** Bug 2767 has been marked as a duplicate of this bug. ***
*** Bug 3041 has been marked as a duplicate of this bug. ***
*** Bug 2774 has been marked as a duplicate of this bug. ***