Bug 2583 - pure virtual method called
Summary: pure virtual method called
Status: RESOLVED FIXED
Alias: None
Product: Squid
Classification: Unclassified
Component: other (show other bugs)
Version: 3.1
Hardware: All All
: P2 major
Assignee: Alex Rousskov
URL:
: 2767 2774 3041 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-01-31 08:58 UTC by Christos Tsantilas
Modified: 2012-09-21 23:09 UTC (History)
9 users (show)

See Also:
Browser: ---
Fixed Versions:
Needs:


Attachments
a patch solving the problem (408 bytes, patch)
2009-01-31 10:34 UTC, Christos Tsantilas
Details
patch for moving the validation of cbdata from Adaptation::Initiator::announceInitiatorAbort to the caller (772 bytes, patch)
2009-03-10 16:22 UTC, Christos Tsantilas
Details
take5 (65.63 KB, patch)
2009-09-22 00:23 UTC, Alex Rousskov
Details
take6 (69.62 KB, patch)
2009-09-23 23:40 UTC, Alex Rousskov
Details
log for squid-13 with patch. (126.94 KB, application/x-zip-compressed)
2009-10-09 04:20 UTC, yaopeng99
Details
fix for uncaught exception l (818 bytes, patch)
2009-10-21 10:39 UTC, Alex Rousskov
Details
updated patch for 3.1.4 (584.61 KB, patch)
2010-06-08 07:53 UTC, Stefan Fritsch
Details
memory leak report from valgrind (1.57 KB, text/plain)
2010-06-08 07:55 UTC, Stefan Fritsch
Details
patch for 3.1.4 with the .orig files removed (76.26 KB, patch)
2010-06-09 01:56 UTC, Stefan Fritsch
Details
small script to trigger problems (2.59 KB, application/x-perl)
2010-06-10 09:56 UTC, Georg
Details
Updated patch for 3.1.4 (75.77 KB, patch)
2010-07-09 09:43 UTC, Stefan Fritsch
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christos Tsantilas 2009-01-31 08:58:10 UTC
Squid3-trunk crashed with the message:

pure virtual method called
terminate called without an active exception
Aborted (core dumped)


backtrace:
 
#0  0xb7fa7430 in __kernel_vsyscall ()                                          
(gdb) backtrace                                                                 
#0  0xb7fa7430 in __kernel_vsyscall ()                                          
#1  0xb79a3880 in raise () from /lib/tls/i686/cmov/libc.so.6                    
#2  0xb79a5248 in abort () from /lib/tls/i686/cmov/libc.so.6                    
#3  0xb7bc86f8 in __gnu_cxx::__verbose_terminate_handler ()                     
   from /usr/lib/libstdc++.so.6                                                 
#4  0xb7bc65d5 in ?? () from /usr/lib/libstdc++.so.6                            
#5  0xb7bc6612 in std::terminate () from /usr/lib/libstdc++.so.6                
#6  0xb7bc72b5 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6            
#7  0x081f2a05 in JobDialer (this=0xbf9a8270, aJob=0x8df64d8)                   
    at ICAP/AsyncJob.cc:172                                                     
warning: (Internal error: pc 0x8240f26 in read in psymtab, but not in symtab.)  

#8  0x08240f27 in NullaryMemFunT (this=warning: (Internal error: pc 0x8240f26 in read in psymtab, but not in symtab.)                                           

warning: (Internal error: pc 0x8240ee2 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x8240f26 in read in psymtab, but not in symtab.)

0xbf9a8270, anObject=warning: (Internal error: pc 0x8240f26 in read in psymtab, but not in symtab.)                                                             

0x8df64b0, aMethod=warning: (Internal error: pc 0x8240f26 in read in psymtab, but not in symtab.)                                                               


&virtual table offset 8) at ../../src/AsyncJobCalls.h:33
warning: (Internal error: pc 0x8240f88 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x8240f87 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x8240f87 in read in psymtab, but not in symtab.)

#9  0x08240f88 in MemFun<Adaptation::Initiate> (object=warning: (Internal error: pc 0x8240f87 in read in psymtab, but not in symtab.)                           

warning: (Internal error: pc 0x8240f53 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x8240f87 in read in psymtab, but not in symtab.)

0x8df64b0, method=warning: (Internal error: pc 0x8240f87 in read in psymtab, but not in symtab.)                                                                


&virtual table offset 8) at ../../src/AsyncJobCalls.h:76
#10 0x08240d2b in Adaptation::Initiator::announceInitiatorAbort (
    this=0x8dde4f8, x=@0x8dde85c) at Initiator.cc:28
#11 0x0810f5af in ~ClientHttpRequest (this=0x8dde4f8)
    at client_side_request.cc:263
#12 0x080faad7 in ~ClientSocketContext (this=0xb6faf0c0) at client_side.cc:256
#13 0x080fb088 in ConnStateData::freeAllContexts (this=0x8dd5e04)
    at ../include/RefCount.h:100
#14 0x080fd1ec in ConnStateData::swanSong (this=0x8dd5e04)
    at client_side.cc:617
---Type <return> to continue, or q <return> to quit---
#15 0x081f3df1 in AsyncJob::callEnd (this=0x8dd5ee4) at ICAP/AsyncJob.cc:135
#16 0x081f30c9 in JobDialer::dial (this=0x8beced4, call=@0x8beceb8)
    at ICAP/AsyncJob.cc:222
#17 0x08106958 in AsyncCallT<CommCbMemFunT<ConnStateData, CommCloseCbParams> >::fire (this=0x8beceb8) at AsyncCall.h:129
#18 0x080e2e26 in AsyncCall::make (this=0x8beceb8) at AsyncCall.cc:34
#19 0x080e2176 in AsyncCallQueue::fireNext (this=0x8b9ee50)
    at AsyncCallQueue.cc:53
#20 0x080e2320 in AsyncCallQueue::fire (this=0x8b9ee50) at AsyncCallQueue.cc:39
#21 0x0814bb47 in EventLoop::runOnce (this=0xbf9a8514) at EventLoop.cc:131
#22 0x0814bc10 in EventLoop::run (this=0xbf9a8514) at EventLoop.cc:95
#23 0x081971a3 in main (argc=5, argv=0xbf9a8604) at main.cc:1350
(gdb)
Comment 1 Christos Tsantilas 2009-01-31 09:30:59 UTC
Some more info:

(gdb)frame 7
#7  0x081f2a05 in JobDialer (this=0xbf9a8270, aJob=0x8df64d8)
    at ICAP/AsyncJob.cc:172
172             lock = cbdataReference(aJob->toCbdata());
(gdb) print *aJob
$1 = {_vptr.AsyncJob = 0x82cee28, stopReason = 0x0,
  typeName = 0x8288052 "ICAPModXactLauncher", inCall = {p_ = 0x0}, id = 22958,
  static TheLastId = 22959}
(gdb)
Comment 2 Christos Tsantilas 2009-01-31 10:34:29 UTC
Created attachment 1898 [details]
a patch solving the problem


Inside the Adaptation::Initiator::announceInitiatorAbort method  the x==NULL check is not enough. We must also check if the x variable (of type Initiate) is cbdataValid.
Comment 3 Christos Tsantilas 2009-01-31 10:47:17 UTC
Applied to trunk
Comment 4 Amos Jeffries 2009-02-02 04:18:15 UTC
Applied to Squid 3.1
Comment 5 Alex Rousskov 2009-02-13 11:41:18 UTC
I think the patch masks the problem but does not solve it. We should not be testing whether x is cbdata-valid here unless we want to test it everywhere (and we do not want that!). 

Adaptation::Initiator::announceInitiatorAbort does not dereference x and, hence, should not care whether x is valid.

Can you find which function or macro are dereferencing x and fix _that_ code, please?

Thank you.
Comment 6 Christos Tsantilas 2009-02-13 12:04:30 UTC
(In reply to comment #5)
> I think the patch masks the problem but does not solve it. We should not be
> testing whether x is cbdata-valid here unless we want to test it everywhere
> (and we do not want that!). 
OK
> 
> Adaptation::Initiator::announceInitiatorAbort does not dereference x and,
> hence, should not care whether x is valid.
OK I understand.
> 
> Can you find which function or macro are dereferencing x and fix _that_ code,
> please?

I will try. I need to find the test case again.

> 
> Thank you.
> 

Comment 7 Christos Tsantilas 2009-02-13 12:11:36 UTC
(In reply to comment #5)
> Adaptation::Initiator::announceInitiatorAbort does not dereference x and,
> hence, should not care whether x is valid.

However this is not absolutely true. It calls the Adaptation::Initiator::clearAdaptation which do a cbdataReferenceDone(x).
Comment 8 Alex Rousskov 2009-02-13 12:50:36 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > Adaptation::Initiator::announceInitiatorAbort does not dereference x and,
> > hence, should not care whether x is valid.
> 
> However this is not absolutely true. It calls the
> Adaptation::Initiator::clearAdaptation which do a cbdataReferenceDone(x).

Again, we should not check what we do not use. Adaptation::Initiator::announceInitiatorAbort does not use *x and, hence, should not care whether the object pointed by x is valid.

Now, you are right that Adaptation::Initiator::clearAdaptation calls cbdataReferenceDone(x). But clearAdaptation does not dereference x either!

I am trying to prevent two distinct things from happening here:

1) Somebody looking at the patched code and deciding that they always need to check whether a pointer to the job is valid before sending a message to that job. A year later, somebody else looks at all those checks and decides that when the message is being sent, the job must be valid and writes the code with that incorrect assumption in mind. Two years after that, the AsyncCalls would need to be replaced because the old code becomes unusable due to a mix wrong assumptions.

2) There is clearly some buggy code that dereferences a job pointer without checking its validity. The proposed fix masks that bug. The bug will remain in the code and cause more problems down the road. In fact, there could be a bigger design problem lurking here (see my rant below).


My unverified suspicion is that the problem is with AsyncJob::toCbdata() hack. We should not access virtual x->toCbdata if x is invalid but we need that access to check whether x is invalid. I think this has been discussed somewhere. I need to research it, but do not let that block you!

BTW, in the committed fix, how do you know that "(void*)x" is a cbdata pointer? Could it point to some other part of the cbdata-protected object? This was the reason we added toCbdata(), IIRC.

Comment 9 Christos Tsantilas 2009-02-13 13:23:52 UTC
(In reply to comment #8)
> 
> My unverified suspicion is that the problem is with AsyncJob::toCbdata() hack.
> We should not access virtual x->toCbdata if x is invalid but we need that
> access to check whether x is invalid. I think this has been discussed
> somewhere. I need to research it, but do not let that block you!

I think this was the first think I checked when I looked into this bug. But I did not found any wrong with cbdata.

> 
> BTW, in the committed fix, how do you know that "(void*)x" is a cbdata pointer?
> Could it point to some other part of the cbdata-protected object? This was the
> reason we added toCbdata(), IIRC.
> 
I can not  answer this question right now but as I can remember I had check carefully related cases. I remembered the problems we had with cbdata and the reason we added toCbdata, and this was my first thought about the problem (I paniced).
The cbdata pointer passed was the correct pointer just the data was corrupted.
But maybe I am loosing something. And even if the problem is just "not cbdata valid data" it is good to know the exact reason.
Comment 10 Alex Rousskov 2009-02-13 13:32:55 UTC
If the error was caused by

  cbdataReference(aJob->toCbdata())

and the fix is checking, essentially, 

  cbdataReferenceValid(aJob)

then the fix may not just mask a bug, but is a bug by itself because aJob may not be a cbdata pointer.

Another possibility is that x is not a job pointer, of course.

Either way, the combination of the above two calls is invalid.

Can you reproduce this bug on demand?
Comment 11 Christos Tsantilas 2009-02-13 14:18:44 UTC
(In reply to comment #10)
> Can you reproduce this bug on demand?
> 
Nope. It was start happened again and again while browsing one site. I can not reproduce it any more browsing the same site. Maybe it was related with icap server problems. Xmm...
Comment 12 Christos Tsantilas 2009-02-15 14:52:15 UTC
I can not reproduce this bug.
I think, it is still here because I hit an assertion I put in Adaptation::Initiator::announceInitiatorAbort, but only once and the core file was bad, with no debug info.
I did not keep the debug info when I discover this bug. I believed that it was easy to reproduce it (@@#$&^%#@?!)...

What should we do now? Is it OK to add an assertion checking in ClientHttpRequest::~ClientHttpRequest() just before we call the announceInitiatorAbort method?
The "pure virtual method called" message is bad.....
Comment 13 Alex Rousskov 2009-02-16 09:54:44 UTC
(In reply to comment #12)
> What should we do now? Is it OK to add an assertion checking in
> ClientHttpRequest::~ClientHttpRequest() just before we call the
> announceInitiatorAbort method?
> The "pure virtual method called" message is bad.....

Adding a correct assertion is a good idea, of course. Just do not add it to the function that does not dereference x. Add it to the code that does. Perhaps it will be even triggered sooner there.

It would be terrific if you could verify that we always

1) Use foo->toCbdata() when converting an AsyncJob to a cbdata-protected void* pointer and

2) Use static_cast<AsyncJob*>(cbdata) when converting a cbdata-protected void* pointer to a job. It is probably a good idea to add static AsyncJob::FromCbdata(void*) method for this purpose, BTW. The same static method could check void* pointer validity before the conversion and throw or return NULL if the pointer is invalid. 

If the above two rules are always followed, we should not run into the bug reported here. As you can guess from the above discussion, I am worried that in the current code, 

a) x is not always x->toCbdata() or
b) void* pointer is converted to an AsyncJob child without going through the AsyncJob or
c) void* pointer is converted to an AsyncJob child when the pointer is not cbdata-valid.

HTH.
Comment 14 Christos Tsantilas 2009-03-10 16:22:28 UTC
Created attachment 1925 [details]
patch for moving the validation of  cbdata from Adaptation::Initiator::announceInitiatorAbort  to the caller

This patch removes the cbdata validation from the  Adaptation::Initiator::announceInitiatorAbort method and adds a related check inside the ClientHttpRequest::~ClientHttpRequest destructor just before call the Adaptation::Initiator::announceInitiatorAbort method.
Comment 15 Christos Tsantilas 2009-03-10 16:34:39 UTC
(In reply to comment #13)

> Adding a correct assertion is a good idea, of course. Just do not add it to the
> function that does not dereference x. Add it to the code that does. Perhaps it
> will be even triggered sooner there.

it is OK the patch in attachment (id=1925)

>.......
> reported here. As you can guess from the above discussion, I am worried that in
> the current code, 
> 
> a) x is not always x->toCbdata() or
I do not think that this was the problem. I am sure I had test it and the x was the x->toCbdata ....

> b) void* pointer is converted to an AsyncJob child without going through the
> AsyncJob or
> c) void* pointer is converted to an AsyncJob child when the pointer is not
> cbdata-valid.

Looks possible.
But it is very difficult for me to check it without reproducing it.
This bug disappeared and I did not hit it again.
I am not planning to work more on this bug at least until I can reproduce it.

Comment 16 Christos Tsantilas 2009-04-27 14:23:06 UTC
The patch in attachment (id=1925) applied to trunk
Comment 17 Alex Rousskov 2009-04-27 16:00:23 UTC
(In reply to comment #16)
> The patch in attachment (id=1925) [details] applied to trunk

I [still] disagree with this patch because it asserts something that does not have to be true. The virginHeadSource job pointer may become cbdata-invalid if that job ends before we have a chance to call it back. Why are we asserting that it is valid?

It is possible that AsyncJobDialer needs to be fixed to avoid calling cbdataReference in its constructor if aJob is cbdata-invalid:

JobDialer::JobDialer(AsyncJob *aJob): job(NULL), lock(NULL)
{
    if (aJob) {
        lock = cbdataReference(aJob->toCbdata());
        job = aJob;
    }
}

However, this brings us back to my original concern: If a job pointer becomes cbdata-invalid, we cannot call aJob->toCbdata(). On the other hand, we must call aJob->toCbdata() to find out whether the job pointer is cbdata-valid.

Why did you apply the patch?

Thank you,

Alex.
Comment 18 Christos Tsantilas 2009-04-28 08:44:56 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > The patch in attachment (id=1925) [details] [details] applied to trunk
> 
> I [still] disagree with this patch because it asserts something that does not
> have to be true. The virginHeadSource job pointer may become cbdata-invalid if
> that job ends before we have a chance to call it back. Why are we asserting
> that it is valid?
> 
> It is possible that AsyncJobDialer needs to be fixed to avoid calling
> cbdataReference in its constructor if aJob is cbdata-invalid:
> 
> JobDialer::JobDialer(AsyncJob *aJob): job(NULL), lock(NULL)
> {
>     if (aJob) {
>         lock = cbdataReference(aJob->toCbdata());
>         job = aJob;
>     }
> }
> 
> However, this brings us back to my original concern: If a job pointer becomes
> cbdata-invalid, we cannot call aJob->toCbdata(). On the other hand, we must
> call aJob->toCbdata() to find out whether the job pointer is cbdata-valid.
> 
> Why did you apply the patch?

The first patch, which was  not correct, never removed from trunk. So I remove it and I just add the assertion to have a better crash than the "pure virtual method called" message.
OK, I will remove the assertion too.

Comment 19 Amos Jeffries 2009-05-03 05:39:53 UTC
Reverted from 3.1
Comment 20 Alex Rousskov 2009-05-11 18:13:08 UTC
I believe Christos have inadvertently confirmed my fears while working on an ICAP project. Here is the summary:

AsyncJob::callEnd deletes the job if the job is completed. If there are no cbdataReference "locks" on the job pointer, the job state (including cbdata!) is destroyed. After that, calling the job methods or even dereferencing the job pointer becomes unsafe.

We should always hold the lock on the job (via cbdataReference) if we want to communicate with the job in the future. We cannot just store a job pointer.
In fact, pointers to jobs in general code should be outlawed.

We cannot safely call job->cbdata() if the job has been freed
and, hence, we cannot know job's cbdata pointer! We have to store
both the cbdata pointer and job pointer to be able to communicate
with the job.

This is very similar to what JobDialer already does, but we cannot create JobDialer when the transaction may have already been destroyed! We must create it (or a similar class) _before_ that happens. 

The right general solution here may be a smart JobPointer class that solves this problem by storing an AsyncJob pointer and a cbdata pointer, just like JobDialer
does. Regular Job pointers must be replaced with JobPointers.
Comment 21 Robert Collins 2009-05-11 18:24:51 UTC
> We cannot safely call job->cbdata() if the job has been freed
> and, hence, we cannot know job's cbdata pointer! We have to store
> both the cbdata pointer and job pointer to be able to communicate
> with the job.

If cbdata is a non virtual method it can be called safely at any time; this will be the value of the pointer, and can do validation as needed.


Comment 22 Alex Rousskov 2009-05-11 20:37:16 UTC
(In reply to comment #21)
> > We cannot safely call job->cbdata() if the job has been freed
> > and, hence, we cannot know job's cbdata pointer! We have to store
> > both the cbdata pointer and job pointer to be able to communicate
> > with the job.
> 
> If cbdata is a non virtual method it can be called safely at any time; this
> will be the value of the pointer, and can do validation as needed.

cbdata() is virtual.
Comment 23 Alex Rousskov 2009-05-11 21:26:42 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > > We cannot safely call job->cbdata() if the job has been freed
> > > and, hence, we cannot know job's cbdata pointer! We have to store
> > > both the cbdata pointer and job pointer to be able to communicate
> > > with the job.
> > 
> > If cbdata is a non virtual method it can be called safely at any time; this
> > will be the value of the pointer, and can do validation as needed.
> 
> cbdata() is virtual.

Just wanted to clarify in case somebody tries to use this bug report as a documentation piece :-)

AsyncJob::toCbdata() is virtual. It returns the pointer to object's cbdata. That cbdata pointer is not always the same as "this" pointer, due to multiple inheritance.

AsyncJob is not a CBDATA_CLASS but its kids are. Thus, AsyncJob kids get a virtual toCbdata() even though the CBDATA_CLASS macros do not explicitly declare toCbdata() as virtual. Any specific job class must be a CBDATA_CLASS so that the right cbdata-specific new/delete methods are called, but I am not sure that is enforced.

All this ugliness is meant to work around the incompatibilities between a C hack (cbdata) and C++ multiple inheritance. Cbdata-compatible improvement suggestions are very welcome.

Hopefully, we will improve or replace cbdata API in Squid v3.3 to avoid this mess.
Comment 24 Robert Collins 2009-05-11 21:56:28 UTC
Is there any reason we can't use a template rather than subclassing to get the specialesed cbdata information? 
Comment 25 Alex Rousskov 2009-05-11 22:39:39 UTC
(In reply to comment #24)
> Is there any reason we can't use a template rather than subclassing to get the
> specialesed cbdata information? 

Sorry, I am not sure what you mean. Which subclasses do you want to replace and with what template?
Comment 26 Alex Rousskov 2009-09-21 14:54:40 UTC
Looks like this bug is still in v3.1: I saw a few recent crashes with similar symptoms. I am working on a fix.

I tried to assign the bug to myself but it looks like the new bugzilla makes that impossible.
Comment 27 lijun422 2009-09-21 19:27:04 UTC
(In reply to comment #26)
> Looks like this bug is still in v3.1: I saw a few recent crashes with similar
> symptoms. I am working on a fix.
> I tried to assign the bug to myself but it looks like the new bugzilla makes
> that impossible.

We are doing performance test of squid and our icap server, and this problem occurs from time to time. 
We use loadrunner to send the same request to squid, up to 2500 tps. The bug can be reproduced within hours.
Looking forward to the fix.
Comment 28 Alex Rousskov 2009-09-21 21:58:35 UTC
(In reply to comment #27)
> The bug can be reproduced within hours.
> Looking forward to the fix.

Would you be interested in and able to test preliminary patches for this bug? I will post them here if you will test.

Thank you.
Comment 29 lijun422 2009-09-21 22:43:43 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > The bug can be reproduced within hours.
> > Looking forward to the fix.
> Would you be interested in and able to test preliminary patches for this bug? I
> will post them here if you will test.
> Thank you.

Pleasure to help to test.
Comment 30 Alex Rousskov 2009-09-22 00:23:40 UTC
Created attachment 2052 [details]
take5

The attached patch is our current diff against Squid3 trunk. It might apply to Squid v3.1. You should bootstrap because we added on header file. These changes have not been tested yet so the patch may crash your Squid immediately.

Please keep me posted on your testing progress.
Comment 31 lijun422 2009-09-22 01:15:54 UTC
(In reply to comment #30)
> Created an attachment (id=2052) [details]
> take5
> The attached patch is our current diff against Squid3 trunk. It might apply to
> Squid v3.1. You should bootstrap because we added on header file. These changes
> have not been tested yet so the patch may crash your Squid immediately.
> Please keep me posted on your testing progress.

Ok. I'll apply it to 3.1.0.13 and post test result here.
Comment 32 lijun422 2009-09-22 22:07:57 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > Created an attachment (id=2052) [details] [details]
> > take5
> > The attached patch is our current diff against Squid3 trunk. It might apply to
> > Squid v3.1. You should bootstrap because we added on header file. These changes
> > have not been tested yet so the patch may crash your Squid immediately.
> > Please keep me posted on your testing progress.
> Ok. I'll apply it to 3.1.0.13 and post test result here.

It crashed immediately after receiving a request.

Segmentation fault

Here is the last part of cache.log:
AcceptFD::acceptOne accepted: FD 13 newfd: 9 from: 150.236.80.251:5945 handler: SomeCommAcceptHandler(FD -1, data=0x9e5f0d8)
2009/09/23 12:03:01.756| comm.cc(2324) will call SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9) [call5]
2009/09/23 12:03:01.756| entering SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9)
2009/09/23 12:03:01.756| AsyncCall.cc(32) make: make call SomeCommAcceptHandler [call5]
2009/09/23 12:03:01.756| comm_accept: FD 13 handler: 0x80dcd00
2009/09/23 12:03:01.756| The AsyncCall SomeCommAcceptHandler constructed, this=0xa117d40 [call20]
2009/09/23 12:03:01.756| comm.cc(2289) acceptOne: try later: FD 13 handler: SomeCommAcceptHandler(FD -1, data=0x9e5f0d8)
2009/09/23 12:03:01.756| httpAccept: FD 9: accepted
2009/09/23 12:03:01.756| AsyncJob of type ConnStateData constructed, this=0xa117f6c [async1]
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::connStateClosed constructed, this=0xa11be50 [call21]
2009/09/23 12:03:01.756| comm_add_close_handler: FD 9, AsyncCall=0xa11be50*1
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::requestTimeout constructed, this=0xa11bea0 [call22]
2009/09/23 12:03:01.756| comm.cc(1185) commSetTimeout: FD 9 timeout 900
2009/09/23 12:03:01.756| clientReadSomeData: FD 9: reading request...
2009/09/23 12:03:01.756| The AsyncCall ConnStateData::clientReadRequest constructed, this=0xa11bef0 [call23]
2009/09/23 12:03:01.756| comm_read, queueing read for FD 9; asynCall 0xa11bef0*1
2009/09/23 12:03:01.756| leaving SomeCommAcceptHandler(FD 13, data=0x9e5f0d8, newFD 9)
2009/09/23 12:03:01.757| comm_read_try: FD 9, size 4095, retval 574, errno 0
2009/09/23 12:03:01.757| commio_finish_callback: called for FD 9 (0, 0)
2009/09/23 12:03:01.757| comm.cc(163) will call ConnStateData::clientReadRequest(FD 9, data=0xa117e58, size=574, buf=0xb7e0d008) [call23]
2009/09/23 12:03:01.757| entering ConnStateData::clientReadRequest(FD 9, data=0xa117e58, size=574, buf=0xb7e0d008)
2009/09/23 12:03:01.757| AsyncCall.cc(32) make: make call ConnStateData::clientReadRequest [call23
Comment 33 Alex Rousskov 2009-09-22 23:03:20 UTC
Thank you for testing! There will be a better version of the patch available within 24 hours.

Did the patch apply cleanly to v3.1.0.13?
Comment 34 lijun422 2009-09-22 23:51:26 UTC
(In reply to comment #33)
> Thank you for testing! There will be a better version of the patch available
> within 24 hours.
> Did the patch apply cleanly to v3.1.0.13?

Yes. I applied it to squid-3.1.0.13-20090917.
Comment 35 Alex Rousskov 2009-09-23 23:40:25 UTC
Created attachment 2053 [details]
take6

This patch against trunk r9997 survived some mild abuse but more testing is needed. Bootstrapping is still required. Please test.
Comment 36 lijun422 2009-09-23 23:49:01 UTC
(In reply to comment #35)
> Created an attachment (id=2053) [details]
> take6
> This patch against trunk r9997 survived some mild abuse but more testing is
> needed. Bootstrapping is still required. Please test.

Ok.
Comment 37 lijun422 2009-09-24 22:07:49 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > Created an attachment (id=2053) [details] [details]
> > take6
> > This patch against trunk r9997 survived some mild abuse but more testing is
> > needed. Bootstrapping is still required. Please test.
> Ok.

I applied the new patch to squid 3.1.0.13(release candidate released on 4th August) and did some basic test. So far the result is good. 
We plan to do performance test on Sunday since our new test environment is not ready until then. Hopefully we'll get performance result next week and we'll keep you updated.

Thanks.
Comment 38 yaopeng99 2009-09-25 00:07:10 UTC
(In reply to comment #37)
> (In reply to comment #36)
> > (In reply to comment #35)
> > > Created an attachment (id=2053) [details] [details] [details]
> > > take6
> > > This patch against trunk r9997 survived some mild abuse but more testing is
> > > needed. Bootstrapping is still required. Please test.
> > Ok.
> I applied the new patch to squid 3.1.0.13(release candidate released on 4th
> August) and did some basic test. So far the result is good. 
> We plan to do performance test on Sunday since our new test environment is not
> ready until then. Hopefully we'll get performance result next week and we'll
> keep you updated.
> Thanks.

I will be in charge of performance test and I will update performance result to you.
Comment 39 yaopeng99 2009-09-29 23:38:41 UTC
(In reply to comment #35)
> Created an attachment (id=2053) [details]
> take6
> This patch against trunk r9997 survived some mild abuse but more testing is
> needed. Bootstrapping is still required. Please test.

We have applied this patch to squid-3.1.0.13 and test it.
It didn't crash. But we see some icap broken exception in log in performance test up to 400 tps. Log is as follows:

2009/09/30 12:23:55.845| entering AsyncJob::noteStart()
2009/09/30 12:23:55.845| AsyncCall.cc(32) make: make call AsyncJob::noteStart [call832772]
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact status in: [G/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(1541) estimateVirginBody: does not expect virgin body
2009/09/30 12:23:55.845| ModXact.cc(98) waitForService: will wait for the ICAP service [G/R icapx211440]
2009/09/30 12:23:55.845| The AsyncCall Adaptation::Icap::ModXact::noteServiceReady constructed, this=0xe09900 [call832967]
2009/09/30 12:23:55.845| ServiceRep.cc(201) callWhenReady: Adaptation::Icap::Service is asked to call Adaptation::Icap::ModXact::noteServiceReady() when ready [down,susp,stale,fail11]
2009/09/30 12:23:55.845| ../../src/base/AsyncJob.h(151) dial: AsyncJob::noteStart threw exception: !broken()
2009/09/30 12:23:55.845| Xaction.cc(438) setOutcome: ICAP_ERR_OTHER
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact will stop, reason: exception
2009/09/30 12:23:55.845| AsyncJob::noteStart() ends job [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(1095) swanSong: swan sings [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(443) stopWriting: will no longer write [UG/R icapx211440]
2009/09/30 12:23:55.845| ModXact.cc(550) stopSending: Enter stop sending 
2009/09/30 12:23:55.845| ModXact.cc(553) stopSending: Proceed with stop sending 
2009/09/30 12:23:55.845| ModXact.cc(565) stopSending: will not start sending [UG/Rw icapx211440]
2009/09/30 12:23:55.845| Initiate.cc(58) swanSong: swan sings [UG/RwS icapx211440]
2009/09/30 12:23:55.845| Initiate.cc(65) swanSong: swan sang [UG/RwS icapx211440]
2009/09/30 12:23:55.845| Adaptation::Icap::ModXact destructed, this=0xf4a638 [icapx211440]
2009/09/30 12:23:55.845| AsyncJob.cc(139) callEnd: AsyncJob::noteStart() ended 0xf4abf0
2009/09/30 12:23:55.845| leaving AsyncJob::noteStart()


We do same performance test to original squid-3.1.0.13 without patch. It crashed, we don't find this exception log in it.

Is this exception a bug of this patch? 

Now we test it in a temporary test environment and tps can't be very high. We will start to test it in normal test environment on Oct.10th.
Comment 40 Alex Rousskov 2009-09-30 02:16:01 UTC
> Adaptation::Icap::Service is asked to call
> Adaptation::Icap::ModXact::noteServiceReady() when ready
> [down,susp,stale,fail11]
> 2009/09/30 12:23:55.845| ../../src/base/AsyncJob.h(151) dial:
> AsyncJob::noteStart threw exception: !broken()

The above smells like a regular exception associated with a down, suspended ICAP service. The service failed 11 times and got suspended because of those failures. ICAP transactions do not wait for suspended services (because there is usually no point to wait as the service will remain unusable for some time).

I cannot be sure, but this problem does not look directly related to the patch unless there were/are some other bugs that the fix in take6 exposed.

Any signs of trouble on the ICAP server side?

Do you have a longer log showing ICAP OPTIONS exchange or other ICAP failures?

Thank you.
Comment 41 yaopeng99 2009-10-09 04:20:28 UTC
Created attachment 2059 [details]
log for squid-13 with patch.

In the attachment log file, we can see there are 11 exception of cannot connect to the ICAP service and then throw broken exception. So I think maybe it is a regular exception. We don't meet these log in original squid-13 because it dump before icap down. We will do more tests to watch it.

Thanks.
Comment 42 Alex Rousskov 2009-10-09 08:56:32 UTC
The ICAP exceptions in your log look correct to me. Your Squid is running out of file descriptors and, hence, has trouble opening new connections for ICAP requests:

comm_open: socket failure: (24) Too many open files
Xaction.cc(218) dieOnConnectionFailure: Adaptation::Icap::ModXact failed to connect to icap://150.236.238.164:1345/request

This goes own until the ICAP service is marked as down after 10 failures: 

ServiceRep.cc(74) noteFailure:  failure 11 out of 10 allowed [up,fail11]
2009/09/30 12:22:31.867| suspending ICAP service for too many failures

After that, new ICAP transactions throw !broken exceptions, which is correct behavior:

dial: AsyncJob::noteStart threw exception: !broken()
Xaction.cc(438) setOutcome: ICAP_ERR_OTHER
Adaptation::Icap::ModXact will stop, reason: exception

This should stop after Squid tries to reach the the ICAP service again and has enough descriptors to do it.

I would not be surprised if the "running out of descriptors" condition was the trigger for the bug we are trying to fix here because it leads to ICAP transaction destruction during transaction startup. I bet the unpatched code cannot handle that reliably.
Comment 43 yaopeng99 2009-10-19 01:32:14 UTC
(In reply to comment #42)
> The ICAP exceptions in your log look correct to me. Your Squid is running out
> of file descriptors and, hence, has trouble opening new connections for ICAP
> requests:
> comm_open: socket failure: (24) Too many open files
> Xaction.cc(218) dieOnConnectionFailure: Adaptation::Icap::ModXact failed to
> connect to icap://150.236.238.164:1345/request
> This goes own until the ICAP service is marked as down after 10 failures: 
> ServiceRep.cc(74) noteFailure:  failure 11 out of 10 allowed [up,fail11]
> 2009/09/30 12:22:31.867| suspending ICAP service for too many failures
> After that, new ICAP transactions throw !broken exceptions, which is correct
> behavior:
> dial: AsyncJob::noteStart threw exception: !broken()
> Xaction.cc(438) setOutcome: ICAP_ERR_OTHER
> Adaptation::Icap::ModXact will stop, reason: exception
> This should stop after Squid tries to reach the the ICAP service again and has
> enough descriptors to do it.
> I would not be surprised if the "running out of descriptors" condition was the
> trigger for the bug we are trying to fix here because it leads to ICAP
> transaction destruction during transaction startup. I bet the unpatched code
> cannot handle that reliably.

We don't meet virtual dump problem after many performance tests. 
Thank you very much.
Comment 44 Alex Rousskov 2009-10-21 10:39:16 UTC
Created attachment 2065 [details]
fix for uncaught exception l

The attached patch should fix the following bug in patch take6:

15:42:40| dying from an unhandled exception: l
Comment 45 yaopeng99 2009-10-21 20:44:37 UTC
(In reply to comment #44)
> Created an attachment (id=2065) [details]
> fix for uncaught exception l
> The attached patch should fix the following bug in patch take6:
> 15:42:40| dying from an unhandled exception: l

OK, I will apply it to our squid.
Comment 46 Stefan Fritsch 2010-06-08 07:52:38 UTC
I can still reproduce these crashes with 3.1.4.

I have tried the patches (they can be made to apply with some tweaking) and the crashes disappear, but squid seems to be leaking memory rather badly with them.
Comment 47 Stefan Fritsch 2010-06-08 07:53:55 UTC
Created attachment 2180 [details]
updated patch for 3.1.4

this was the patch I used
Comment 48 Stefan Fritsch 2010-06-08 07:55:39 UTC
Created attachment 2181 [details]
memory leak report from valgrind

this is the biggest memory leak reported by valgrind (and the only significant one reported as 'definitely lost')
Comment 49 Alex Rousskov 2010-06-08 09:18:09 UTC
(In reply to comment #48)
> Created attachment 2181 [details]
> memory leak report from valgrind
> 
> this is the biggest memory leak reported by valgrind (and the only significant
> one reported as 'definitely lost')

Thank you for testing and updating the patch. Did you build Squid with --with-valgrind-debug to avoid false positives in valgrind reports?
Comment 50 Stefan Fritsch 2010-06-08 09:28:02 UTC
(In reply to comment #49)
> Thank you for testing and updating the patch. Did you build Squid with
> --with-valgrind-debug to avoid false positives in valgrind reports?

Yes, I have used --with-valgrind-debug
Comment 51 Alex Rousskov 2010-06-08 11:50:16 UTC
AFAICT, your valgrind report says that HttpRequest allocated inside Adaptation::Icap::ModXact::encapsulateHead() is leaking. That method calls both new and delete for HttpRequest. The only way a leak is possible in-between is if there is an exception thrown, but that should not happen (yes, we should use auto_ptr here).

Similar code is running in production without [significant] leaks, with ICAP enabled, but it is possible that it does not use the exact same code paths, of course.

Can you enable full debugging (debug_options ALL,9) and run Squid for a while so that the leak becomes obvious, posting both cache.log and valgrind log (all records not just the summary)?

Thank you.
Comment 52 Alex Rousskov 2010-06-08 11:53:31 UTC
(In reply to comment #47)
> Created attachment 2180 [details]
> updated patch for 3.1.4
> 
> this was the patch I used

If you have a chance, please remove irrelevant hunks from the updated patch. For example, there are a few large hunks comparing two .orig files.
Comment 53 Stefan Fritsch 2010-06-09 01:56:43 UTC
Created attachment 2182 [details]
patch for 3.1.4 with the .orig files removed

> If you have a chance, please remove irrelevant hunks from the updated patch.
> For example, there are a few large hunks comparing two .orig files.

Sorry, fixed.
Comment 54 Stefan Fritsch 2010-06-09 02:10:57 UTC
Thanks for your quick response.

(In reply to comment #51)
> AFAICT, your valgrind report says that HttpRequest allocated inside
> Adaptation::Icap::ModXact::encapsulateHead() is leaking. That method calls both
> new and delete for HttpRequest. The only way a leak is possible in-between is
> if there is an exception thrown, but that should not happen (yes, we should use
> auto_ptr here).

Actually, I was fuzzing squid with requests that are aborted at various stages or that may be invalid. If I use only ab for doing requests squid does not leak. Also, the ICAP server was not running during the tests.

I will try to provide more verbose logging later.
Comment 55 Stefan Fritsch 2010-06-09 07:52:37 UTC
Comment on attachment 2181 [details]
memory leak report from valgrind

It seems valgrind support was not active in squid while creating this because I didn't have valgrind installed at configure/compile time.
Comment 56 Georg 2010-06-10 09:50:08 UTC
I tested the provided patch (fixed version from sf) as follows:

Squid Cache: Version 3.1.4
configure options:  '--prefix=/usr/local/squid3.1' '--enable-icap-client' '--with-default-user=proxy' '--with-large-files' '--enable-follow-x-forwarded-for' '--without-pthreads' '--enable-ltdl-convenience' '--enable-storeio=ufs,aufs' '--with-squid=/home/user/squid-3.1.4' --with-squid=/home/user/squid-3.1.4

Config file is mostly standard, ICAP is _not_ enabled.

I have a test script to trigger the same requests as sf (attached). I
managed to trigger the 'pure virtual method' and in squid 3.1.3 i
managed to trigger bug #2896.

If I run that script right now I get (reliably) a premature exit with
exit code 1.
Comment 57 Georg 2010-06-10 09:56:20 UTC
Created attachment 2183 [details]
small script to trigger problems

A small hacky script to trigger bug 2583. Sends a number of (possibly incomplete)
requests to a webserver, any webserver should be fine. In my case its an
apache1.3 with a modperl extension that accepts any url.

You have to change the values $host and $proxy in the script.
Comment 58 Alex Rousskov 2010-06-10 11:20:11 UTC
FWIW, I was unable to reproduce the problem using your script and the 3p1-rock lp branch from where the original fixes came from. After many requests, the script quits with "now reading remote header" and "Alarm clock" messages on the console. Squid keeps running. Perhaps my Squid configuration, code, or environment differs too much.

Should any origin server (without any special configuration) work for your test cases? Does it need to have some PUT/POST-accepting CGI?

If I have the time, I will try stock v3.1.4. Meanwhile, you may want to post more details about the Squid exit in your environment (error messages, backtrace, etc.).
Comment 59 Stefan Fritsch 2010-06-11 04:13:29 UTC
I could not reproduce the big leak from comment 48 so far. But the logs at
http://www.sfritsch.de/~stf/squid/run2/ show some other leaks and some invalid memory accesses.

During the run I stopped the ICAP server while squid was under load. This lead to a significant increase in squid's VIRT and RSS memory size. Maybe the leaks shown by valgrind are related to the ICAP server being stopped. Squid (or valgrind?) continued to use 100% CPU after the load stopped. It still had 114 connections to the origin server open, despite the client script never having created more than 75 connections in parallel. After some minutes I have stopped the origin server too. After 1-2 more minutes, squid's CPU usage went down to 0% again.

The first leak report in the valgrind output was produced by calling mgr:mem after squid had settled to 0% CPU. The second one was created by squid's shutdown.

I will try to create more logs.
Comment 60 Stefan Fritsch 2010-06-18 08:22:58 UTC
(In reply to comment #58)
> FWIW, I was unable to reproduce the problem using your script and the 3p1-rock
> lp branch from where the original fixes came from. After many requests, the
> script quits with "now reading remote header" and "Alarm clock" messages on the
> console. Squid keeps running. Perhaps my Squid configuration, code, or
> environment differs too much.
> 
> Should any origin server (without any special configuration) work for your test
> cases? Does it need to have some PUT/POST-accepting CGI?

For me, squid leaks without the origin server having any special configuration/content, it returns 404 for all requests. 

I have another valgrind trace and cache.log at http://www.sfritsch.de/~stf/squid/run4/ . All interesting 'definitely' lost records have 'HttpReply::operator new' or 'HttpRequest::operator new' in the trace.

> If I have the time, I will try stock v3.1.4. Meanwhile, you may want to post
> more details about the Squid exit in your environment (error messages,
> backtrace, etc.).

I think this happens when squid gets ENOSPC when writing the log files.
Comment 61 Alex Rousskov 2010-06-18 09:54:33 UTC
(In reply to comment #60)
 
> > Meanwhile, you may want to post
> > more details about the Squid exit in your environment (error messages,
> > backtrace, etc.).
> 
> I think this happens when squid gets ENOSPC when writing the log files.

Running out of space is unrelated to this bug.

> All interesting 'definitely' lost
> records have 'HttpReply::operator new' or 'HttpRequest::operator new' in the
> trace.

I will try to reproduce the memory leakage using your "run4" logs and the "stopped ICAP server" condition.
Comment 62 Alex Rousskov 2010-06-22 09:22:50 UTC
(In reply to comment #61) 
> > All interesting 'definitely' lost
> > records have 'HttpReply::operator new' or 'HttpRequest::operator new' in the
> > trace.
> 
> I will try to reproduce the memory leakage using your "run4" logs and the
> "stopped ICAP server" condition.

I was able to reproduce most (possibly all) "definitive" leaks you observed and fixed those related to ICAP. None of the fixed leaks are caused by the patch proposed for this bug though. The leaks were there before the patch. Thus, I opened a new bug report to accumulate those leak fixes. Please see bug #2964 for the current state. 

Let's keep this bug dedicated to the "pure virtual method called" termination.
Comment 63 Georg 2010-06-29 08:45:46 UTC
(In reply to comment #58)
> Meanwhile, you may want to post
> more details about the Squid exit in your environment (error messages,
> backtrace, etc.).

I tried to reproduce these errors today myself and failed, so I assume
that the bugs are gone now. Unfortunately I have installed some more
patches in the meantime so I can't tell if its from the patches or from
some error in my test environment. 
Anyway, thanks for taking the time.
Comment 64 Stefan Fritsch 2010-07-09 09:43:39 UTC
Created attachment 2217 [details]
Updated patch for 3.1.4

The last patch for 3.1.4 was missing three replacements asyncCall -> CommCallBack in client_side.cc. Therefore it didn't compile under openbsd with gcc 3.x (don't know why it compiled under Linux with gcc 4.x, though).
Comment 65 Henrik Nordstrom 2010-08-03 13:33:28 UTC
Is Bug #2774 a dup of this, or is it two different pure virtual methods?
Comment 66 Alex Rousskov 2010-08-03 13:48:14 UTC
(In reply to comment #65)
> Is Bug #2774 a dup of this, or is it two different pure virtual methods?

We do not know, but it probably is a dupe. I asked for more info on bug #2774 but did not get a response.
Comment 67 Alex Rousskov 2010-08-23 17:21:32 UTC
A fix, based on the patches posted here, is now committed to trunk as r10779.

Stefan, do you mind if I add you to the CONTRIBUTORS file? Do you want your email there?
Comment 68 Amos Jeffries 2010-08-23 17:30:34 UTC
Alex, his details will already be making their way in from earlier bug patch contributions. That will happen when the automatic http://www.squid-cache.org/Versions/v3/3.HEAD/changesets/AUTHORS file next gets synced with CONTRIBUTORS.
Comment 69 Alex Rousskov 2010-08-23 21:36:50 UTC
(In reply to comment #68)
> Alex, his details will already be making their way in from earlier bug patch
> contributions. That will happen when the automatic
> http://www.squid-cache.org/Versions/v3/3.HEAD/changesets/AUTHORS file next gets
> synced with CONTRIBUTORS.

Sounds good. Since Stefan is not the author of the code committed for this bug, I did not realize his name will be automatically listed.
Comment 70 Amos Jeffries 2010-08-24 20:08:43 UTC
Bah. Wrong Stefan, Sorry for the noise.
Comment 71 Stefan Fritsch 2010-08-25 03:49:55 UTC
(In reply to comment #67)
> A fix, based on the patches posted here, is now committed to trunk as r10779.

Thanks. Please add a note here when it gets backported to 3.1.x.

> Stefan, do you mind if I add you to the CONTRIBUTORS file? Do you want your
> email there?

I certainly don't mind being listed there including my mail address. FWIW my debugging activities for squid are sponsored by GeNUA mbH.
Comment 72 Amos Jeffries 2010-08-27 10:17:36 UTC
Applied to 3.2 and 3.1.

NP: if someone can confirm the next 3.1 snapshot works properly with it in there will be an early 3.1.8.
Comment 73 Amos Jeffries 2010-09-03 11:13:17 UTC
*** Bug 2767 has been marked as a duplicate of this bug. ***
Comment 74 Amos Jeffries 2010-09-08 19:16:22 UTC
*** Bug 3041 has been marked as a duplicate of this bug. ***
Comment 75 Amos Jeffries 2012-09-21 23:09:58 UTC
*** Bug 2774 has been marked as a duplicate of this bug. ***