[Vanilla List] continuum server lag cause found

Sat Aug 19 07:10:00 CDT 2000

On Wed, 16 Aug 2000, James Cameron wrote:
> netrekd sees a memory leak.  It was at 28Mb or so today, and restarting
> it solved the lag burst on login of new players.  Karthik and I isolated
> this by noting that the lag burst happened on multiple connections to
> either the player list port (2591) or the standard ship port (2592) but
> not to connections to the metaserver.

What do you mean by multiple connections?  More than one connection request
within some time window?  Does the process memory usage grow by some amount
after each connection, or does there have to be more than one at a time for
the memory to grow?
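
One way to answer the per-connection question would be to sample the resident
size of netrekd before and after each test connection.  The following is only a
rough sketch, assuming a Linux /proc filesystem that reports a "VmRSS:" line in
/proc/<pid>/status; the function names parse_vmrss_kb and read_vmrss_kb are
made up for illustration and are not part of the netrek source.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Parse the "VmRSS:" line out of the text of /proc/<pid>/status.
 * Returns the resident set size in kB, or -1 if no such line is present. */
long parse_vmrss_kb(const char *status_text)
{
    const char *p = status_text;

    while (p != NULL && *p != '\0') {
        long kb;

        if (strncmp(p, "VmRSS:", 6) == 0 && sscanf(p + 6, "%ld", &kb) == 1)
            return kb;
        p = strchr(p, '\n');        /* advance to the start of the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Read /proc/<pid>/status (Linux-specific) and return VmRSS in kB,
 * or -1 if the file cannot be read or has no VmRSS line. */
long read_vmrss_kb(pid_t pid)
{
    char path[64], buf[4096];
    size_t n;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_vmrss_kb(buf);
}
```

Calling read_vmrss_kb() with the netrekd pid before and after a single
connection, and again around a burst of ten, would show whether the growth is
per connection or only appears when several arrive together.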

> Watching "top" on continuum when Karthik sent ten connection requests
> showed netrekd at 99% CPU utilisation.  Restarting netrekd removed the
> problem altogether; it never appeared in "top".

At this point, was the netrekd process at 28MB, or still small?  What I mean
is, does the high CPU usage only happen once the memory leak has caused the
process to get huge?

> Karthik uses the same netrekd on his servers, and it is not leaking.
> His servers are Intel, and Continuum is Sparc.
> 
> There is no significant malloc/free heap consumption within the
> netrekd code we are using.  So the problem is probably related to the
> Sparc Linux libraries or kernel.

Possibly.  I've looked over the newstartd code, and it's pretty nasty.  I
already spotted one bug in a select() call, but I doubt that is the problem.

> I have a core dump of netrekd's 28Mb instance, but I don't know how to
> proceed to determine the ownership of virtual memory.

I don't know how to do that easily either.  You could start it up in gdb and look
at some variables.  For instance, you could look at the prog variable and see
how many ports it thinks it has open, etc.