Discussion:
Trying to track down intermittent classserver problems
Patrick Wright
2008-07-25 09:45:23 UTC
Permalink
Hi

We had a problem with one of our Jini clients yesterday that and I'd
appreciate some insight into how we can track this down/debug this
further if it happens again.

In this setup, we have two LUS instances running on server1 and
server2, with the same group name, registering a total of 22 services.
Each LUS has an associated codebase server
(com.sun.jini.tool.ClassServer) running on the same machine. On our
client machine clientmachine1 we were trying to launch one of our Jini
clients which we'd been forced to move off its normal host, which was
down (e.g. we hadn't had this machine in production since a rebuild).

On clientmachine1 we already had one other Jini client, a webserver,
launched. We have JMX monitoring for our Jini lookup within each
server, and we observed that within the webserver the two LUS were
discovered, and all remote services were reachable (each service
supports a sort of "ping()" method for diagnostic/reachability
testing).
From the command line we also verified that we could perform both
multicast discovery of both LUSs via a diagnostic tool we have, which
reports on all services registered with each LUS it finds. Using
another CLI tool, we also contacted the LUS via unicast and issued one
of our remote ping() calls to each of the 22 services. All of this
from clientmachine1, so at a basic level it seems our network was OK.

However, in trying to start our second client process on clientmachine
one, we got the following exception, repeated
INFO | jvm 1 | 2008/07/24 13:40:57 | Jul 24, 2008 1:40:56 PM
net.jini.discovery.LookupDiscovery$UnicastDiscoveryTask run
INFO | jvm 1 | 2008/07/24 13:40:57 | INFO: exception occurred
during unicast discovery to server1:58368 with constraints
InvocationConstraints[reqs: {}, prefs: {}]
INFO | jvm 1 | 2008/07/24 13:40:57 |
java.lang.ClassNotFoundException:
com.sun.jini.reggie.ConstrainableRegistrarProxy
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.net.URLClassLoader$1.run(URLClassLoader.java:200)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.security.AccessController.doPrivileged(Native Method)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.net.URLClassLoader.findClass(URLClassLoader.java:188)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.lang.ClassLoader.loadClass(ClassLoader.java:251)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.lang.Class.forName0(Native Method)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.lang.Class.forName(Class.java:247)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
sun.rmi.server.LoaderHandler.loadClass(LoaderHandler.java:434)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
sun.rmi.server.LoaderHandler.loadClass(LoaderHandler.java:165)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.rmi.server.RMIClassLoader$2.loadClass(RMIClassLoader.java:620)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.rmi.server.RMIClassLoader.loadClass(RMIClassLoader.java:247)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.loader.ClassLoading.loadClass(ClassLoading.java:138)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.io.MarshalInputStream.resolveClass(MarshalInputStream.java:296)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1575)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1732)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.io.MarshalledInstance.get(MarshalledInstance.java:358)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
com.sun.jini.discovery.DiscoveryV1.doUnicastDiscovery(DiscoveryV1.java:397)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.discovery.LookupDiscovery$13.run(LookupDiscovery.java:3327)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
java.security.AccessController.doPrivileged(Native Method)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.discovery.LookupDiscovery.doUnicastDiscovery(LookupDiscovery.java:3324)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.discovery.LookupDiscovery.doUnicastDiscovery(LookupDiscovery.java:3355)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.discovery.LookupDiscovery.access$3900(LookupDiscovery.java:696)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
net.jini.discovery.LookupDiscovery$UnicastDiscoveryTask.run(LookupDiscovery.java:1744)
INFO | jvm 1 | 2008/07/24 13:40:57 | at
com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)

The exception reported problems with, alternately, the LUS on server1
and on server2.

These connectivity problems showed up in the first minute of startup,
and continued for about 2 minutes, then seem to repeat in clusters
every 1 - 1.5 minutes thereafter. The problems continued throughout
the afternoon.

We checked the logs on both the LUS and the codebase server. The LUS
logs had no errors, but the codebase server
(com.sun.jini.tool.ClassServer, one instance started along with each
LUS) had these exception
INFO | jvm 1 | 2008/07/25 00:02:33 | reggie-dl.jar requested from
clientmachine1:60346
INFO | jvm 1 | 2008/07/25 00:02:33 | java.net.SocketException:
Connection reset
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.io.DataOutputStream.write(DataOutputStream.java:90)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.io.FilterOutputStream.write(FilterOutputStream.java:80)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
com.sun.jini.tool.ClassServer$Task.run(ClassServer.java:681)
INFO | jvm 1 | 2008/07/25 00:02:33 | jsk-dl.jar requested from
clientmachine1:60347
INFO | jvm 1 | 2008/07/25 00:02:33 | java.net.SocketException: Broken pipe
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.net.SocketOutputStream.socketWrite0(Native Method)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.io.DataOutputStream.write(DataOutputStream.java:90)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
java.io.FilterOutputStream.write(FilterOutputStream.java:80)
INFO | jvm 1 | 2008/07/25 00:02:33 | at
com.sun.jini.tool.ClassServer$Task.run(ClassServer.java:681)

This was also repeated regularly for both reggie-dl.jar and
jsk-dl.jar, but only for client1 (clientmachine1). I was able to
verify for some specific timestamps that download was possible for
clientmachine1, and I think those match the timestamps of the startup
of the first, successful webserver.

We could issue a normal network ping from both server1 to
clientmachine1 and vice-versa without problems. We saw no other signs
of network problems. Again, by the time we were starting our second
Jini client process, our first had already launched and was running
without problems.

We couldn't think of anything further to test.

What I'm interested in is how to track down the cause of this if it
recurs. We were more or less at a loss.


TIA!
Patrick

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Gregg Wonderly
2008-07-25 14:59:42 UTC
Permalink
Post by Patrick Wright
What I'm interested in is how to track down the cause of this if it
recurs. We were more or less at a loss.
What operating system was this on? Windows has interesting behavior problems
when you exceed its builtin limit on number of total socket connections
possible. You could track, with netstat, the number of sockets active and see a
consistent bordering number to when this failure occurs to see if there is an OS
limit related to total sockets active.

Gregg Wonderly

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Patrick Wright
2008-07-25 15:11:29 UTC
Permalink
Hi Gregg
Post by Gregg Wonderly
Post by Patrick Wright
What I'm interested in is how to track down the cause of this if it
recurs. We were more or less at a loss.
What operating system was this on? Windows has interesting behavior
problems when you exceed its builtin limit on number of total socket
connections possible. You could track, with netstat, the number of sockets
active and see a consistent bordering number to when this failure occurs to
see if there is an OS limit related to total sockets active.
Thanks for the reply. The OS on all systems (client, and both
LUS/codebase servers) are home-grown, 32bit, 2.4x kernel Linux build.
When this happens next, I'll track the count of open sockets on both
sides of the communication to see if it points towards the problem.


Thanks,
Patrick

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Dennis Reedy
2008-07-25 15:28:06 UTC
Permalink
Hi Patrick,

One thing I wasnt clear on is whether the machines are multi-homed.
Can you check that?

Dennis
Post by Patrick Wright
Hi Gregg
Post by Gregg Wonderly
Post by Patrick Wright
What I'm interested in is how to track down the cause of this if it
recurs. We were more or less at a loss.
What operating system was this on? Windows has interesting behavior
problems when you exceed its builtin limit on number of total socket
connections possible. You could track, with netstat, the number of sockets
active and see a consistent bordering number to when this failure occurs to
see if there is an OS limit related to total sockets active.
Thanks for the reply. The OS on all systems (client, and both
LUS/codebase servers) are home-grown, 32bit, 2.4x kernel Linux build.
When this happens next, I'll track the count of open sockets on both
sides of the communication to see if it points towards the problem.
Thanks,
Patrick
--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Patrick Wright
2008-07-25 16:17:25 UTC
Permalink
Hi Dennis
Post by Dennis Reedy
Hi Patrick,
One thing I wasnt clear on is whether the machines are multi-homed. Can you
check that?
OK, am waiting for an admin to get back to me--but as I (in a very
limited way) understand it, MH applies between network and its
connection to the outside world (e.g. an ISP)--the problem we're
having is on a local network, between 3 machines on the same network,
switch (and rack, for that matter). Can MH have some impact in this
case?

These are all great mysteries to me, however. I will let you know when
I hear back from our admin.


Thanks
Patrick

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Gregg Wonderly
2008-07-25 15:42:08 UTC
Permalink
Post by Patrick Wright
Hi Gregg
Post by Gregg Wonderly
Post by Patrick Wright
What I'm interested in is how to track down the cause of this if it
recurs. We were more or less at a loss.
What operating system was this on? Windows has interesting behavior
problems when you exceed its builtin limit on number of total socket
connections possible. You could track, with netstat, the number of sockets
active and see a consistent bordering number to when this failure occurs to
see if there is an OS limit related to total sockets active.
Thanks for the reply. The OS on all systems (client, and both
LUS/codebase servers) are home-grown, 32bit, 2.4x kernel Linux build.
When this happens next, I'll track the count of open sockets on both
sides of the communication to see if it points towards the problem.
The issue that Dennis brought up is another consideration. It might be that
there is a DNS related problem, or a connectivity issue related to which
interface the traffic is originating on.

Gregg Wonderly

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Patrick Wright
2008-07-25 16:17:15 UTC
Permalink
Post by Gregg Wonderly
The issue that Dennis brought up is another consideration. It might be that
there is a DNS related problem, or a connectivity issue related to which
interface the traffic is originating on.
Hmm. DNS appeared to be working on both sides (standard ping <server>
worked, for example).

Looking through the ClassServer log files, it appears that the
callback port (to which ClassServer was to respond) was changing with
each request for a downloadable jar. I do recall we checked for a
handful of these ports yesterday, as the exceptions were being logged,
and we did not find them (via netstat) open on the client. However, it
wasn't clear to me how long-lived these ports should be, and in any
case, by the time the exception was produced my guess is the ports
were already closed. A question is whether they were opened, then
closed, never opened at all...? There are no socket-level exceptions
in the client logs; all we have are the CNFEs.

For our client in this case we use default Jini host/port settings
except for the eventListenerExporter (configured via the SDM). The
BasicJeriExporter is configured with a specific IP (bound to a
specific virtual network interface on the client). The port is 0 (e.g.
randomly assigned) for that one endpoint. Other than that, we aren't
specifying particular IP addy or port to listen on.


Patrick

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com
Loading...