Reggie not answering on lookup

Discussion:

Patrick Wright

2009-02-11 11:49:42 UTC

Hi

First of all--should I be posting questions regarding Jini 2.1 here or
on the River list?

We have a problem that's shown up intermittently on our production
systems. We have two LUS (Reggie) instances configured with the same
group name, a number of service interfaces registered (34 at last
count), and about two dozen Jini clients. Via JMX we can monitor the
Jini lookup for any single client, and we have a "service monitor"
Jini client also available via JMX (which tracks services over time,
pings them regularly for reachability using a special interface,
etc.). We also have a command line tool which can locate LUS instances
either by group name or via unicast and display any registrars
discovered and the service interfaces registered with them.

What has happened on occasion (including today) is that, on starting a
given service, the LUS stops responding to any lookup requests. When
the service is shut down again, the LUS will respond and all previous
registrations are still there (as long as the lease hasn't expired).
During this "blackout" period, any clients that already have a handle
on a service operate without problems (they all rely on a
LookupCache), but any attempts to reach the LUS will fail. In our
command-line LUS viewer, the lookup is initiated simply by constructing
new LookupDiscoveryManager(groupNames, new LookupLocator[]{}, this);

where "this" is a DiscoveryListener. Our normal client-lookup stack is
more complex but also relies on multicast discovery (group name +
interface).

What is also unusual is that in today's incident, we have two copies
of the same service installed on two different online hosts (for
clustering). We can start instance A without problem, but starting
instance B will cause the LUS to stop responding. Shutting down
instance B will clear the problem up. The code in this case is the
same; they are just located on different hosts. Note that otherwise,
all other (34) service interfaces have been available and discoverable
without problem.

Unfortunately, we have not been able to reproduce this in dev or
staging environments, even with the exact same versions of the service
implementations. Trying to track this down while online is very risky:
although we can reproduce the problem, if the LUS instances remain
unreachable, at some point clients performing a new lookup will fail
to find the services they need.

Note that outside of these "poison-pill" services (about which we see
nothing unusual) our Jini infrastructure has been stable for some time
now.

We'd appreciate any help in trying to track this down.

Thanks
Patrick

--------------------------------------------------------------------------
Getting Started: http://www.jini.org/wiki/Category:Getting_Started
Community Web Site: http://jini.org
jini-users Archive: http://archives.java.sun.com/archives/jini-users.html
Unsubscribing: email "signoff JINI-USERS" to ***@java.sun.com

Patrick Wright

2009-02-11 12:59:00 UTC

Follow-up (sometimes it does indeed help to describe this to someone
in writing): we now suspect a problem with the codebase server for the
"poison pill" service instance. What we just found in testing was that
from some of our hosts, the codebase server is not reachable, likely
due to a misconfigured firewall. We suspect that would cause any
resolution of service instances to block as the Jini clients tried to
download that service's downloadable jars.
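
A quick way to check this from each client host is to try opening a TCP
connection to the codebase server with a short timeout. The following is
only a minimal sketch: the class name CodebaseProbe, the method
isReachable, and the 3-second timeout are made up for illustration and
are not part of Jini; host and port would come from the service's
codebase URL (typically set through java.rmi.server.codebase).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP connection to a codebase server can be opened. */
final class CodebaseProbe {

    /**
     * Returns true if a TCP connection to host:port succeeds within timeoutMillis.
     * A firewall that silently drops packets shows up as a timeout, i.e. false.
     */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unknown host, and connect timeout.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port of the codebase server, supplied by the caller.
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " reachable: "
                + isReachable(host, port, 3000));
    }
}
```

Run from a host behind the suspect firewall, this should report false
once the timeout expires, while a host with a clear path to the codebase
server should report true.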

I will post back to the list (for posterity) what we find out.

Thanks
Patrick

Gregg Wonderly

2009-02-11 14:31:29 UTC

I use my vhttp: protocol handler to cache downloadable codebase jars
locally, to speed things up when there are lots of services to find and/or
limited bandwidth. One of the features of this protocol handler is that it
logs any failure to download something. It is amazing how helpful this kind
of logging is for debugging codebase server issues. It is the first sign of
an incorrect hostname on a new server, etc.
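
The logging idea can be approximated without a custom protocol handler.
The sketch below is not the vhttp: handler and does no caching; it is a
hypothetical helper (LoggingCodebaseFetcher, fetch, and the logger name
codebase.fetch are illustrative names) that downloads a codebase URL with
explicit timeouts and logs any failure through java.util.logging.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: downloads a codebase resource and logs any failure. */
final class LoggingCodebaseFetcher {

    static final Logger LOG = Logger.getLogger("codebase.fetch");

    /**
     * Reads the resource at url to the end and returns the number of bytes read.
     * On any I/O failure the error is logged at WARNING level and -1 is returned.
     */
    static long fetch(URL url, int timeoutMillis) {
        try {
            URLConnection conn = url.openConnection();
            // Explicit timeouts so an unreachable server cannot block forever.
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
                return total;
            }
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Codebase download failed: " + url, e);
            return -1;
        }
    }
}
```

Calling fetch against each codebase jar URL from a client host would leave
a WARNING entry for every jar that cannot be downloaded from that host.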

Gregg Wonderly
