Let
be a node in the overlay that is asked to lookup data that
belongs to
. Initially we assume that every node in the network
maintains a set of Q values for all destinations for each node it has a
route to. Let
be
's estimated cost to end a query
destined for
via
. Cost can be hops in the overlay,
physical topology hops or physical latency depending on the particular
optimization. Each node
that receives the lookup query finds its
estimated cost
for sending a query destined to
via neighbor
. With probability
selects a random route to forward the query along,
otherwise it is sent along the estimated best cost route. We describe the
probability of selecting a random route after describing the estimate
update. After
forwards the query to
,
replies to
with its best estimate of the number of hops remaining to forward
the query. Specifically, there is some third node
that has the minimum cost to deliver the query to
.
then updates its estimate based on
's response
as follows:
We note that the reinforcement learning algorithm to find the globally optimal route is an approximation of the ideal shortest paths algorithm, for instance Bellman-Ford. However, network complexity makes a continuous all-pairs shortest path estimation impossible in the face of a rapidly changing network. Instead of synchronously updating all path vectors, costs are updated asynchronously whenever a particular destination looked up.
Within reinforcement learning there is a tradeoff between exploration and
exploitation. A node may continually make random choices in order to
learn more complete information about its routing (exploration), but this
leads to poor overall performance since the node is not using any of the
learned information to its advantage (exploitation). We use simulated
annealing to balance exploration and exploitation. The algorithm begins
with certain probability (
) of selecting a random route.
Each time the node updates its Q values as described above, if the update
is unchanged within some
value (
), the
probability of choosing a random route in a subsequent iteration is
reduced by
. In this way, the "temperature" cools as the
algorithm fails to find a better route allowing for exploitation. We use
and
in these experiments.
In order to allow the nodes to explore new paths after membership changes,
we reset
when a node discovers that another node to which is has
a route is not responding. In addition, we reset
for node
when node
installs a new route.