next up previous
Next: Q-Learning Experiments Up: Reinforcement Learning Previous: Reinforcement Learning

Algorithm

Let $N_{src}$ be a node in the overlay that is asked to lookup data that belongs to $N_{dst}$. Initially we assume that every node in the network maintains a set of Q values for all destinations for each node it has a route to. Let $Q_{x}(y,dst)$ be $N_{x}$'s estimated cost to end a query destined for $N_{dst}$ via $N_{y}$. Cost can be hops in the overlay, physical topology hops or physical latency depending on the particular optimization. Each node $N_{x}$ that receives the lookup query finds its estimated cost $Q_{x}(y,dst)$ for sending a query destined to $N_{dst}$ via neighbor $N_{y} \in \{routes_{N_{x}}\}$. With probability $P() =
anneal$ $N_{x}$ selects a random route to forward the query along, otherwise it is sent along the estimated best cost route. We describe the probability of selecting a random route after describing the estimate update. After $N_{x}$ forwards the query to $N_{y}$, $N_{y}$ replies to $N_{x}$ with its best estimate of the number of hops remaining to forward the query. Specifically, there is some third node $N_{z} \in
\{routes_{N_{y}}\}$ that has the minimum cost to deliver the query to $N_{dst}$. $N_{x}$ then updates its estimate based on $N_{y}$'s response as follows:

\begin{displaymath}
Q_{x}(y,dst) = Q_{x}(y,dst) + \eta( Q_{y}(z,dst) + cost - Q_{x}(y,dst))
\end{displaymath}

We note that the reinforcement learning algorithm to find the globally optimal route is an approximation of the ideal shortest paths algorithm, for instance Bellman-Ford. However, network complexity makes a continuous all-pairs shortest path estimation impossible in the face of a rapidly changing network. Instead of synchronously updating all path vectors, costs are updated asynchronously whenever a particular destination looked up.

Within reinforcement learning there is a tradeoff between exploration and exploitation. A node may continually make random choices in order to learn more complete information about its routing (exploration), but this leads to poor overall performance since the node is not using any of the learned information to its advantage (exploitation). We use simulated annealing to balance exploration and exploitation. The algorithm begins with certain probability ($anneal = 1$) of selecting a random route. Each time the node updates its Q values as described above, if the update is unchanged within some $\epsilon$ value ( $\epsilon \ll 1$), the probability of choosing a random route in a subsequent iteration is reduced by $\delta$. In this way, the "temperature" cools as the algorithm fails to find a better route allowing for exploitation. We use $\epsilon = 0.001$ and $\delta = 0.005$ in these experiments.

In order to allow the nodes to explore new paths after membership changes, we reset $anneal$ when a node discovers that another node to which is has a route is not responding. In addition, we reset $anneal$ for node $N_{x}$ when node $N_{y} \in \{routes_{N_{x}}\}$ installs a new route.


next up previous
Next: Q-Learning Experiments Up: Reinforcement Learning Previous: Reinforcement Learning
Robert Beverly 2003-12-03