Structured P2P systems successively forward a lookup request through the overlay making greedy choices at each node. In our model we assume a large degree of node heterogeneity where some nodes have more complete routing information than others. We also assume highly variable latencies between nodes in the underlying IP topology. Therefore following a locally optimal router at each node hop may not lead to a globally optimal path through the overlay. Minimizing the complete path lookup cost, for example with respect to hops or latency, thus is analogous to solving a dynamic programming problem.
We adapt Q-learning, a form of reinforcement learning, to a distributed environment using the method of [7]. Analogous to a human exploring a alley-way in search of a shortcut to a particular destination, each node explores its possible routes and learn which routes in the overlay lead to the best results for particular lookups.