Baselines

FIFO: first_in_first_out

Behavior
The agent acts on the longest-waiting passenger. In a tie, it chooses the lowest-index passenger (task focusing).
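
A minimal sketch of this selection rule, assuming a hypothetical `Passenger` record with `index` and `wait_time` fields (the environment's actual passenger representation may differ):

```python
# Hypothetical sketch: Passenger and its fields are stand-ins for the
# environment's actual passenger representation.
from dataclasses import dataclass

@dataclass
class Passenger:
    index: int      # position in the environment's passenger list
    wait_time: int  # steps since the passenger entered the environment

def fifo_choice(passengers: list[Passenger]) -> Passenger:
    """Pick the longest-waiting passenger; break ties by lowest index."""
    return max(passengers, key=lambda p: (p.wait_time, -p.index))
```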

Reasoning
This is the same policy used by Uber in their “FIFO Zones” for airport ride queuing. This works well when passengers enter the environment in similar locations and car capacity is low.

Greedy (task focused): greedy_Tfocus

Behavior
For each present passenger, calculate the distance required to pick them up and drop them off. Then accept, pick up, and drop off the passenger with the minimum distance, considering only one passenger at a time.
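
A sketch of this rule, assuming a grid world with Manhattan distances and hypothetical `Passenger` fields `pickup` and `dropoff` (assumptions, not the environment's actual API):

```python
# Hypothetical sketch of greedy_Tfocus on a grid world; the Passenger
# fields and the Manhattan metric are assumptions.
from dataclasses import dataclass

Pos = tuple[int, int]

@dataclass
class Passenger:
    pickup: Pos   # where the passenger is waiting
    dropoff: Pos  # where they want to go

def manhattan(a: Pos, b: Pos) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def trip_distance(agent_pos: Pos, p: Passenger) -> int:
    """Distance to drive to this passenger's pickup, then to their dropoff."""
    return manhattan(agent_pos, p.pickup) + manhattan(p.pickup, p.dropoff)

def greedy_tfocus_choice(agent_pos: Pos, passengers: list[Passenger]) -> Passenger:
    """Commit to the single passenger whose full trip is shortest."""
    return min(passengers, key=lambda p: trip_distance(agent_pos, p))
```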

Reasoning
This behavior is similar to Greedy Matching for taxi dispatch: whenever a passenger is available (the FIFO in that article), we choose the lowest ETA. This is a decent local optimum when passengers enter and exit the environment at around the same times, but it does not allow pooling.

Greedy (task global): greedy_Tglobal

Behavior
Calculate the distance required to pick up and drop off each present passenger. Act on the passenger with the shortest distance, re-evaluating the choice at every step. This policy can pool.
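
A sketch of the difference from greedy_Tfocus, under the same grid-world assumptions: the rule is re-applied every step, and onboard passengers only need a dropoff leg, so the cheapest target can change mid-trip.

```python
# Hypothetical sketch of greedy_Tglobal under the same grid-world
# assumptions as the greedy_Tfocus sketch above.
from dataclasses import dataclass

Pos = tuple[int, int]

@dataclass
class Passenger:
    pickup: Pos
    dropoff: Pos
    onboard: bool = False  # True once the passenger has been picked up

def manhattan(a: Pos, b: Pos) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def remaining_distance(agent_pos: Pos, p: Passenger) -> int:
    if p.onboard:  # already in the car: only the dropoff leg remains
        return manhattan(agent_pos, p.dropoff)
    return manhattan(agent_pos, p.pickup) + manhattan(p.pickup, p.dropoff)

def greedy_tglobal_choice(agent_pos: Pos, passengers: list[Passenger]) -> Passenger:
    """Re-chosen every step: act toward whichever passenger is currently cheapest."""
    return min(passengers, key=lambda p: remaining_distance(agent_pos, p))
```

Because `remaining_distance` changes as the agent moves, the argmin can flip between passengers, which produces the task-switching behavior described below.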

Reasoning
This policy is a good example of what can happen with an inefficient pooling policy. Its ability to successfully deliver pooled passengers depends on their relative distances, and it often switches between tasks rapidly without completing many of them.

noop

Behavior
The agent takes no action in all states, effectively leaving the environment unchanged where possible. If the environment naturally evolves regardless of the agent’s actions, the no-op policy simply observes without intervention.
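
A minimal sketch, assuming the action space reserves an explicit "do nothing" action (the index used here is a hypothetical placeholder):

```python
# Hypothetical: assumes action 0 is the environment's explicit no-op.
NOOP_ACTION = 0

def noop_policy(observation):
    """Ignore the observation and do nothing."""
    return NOOP_ACTION
```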

Reasoning
The no-op policy serves as a baseline for understanding the impact of inaction in the environment. It highlights the natural dynamics of the environment without any agent interference, providing a benchmark to compare active policies. This policy is particularly useful in identifying whether external factors (e.g., environmental dynamics or other agents) play a significant role in achieving rewards or whether deliberate actions are necessary for success.

random

Behavior
The agent selects actions uniformly at random from the available action space, with no regard for the state, goals, or consequences of the actions.
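
A minimal sketch, assuming a Gymnasium-style action space that exposes a `sample()` method:

```python
# Assumes a Gymnasium-style action space; sample() draws uniformly
# from the space.
def random_policy(observation, action_space):
    # The observation is ignored entirely.
    return action_space.sample()
```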

Reasoning
The random policy establishes a baseline for performance in the absence of any learning or strategy. It demonstrates the environment’s inherent difficulty by showing how likely success is when actions are chosen arbitrarily. This helps evaluate the performance improvement of learned or more sophisticated policies over pure chance. It is especially valuable in stochastic environments where outcomes may vary widely even with random actions.