Gureckis & Love (2015)
Gureckis & Love (2015) Pt.2
Niv & Schoenbaum (2008)
Daw et al. (2006)
Key Terms
100

What is reinforcement learning closely related to? (2 things)

Instrumental conditioning and operant conditioning

100

What is a prediction error?

It’s the difference between what you expected to happen and what actually happened (difference helps you learn)
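The idea above can be sketched as a simple error-driven value update (the function name and learning rate are illustrative, not from the flashcards):

```python
# Error-driven learning: nudge the expectation toward the outcome
# by a fraction (the learning rate) of the prediction error.

def update_value(expected, actual, learning_rate=0.1):
    prediction_error = actual - expected  # what happened minus what was expected
    return expected + learning_rate * prediction_error

v = 0.0
for _ in range(100):
    v = update_value(v, 1.0)  # a reward of 1 arrives repeatedly
# v approaches 1.0, so the prediction error shrinks toward 0
```

Once the expectation matches the outcome, the error is near zero and learning stops, which is exactly why surprising outcomes drive learning.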

100

What brain areas were linked to exploiting known rewards?

Reward related areas (striatum, prefrontal cortex) were more active when people chose options they already knew were good

100

What is the study of how animals and machines adapt their behavior to maximize reward in their environment?

Reinforcement Learning (RL)

200

What was Thorndike’s experiment and its 3 key features?

A cat in a puzzle box is trying to escape to get to food. 3 key features are:

1. exploration (the cat finding its way out) 

2. exploitation (using previous actions for quicker escape)

3. the goal is complex

200

How does temporal difference learning explain learning over time?

It says we update our expectations step by step as events happen, not just at the end, so we can start predicting rewards earlier
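A minimal tabular TD(0) sketch of this step-by-step updating, assuming an illustrative 3-state chain (s0 → s1 → s2, with a reward of 1 at the end):

```python
# Tabular TD(0): each state's value is updated from the reward received
# plus the estimated value of the NEXT state, so reward predictions
# propagate backward to earlier states over repeated episodes.

alpha, gamma = 0.1, 1.0        # learning rate and discount (illustrative)
V = [0.0, 0.0, 0.0]            # value estimates for s0, s1, terminal s2

for _ in range(200):           # repeated episodes through the chain
    # step s0 -> s1: no reward yet, but s1's value backs up to s0
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # step s1 -> s2: the reward of 1 arrives at the end
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
```

After enough episodes, s0 predicts the reward even though the reward only ever arrives two steps later — the "predicting rewards earlier" in the answer above.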

200

What brain areas were linked to exploration?

Regions involved in planning and control were more active when people tried new/uncertain options

200

What is a theoretical framework for understanding how artificial and natural agents make decisions, offering insights into the overall function/objective of adaptive decision making?

Computational Reinforcement Learning (CRL) 

300

What does the next state (a distinct situation the agent might be in) in transition probabilities depend on AND what does it not depend on?

Depends on only the current state and action. Does not depend on the full history of actions that led up to that point
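This Markov property can be sketched as a transition table keyed only by (state, action) — the entries here are illustrative, not from the flashcards:

```python
# Markov transition probabilities: the distribution over the next state
# is looked up from the current state and action alone; nothing about
# the history of earlier actions enters the key.

P = {
    ("s1", "go"):   {"s2": 0.8, "s1": 0.2},
    ("s2", "go"):   {"s3": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}

# each row is a probability distribution over next states
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```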

300

What is the ε-greedy algorithm and what is one downside?

It is an explore/exploit strategy that, with probability 1−ε, chooses the option with the highest value Q(s,a), and with probability ε chooses randomly from the available alternatives. 

DOWNSIDE: it continues to explore with probability ε even after the agent has gained experience in the environment
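The ε-greedy rule can be sketched in a few lines (the Q values and ε here are illustrative assumptions):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick the best."""
    if random.random() < epsilon:
        return random.randrange(len(Q))               # explore: random action
    return max(range(len(Q)), key=Q.__getitem__)      # exploit: highest Q

Q = [0.2, 0.8, 0.5]
choices = [epsilon_greedy(Q, epsilon=0.1) for _ in range(1000)]
# most choices are action 1 (the highest Q); roughly epsilon of them are random
```

Note the downside from the card is visible here: ε never decreases, so the agent keeps making random choices no matter how much it has learned.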

300

How are dopamine neurons related to prediction errors?

They fire more when rewards are better than expected, don’t change when rewards are expected, and fire less when rewards are missing (like a prediction error signal)

300

What does this study show about how the brain makes decisions?

It suggests the brain may use different systems (one focused on reward habits and the other helps us take a step back and try new strategies)

300

What is the softmax rule?

A choice rule that guides exploration by expected value: actions are chosen probabilistically on the basis of their relative expected values, so higher-valued actions are chosen more often but not always
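A minimal softmax (Boltzmann) choice sketch; the temperature parameter tau is an illustrative assumption that controls how strongly choice favors the higher-valued actions:

```python
import math

def softmax_probs(Q, tau=0.25):
    """Turn action values into choice probabilities based on relative value."""
    exps = [math.exp(q / tau) for q in Q]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_probs([0.2, 0.8, 0.5])
# the highest-valued action gets the largest share of probability,
# but every action retains some chance of being chosen
```

Unlike ε-greedy, exploration here is graded: nearly-as-good options are tried more often than clearly bad ones.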

400

What is the optimal policy?

The policy that returns the most reward over the long term, across all possible states of the environment

400

Why do the authors say dopamine is not the whole story?

Because some brain signals that look like prediction errors may actually reflect attention or other processes, meaning learning is more complex

400

Why is the exploration exploitation problem important for decision making?

We constantly have to decide whether to stick with something we know works (exploit) or try something new that might be better (explore). 

Helps us adapt to changing situations

500

Why is Q-learning considered an “off-policy” learning algorithm?

The prediction error for the current choice assumes the agent will take the best action in the next state, rather than the action its own current policy would actually choose
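A single off-policy Q-learning update makes the point concrete (states, actions, and parameters here are illustrative assumptions):

```python
# One Q-learning backup: the target uses max over next-state actions,
# regardless of which action the (possibly exploratory) policy will
# actually take next -- that max is what makes it "off-policy".

alpha, gamma = 0.5, 0.9
Q = {("s1", "left"): 0.0, ("s2", "left"): 0.2, ("s2", "right"): 1.0}

s, a, r, s_next = "s1", "left", 0.0, "s2"
best_next = max(Q[(s_next, b)] for b in ("left", "right"))  # off-policy max
Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
# update target used Q(s2, right) = 1.0 even if an epsilon-greedy
# policy might actually pick "left" in s2
```

An on-policy method like SARSA would instead plug in the value of the action actually taken next.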

500

What is one limitation of reinforcement learning as a theory of the brain?

It explains reward-based learning well, but doesn’t fully explain things like planning, memory, or learning relationships between events

500

How did the researchers study this problem using computational models?

They used reinforcement learning models to track how people learned from rewards and predicted which choices were exploring vs exploiting