What is Bayes' rule?
P(A|B) = P(B|A) P(A) / P(B): the posterior equals the likelihood times the prior, divided by the evidence.
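As a minimal numeric sketch of applying the rule (the disease-test numbers below are made up for illustration):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical screening example: 1% prevalence,
# 95% sensitivity, 10% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.10

# Total probability of a positive test: the evidence P(B).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # well under 10%, despite the "95% accurate" test
```

The low posterior shows why the prior (prevalence) matters as much as the likelihood.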
What is the goal of a reinforcement learning agent?
To maximise cumulative reward.
What are the three components of a feed-forward network?
Pre-synaptic neurons, post-synaptic neurons, and the synaptic weights.
What is the definition of causality?
Given two events a and b, a causes b if changing a to a' would have changed the probability of b.
Why do we apply zero-padding?
To keep the input and output shapes the same, e.g. so a convolution does not shrink the feature map.
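A sketch of "same" zero-padding for a 1-D convolution (toy input and kernel, assuming an odd kernel size):

```python
def conv1d_same(x, w):
    # Zero-pad so a "valid" convolution returns the input length.
    k = len(w)
    pad = (k - 1) // 2  # assumes an odd kernel size
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.25, 0.5, 0.25]  # small smoothing kernel
y = conv1d_same(x, w)
print(len(x), len(y))  # same length in and out
```

Without the padding, a valid convolution would return only len(x) - k + 1 outputs.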
What do you get when you multiply multiple different Gaussian distributions?
The product is proportional to another Gaussian (the precisions add). A mixture of Gaussians comes from summing densities, not multiplying them.
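The product of Gaussian densities is proportional to a single Gaussian whose precision (inverse variance) is the sum of the precisions; a quick numeric check with arbitrarily chosen means and variances:

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two Gaussians with different means and variances.
mu1, v1 = 0.0, 1.0
mu2, v2 = 2.0, 0.5

# Closed form for the product: precisions add,
# and the mean is the precision-weighted average.
v = 1.0 / (1.0 / v1 + 1.0 / v2)
mu = v * (mu1 / v1 + mu2 / v2)

# Check: the product divided by the combined Gaussian is the
# same constant at every x, i.e. they differ only by scale.
ratios = [gauss(x, mu1, v1) * gauss(x, mu2, v2) / gauss(x, mu, v)
          for x in (-1.0, 0.0, 1.5, 3.0)]
print(all(abs(r - ratios[0]) < 1e-9 for r in ratios))
```

This is why combining Gaussian prior and Gaussian likelihood in Bayes' rule yields a Gaussian posterior.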
What is the discount factor?
A number gamma in [0, 1] that weights future rewards: with gamma close to one the agent values long-term rewards, with gamma equal to zero it looks only at the immediate reward.
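A toy illustration with a made-up reward sequence where the big reward arrives late:

```python
def discounted_return(rewards, gamma):
    # G = sum over t of gamma**t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]  # large reward only at the last step
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.9))  # the late reward dominates
```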
How can we achieve a dynamical system with fixed points?
By setting the dynamics to zero: the fixed points are the solutions of dx/dt = f(x) = 0.
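A sketch with the toy dynamics dx/dt = x(1 - x), which has fixed points at 0 and 1; the sign of the slope of f at each fixed point also tells us its stability:

```python
def f(x):
    # Toy 1-D dynamics dx/dt = x * (1 - x); fixed points where f(x) == 0.
    return x * (1.0 - x)

def slope(x, h=1e-6):
    # Numerical derivative f'(x): negative slope -> stable fixed point,
    # positive slope -> unstable fixed point.
    return (f(x + h) - f(x - h)) / (2 * h)

for x_star in (0.0, 1.0):
    print(x_star, f(x_star), slope(x_star))
```

Here x = 0 is unstable (positive slope) and x = 1 is stable (negative slope).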
Why do we say correlation is not causation?
Because in large systems correlated variables often have no direct causal link; the correlation can come from confounders or coincidence.
How can we apply regularisation in Deep Learning?
By adding L1 or L2 penalty terms to the loss.
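A minimal sketch of adding penalty terms to a scalar loss (a toy one-parameter linear model, made up for illustration):

```python
def mse(w, xs, ys):
    # Mean squared error of the one-parameter model y = w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def regularised_loss(w, xs, ys, l1=0.0, l2=0.0):
    # L1 adds |w|, L2 adds w**2; both discourage large weights.
    return mse(w, xs, ys) + l1 * abs(w) + l2 * w ** 2

xs, ys = [1.0, 2.0], [2.0, 4.0]
print(regularised_loss(2.0, xs, ys))          # unregularised fit is perfect
print(regularised_loss(2.0, xs, ys, l2=0.1))  # penalty now charges for |w|
```

Minimising the penalised loss trades data fit against weight size, which is what combats overfitting.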
What do we assume when we compute the log likelihood?
That the samples are iid, so the joint density factorises and the log likelihood becomes a sum.
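A sketch for iid Gaussian samples (the data points are made up); because the joint factorises, the log likelihood is a plain sum over samples:

```python
import math

def log_lik_normal(samples, mu, var):
    # Under the iid assumption the joint density factorises,
    # so the log likelihood is a sum of per-sample log densities.
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in samples)

samples = [0.1, -0.3, 0.5]
# For fixed variance, the sample mean (0.1 here) maximises the likelihood.
print(log_lik_normal(samples, 0.1, 1.0))
print(log_lik_normal(samples, 1.0, 1.0))  # worse: mu far from the data
```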
What is the exploration and exploitation trade-off?
The agent must balance exploiting actions it already knows to be good against exploring new actions; if it never tries new actions, it may miss better ones and not perform well.
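One common way to manage the trade-off is epsilon-greedy action selection; a sketch with made-up action values:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # Exploit the best-known action most of the time,
    # but explore a uniformly random action with probability epsilon.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q = [0.1, 0.9, 0.4]  # action 1 currently looks best
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly, but not always, the greedy action
</n```

With epsilon = 0.1 the agent still samples the other actions occasionally, so it can discover if its value estimates are wrong.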
What is the difference between unstable and stable fixed points?
At a stable fixed point the dynamics f(x) has negative slope, so perturbations decay; at an unstable fixed point the slope is positive, so perturbations grow.
What does causal inference mean?
Estimating the causal effect of X on Y.
What is a fully-connected layer?
A layer in which every unit in the current layer connects to every unit in the next layer.
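A minimal sketch of such a layer as a matrix-vector product, with one weight per (input unit, output unit) pair; the weights and biases below are arbitrary:

```python
def dense(x, weights, biases):
    # Every input unit connects to every output unit: one weight per pair.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

x = [1.0, 2.0]                              # 2 input units
W = [[0.5, -1.0], [1.0, 1.0], [0.0, 2.0]]   # 3 output units -> 3x2 weights
b = [0.0, 0.1, -0.5]
y = dense(x, W, b)
print(y)
```

A 2-to-3 layer therefore has 2 * 3 weights plus 3 biases.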
What is meant by marginalisation?
Summing (or integrating) a joint distribution over all the other variables, to obtain the distribution of the variables we keep.
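A sketch with a made-up two-variable joint table, marginalising one variable out:

```python
# Joint distribution P(weather, sprinkler) as a flat table (made-up numbers).
joint = {
    ("rain", "on"): 0.05, ("rain", "off"): 0.25,
    ("dry", "on"): 0.40, ("dry", "off"): 0.30,
}

# Marginalise out the sprinkler: sum over its values.
p_weather = {}
for (weather, sprinkler), p in joint.items():
    p_weather[weather] = p_weather.get(weather, 0.0) + p

print(p_weather)  # rain ~ 0.3, dry ~ 0.7
```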
What is the difference between Monte Carlo methods and Temporal Difference learning?
Monte Carlo methods sample complete returns from full episodes; TD learning bootstraps, updating estimates from other estimates after a single step.
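A sketch of one update of each kind for a single three-step episode (gamma = 1 for simplicity; rewards and the learning rate are made up):

```python
# One episode from state 0: rewards received at each step.
rewards = [1.0, 0.0, 2.0]
V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}  # value estimates; state 3 is terminal
alpha = 0.5                            # learning rate

# Monte Carlo: wait for the complete sampled return from state 0.
G = sum(rewards)
V_mc = V[0] + alpha * (G - V[0])

# TD(0): bootstrap, using one real reward plus the current
# estimate of the next state's value.
td_target = rewards[0] + V[1]
V_td = V[0] + alpha * (td_target - V[0])

print(V_mc, V_td)  # MC uses the full return; TD leans on an estimate
```

TD can update before the episode ends, at the cost of relying on possibly wrong estimates.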
What is the lifetime of a protein?
On the order of days.
What is an instrumental variable?
A variable that influences the treatment but affects the outcome only through the treatment.
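A sketch using the standard ratio (Wald) estimator cov(Z, Y) / cov(Z, X) on simulated data with a hidden confounder; all numbers and the data-generating process are synthetic:

```python
import random

rng = random.Random(1)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]            # instrument
u = [rng.gauss(0, 1) for _ in range(n)]            # unobserved confounder
x = [zi + ui for zi, ui in zip(z, u)]              # treatment: driven by Z and U
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # true causal effect of X is 2

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

naive = cov(x, y) / cov(x, x)  # regression slope: biased by the confounder
iv = cov(z, y) / cov(z, x)     # instrument recovers roughly the true effect
print(naive, iv)
```

The naive slope is pulled away from 2 by U, while the instrument, which touches Y only through X, is not.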
How can I smooth the landscape of the loss function?
By using skip connections.
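A minimal sketch of a residual (skip) connection, y = x + f(x):

```python
def residual_block(x, f):
    # Skip connection: output = input + learned transformation.
    return [xi + fi for xi, fi in zip(x, f(x))]

# Even if the learned branch outputs zeros, information passes through
# unchanged, and the identity path contributes a gradient of exactly 1.
zero_branch = lambda x: [0.0] * len(x)
print(residual_block([1.0, -2.0, 3.0], zero_branch))  # [1.0, -2.0, 3.0]
```

The identity path gives the gradient a direct route through the network, which is one reason deep residual networks have smoother loss landscapes.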
What is utility equivalent to?
Negative cost (utility = -cost).
What does the reward prediction error correspond to?
The activity of dopamine neurons.
What is the lifetime of a neuron?
100 years
Does X cause Y?
Why do we use ReLU instead of sigmoid activation functions?
To avoid vanishing gradients: the sigmoid saturates and its gradient is at most 0.25, while ReLU passes gradients through unchanged for positive inputs.
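A quick numeric comparison of the two gradients chained through many layers (the depth and inputs are arbitrary):

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # at most 0.25 (at x = 0), tiny for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 20
# Chained through many layers, sigmoid gradients shrink multiplicatively,
# even in the best case of 0.25 per layer...
sig_chain = 0.25 ** depth
# ...while ReLU passes the gradient through unchanged on the active path.
relu_chain = relu_grad(1.0) ** depth
print(sig_chain, relu_chain)
```

After 20 layers the best-case sigmoid gradient is below 1e-12, which is the vanishing-gradient problem in miniature.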