[CS285: Deep RL 2023] Lecture 2, Imitation Learning
Robotics/Reinforcement Learning
junni :p 2025. 7. 21. 17:25
Terminology & notation

Markov property (Very Very Important!!!)
- If you know the state S2 and you need to figure out the state S3, then S1 doesn't give you any additional information; that is, S3 is conditionally independent of S1 given S2.
- If you know the state now, then the state in the past does not matter to you because you know everything about the state of the world.
- The present fully summarizes the past → when predicting future states, we only consider the present and don't need to remember the past.
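
In the standard notation (a textbook formulation, not copied from the slides), the Markov property can be written as:

```latex
% The next state depends only on the current state and action,
% not on any earlier states or actions.
p(\mathbf{s}_{t+1} \mid \mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_t, \mathbf{a}_t)
    = p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)
```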
Imitation Learning
The moral of the story, and a list of ideas
- Imitation learning via behavioral cloning is not guaranteed to work
- This is different from supervised learning
- The reason: i.i.d. assumption does not hold!
- We can formalize why this is and do a bit of theory
- We can address the problem in a few ways:
  - Be smart about how we collect (and augment) our data
  - Use very powerful models that make very few mistakes
  - Use multi-task learning
  - Change the algorithm (DAgger)
Why does behavioral cloning fail?
The distributional shift problem
- The distribution under which the policy is tested is shifted from the distribution under which it was trained.
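
To make the compounding-error intuition concrete, here is a tiny simulation (my own illustration, not code from the course): if the cloned policy makes a mistake with probability eps at each step and never recovers once it has drifted off the expert's distribution, the expected number of bad steps grows roughly as O(eps · T²) in the horizon T.

```python
import random

def expected_mistake_steps(eps: float, horizon: int, trials: int = 10_000) -> float:
    """Average number of steps spent off the expert's distribution when a
    cloned policy errs with probability eps per step and cannot recover."""
    total = 0
    for _ in range(trials):
        off = False
        for _t in range(horizon):
            if not off and random.random() < eps:
                off = True          # first mistake pushes us off-distribution
            if off:
                total += 1          # every later step counts against us
    return total / trials

# With eps = 0.01, doubling the horizon roughly quadruples the cost: O(eps * T^2)
for T in (50, 100, 200):
    print(T, expected_mistake_steps(0.01, T))
```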
Where are we…
- Imitation learning via behavioral cloning is not guaranteed to work
- This is different from supervised learning
- The reason: i.i.d. assumption does not hold!
- We can formalize why this is and do a bit of theory
- We can address the problem in a few ways:
  - Be smart about how we collect (and augment) our data
  - Use very powerful models that make very few mistakes
  - Use multi-task learning
  - Change the algorithm (DAgger)
What makes behavioral cloning easy and what makes it hard?
- Intentionally add mistakes and corrections
  - The mistakes hurt, but the corrections help, often more than the mistakes hurt
- Use data augmentation
  - Add some “fake” data that illustrates corrections (e.g., side-facing cameras)
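
A minimal sketch of the side-camera trick from the last bullet, assuming a driving-style setup where the action is a steering angle; the sample format and the correction offset are assumptions made for illustration, not from the lecture.

```python
def augment_with_side_cameras(samples, correction=0.2):
    """Create "fake" correction data from left/right camera views.

    samples: iterable of dicts with 'center', 'left', 'right' images and a
             'steering' label (assumed format). The left view is labeled as
             if the car should steer right (and vice versa), so the cloned
             policy also sees how to recover from drift.
    """
    augmented = []
    for s in samples:
        augmented.append((s["center"], s["steering"]))
        augmented.append((s["left"],   s["steering"] + correction))  # drifted left -> steer right
        augmented.append((s["right"],  s["steering"] - correction))  # drifted right -> steer left
    return augmented
```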
Why might we fail to fit the expert?
1. Non-Markovian behavior
- The expert doesn’t necessarily choose the action based only on the current state

2. Multimodal behavior
- The expert acts somewhat randomly, and the distribution of their actions is very complex and might have multiple modes

Expressive continuous distributions
Quite a few options, many ways to make things work:
- Mixture of Gaussians (see the sketch after this list)
  - Output a set of means, covariances, and mixture weights
  - Modern autodiff tools like PyTorch → pretty easy to implement
  - The problem with a mixture of Gaussians is that you choose the number of mixture elements up front, and that’s how many you get
- Latent variable models
  - Provide a way to represent a much broader class of distributions; in fact, you can show that latent variable models can represent any distribution as long as the neural network is big enough
  - Random seed ⇒ the random numbers aren’t actually correlated with anything in the input or output ⇒ the neural net will ignore those numbers
  - The trick in training latent variable models is to make those numbers useful during training
  - The most widely used type of model of this sort is the (conditional) variational autoencoder
- Diffusion models
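
As a sketch of the mixture-of-Gaussians option above (illustrative only, not the course's homework code), a PyTorch policy head can output K means, standard deviations, and mixture weights, and be trained by maximizing the log-likelihood of expert actions:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MixtureGaussianPolicy(nn.Module):
    """Mixture-of-Gaussians policy head (sketch). Outputs K means, stds,
    and mixture weights, and returns a torch distribution over actions."""

    def __init__(self, obs_dim: int, act_dim: int, k: int = 5, hidden: int = 128):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.means = nn.Linear(hidden, k * act_dim)
        self.log_stds = nn.Linear(hidden, k * act_dim)
        self.logits = nn.Linear(hidden, k)          # mixture weights

    def forward(self, obs: torch.Tensor) -> D.Distribution:
        h = self.backbone(obs)
        means = self.means(h).view(-1, self.k, self.act_dim)
        stds = self.log_stds(h).view(-1, self.k, self.act_dim).exp()
        mix = D.Categorical(logits=self.logits(h))
        comp = D.Independent(D.Normal(means, stds), 1)
        return D.MixtureSameFamily(mix, comp)

# Behavioral cloning loss = negative log-likelihood of the expert actions:
# loss = -policy(obs_batch).log_prob(act_batch).mean()
```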


Does learning many tasks become easier?
Use multi-task learning
Goal-conditioned Behavioral Cloning
- Can be used as an online self-improvement method, very similar in spirit to RL:
  - Start with a random policy
  - Collect data with random goals
  - Treat this data as “demonstrations” for the goals that were reached
  - Use this to improve the policy
  - Repeat
- The idea is that initially the policy does mostly random things, but then it learns about the actions that led to the states it actually reached, and it can be more deliberate on the next iterations.
- So the method simply applies this goal-relabeling imitation learning approach iteratively: relabeling, imitation, then more data collection, then more relabeling, and then more imitation (see the sketch below).
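
A minimal sketch of that relabel-and-imitate loop; the policy/environment interfaces (reset, step, sample_goal, train_fn) are placeholders I made up for illustration, not a specific library's API.

```python
def goal_relabeling_iteration(policy, env, num_episodes, train_fn):
    """One round of goal-conditioned BC with hindsight goal relabeling (sketch)."""
    dataset = []
    for _ in range(num_episodes):
        traj, obs = [], env.reset()
        goal = env.sample_goal()              # random commanded goal
        done = False
        while not done:
            action = policy(obs, goal)        # goal-conditioned policy
            obs, done = env.step(action)
            traj.append((obs, action))
        reached = obs                         # relabel: the state actually reached
        dataset += [(o, reached, a) for (o, a) in traj]   # treat as a "demonstration"
    train_fn(policy, dataset)                 # imitate, then repeat the whole loop
```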
Hindsight Experience Replay
- Similar principle but with reinforcement learning
- This will make more sense later once we cover off-policy value-based RL algorithms
- Worth mentioning because this idea has been used widely outside of imitation (and was arguably first proposed there)
Change the algorithm (DAgger)

3. Ask a human to label the visited states with actions ⇒ it is sometimes not very natural to ask a human to examine images and provide action labels
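
For reference, step 3 above sits inside the standard DAgger loop (Ross et al., 2011): train on the demonstrations, run the policy, ask the expert to label the states the policy visited, aggregate, and repeat. Below is a hedged sketch; the env/expert/train interfaces are placeholders for illustration.

```python
def dagger(policy, env, expert_label, demos, num_iters, train_fn):
    """DAgger sketch: train, run pi, label visited states (step 3), aggregate."""
    dataset = list(demos)                     # start from human demonstrations D
    for _ in range(num_iters):
        train_fn(policy, dataset)             # 1. train pi_theta on D
        obs, done, visited = env.reset(), False, []
        while not done:                       # 2. run pi_theta to collect states
            action = policy(obs)
            obs, done = env.step(action)
            visited.append(obs)
        labeled = [(o, expert_label(o)) for o in visited]   # 3. human labels actions
        dataset += labeled                    # 4. aggregate: D <- D ∪ D_pi
    return policy
```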
Imitation learning: what’s the problem?
- Humans need to provide data, which is typically finite
- Deep learning works best when data is plentiful
https://rail.eecs.berkeley.edu/deeprlcourse/
https://rail.eecs.berkeley.edu/deeprlcourse/deeprlcourse/static/slides/lec-2.pdf
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps