cross-posted from: https://lemmy.pierre-couy.fr/post/2152233

I've been playing with deep reinforcement learning for a while. I started with a simple DQN, added all the improvements from the Rainbow paper, and finally swapped C51 for quantile regression (QR-DQN), which I plan to replace with an Implicit Quantile Network.

After implementing C51 (which was my first time with distributional RL), I started playing with policies that take advantage of the learned distributions. By independently taking N samples from each action-value distribution, scoring each action by the average of its samples, and picking the greedy action with respect to these scores, I was able to make the agent learn faster than similar agents using only NoisyNets or an epsilon-greedy policy (I'm still using NoisyNet, this is done on top of it). In the limiting cases, N=1 is just Thompson sampling and N=+Infinity is just a plain greedy policy.

Finding an optimal value for N proved to be a challenge, so I decided to pick a random value for it at the start of each episode (N = 2**rng.uniform(8,12) for a QR-DQN with 32 quantiles/action works well in my experiments), which led to even better results.
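As a rough illustration of the sampling policy and the per-episode draw of N described above, here is a minimal NumPy sketch. It assumes a QR-DQN head whose output is an array of shape (n_actions, n_quantiles), and it approximates "taking a sample" by picking one of the learned quantiles uniformly at random. The names pick_action and draw_episode_n are hypothetical, and rounding N to an integer is my assumption.

```python
import numpy as np


def pick_action(quantiles, n_samples, rng):
    """Greedy action w.r.t. the mean of n_samples draws per action.

    quantiles: array of shape (n_actions, n_quantiles), one row per action.
    """
    n_actions, n_quantiles = quantiles.shape
    # Assumption: sampling from a quantile-regression distribution is done
    # by picking one of its quantile values uniformly at random.
    idx = rng.integers(0, n_quantiles, size=(n_actions, n_samples))
    samples = np.take_along_axis(quantiles, idx, axis=1)
    scores = samples.mean(axis=1)  # one score per action
    return int(np.argmax(scores))


def draw_episode_n(rng, low_exp=8.0, high_exp=12.0):
    """Draw N = 2**uniform(8, 12) once at the start of an episode."""
    # Rounding to an integer sample count is an assumption of this sketch.
    return int(round(2.0 ** rng.uniform(low_exp, high_exp)))


rng = np.random.default_rng(0)
quantiles = rng.normal(size=(4, 32))  # stand-in for the network output
n = draw_episode_n(rng)
action = pick_action(quantiles, n, rng)
```

With n_samples=1 the choice is as noisy as Thompson sampling, and as n_samples grows the scores converge to the distribution means, which recovers the greedy policy.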

I later found out about DLTV (Decaying Left Truncated Variance), which made the agent discover new behaviors but performed worse than my previous experiments overall. Inspired by it, I tried something I did not find in previous work and got the best results of all my experiments so far:

At each time step, compute an exploration_score as the ratio of "intra-action variance" over "inter-action variance" (the rendered LaTeX equation is not reproduced here). I then take N/exploration_score samples from each distribution, and pick an action as described above. (More details at the end of this post.)
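Since the exact formula is not available in this text, the sketch below shows only one plausible reading of the ratio. It assumes "intra-action variance" is the average variance of the per-action distributions and "inter-action variance" is the variance of the per-action means. The function names and the eps and n_min safeguards are hypothetical.

```python
import numpy as np


def exploration_score(quantiles, eps=1e-8):
    """Ratio of intra-action variance over inter-action variance.

    quantiles: array of shape (n_actions, n_quantiles).
    Assumed definitions (not confirmed by the post):
      intra = mean over actions of each distribution's variance
      inter = variance across actions of the distribution means
    """
    intra = quantiles.var(axis=1).mean()
    inter = quantiles.mean(axis=1).var()
    return float(intra / (inter + eps))  # eps avoids division by zero


def samples_for_step(n_episode, score, n_min=1):
    """Number of samples per action for this step: N / exploration_score."""
    return max(n_min, int(round(n_episode / max(score, 1e-8))))


quantiles = np.array([[0.0, 2.0], [10.0, 12.0]])
score = exploration_score(quantiles)       # wide gap between actions: low score
n_step_samples = samples_for_step(100, score)
```

Under this reading, a high score (uncertain distributions, similar means) yields few samples and therefore more exploration, while a low score (clearly separated actions) yields many samples and a nearly greedy choice.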

For anyone reading this, I have a few questions:

  1. Are you aware of any previous work I missed that tries similar exploration policies with distributional RL (interpolating between Thompson sampling and the greedy policy)?
  2. Most papers I found about learning from multiple exploration policies seem to be in the context of multi-actor parallelization. Is there any novelty in randomizing the policy parameters at the start of each episode, especially in the single-actor case?
  3. Is any part of what I'm doing worth the time it would take to evaluate it quantitatively? I've been doing it mainly for learning and fun, and have only evaluated it qualitatively so far. However, if there's a chance I can contribute to the field, I'll gladly make some time to compare it to published papers on ALE.

A few more details

I actually track a moving average and standard deviation of the exploration score, which lets me shift and rescale its values to a target average and standard deviation, and I divide N by the shifted and rescaled value. I started with a target average of 1 and a target standard deviation of 1 (which gave good results), then tried randomizing these two parameters at the start of each episode too. This led to a lot more diversity in the policies and even better results.
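The shift-and-rescale step could look like the following sketch. It assumes an exponential moving average for the running statistics; the decay value, the class name ScoreNormalizer and the floor parameter are all hypothetical. The floor is there because a value rescaled to mean 1 and standard deviation 1 can reach zero or go negative, and N cannot be divided by that; the post does not say how this case is handled.

```python
class ScoreNormalizer:
    """Tracks a moving mean/std of the exploration score and rescales it."""

    def __init__(self, decay=0.99, eps=1e-8):
        self.decay = decay  # assumption: exponential moving average
        self.eps = eps
        self.mean = None
        self.var = 0.0

    def update(self, score):
        if self.mean is None:
            self.mean = score
            return
        alpha = 1.0 - self.decay
        delta = score - self.mean
        self.mean += alpha * delta
        # Standard exponentially weighted variance update.
        self.var = self.decay * (self.var + alpha * delta ** 2)

    def rescale(self, score, target_mean=1.0, target_std=1.0, floor=1e-3):
        """Map score to the target mean/std, clipped to a positive floor."""
        if self.mean is None:
            return target_mean
        z = (score - self.mean) / (self.var ** 0.5 + self.eps)
        return max(floor, target_mean + target_std * z)


normalizer = ScoreNormalizer()
for s in [0.8, 1.2, 1.0, 0.9, 1.1]:
    normalizer.update(s)
divisor = normalizer.rescale(1.0)  # N would then be divided by this value
```

Randomizing target_mean and target_std per episode, as described above, only changes the arguments passed to rescale.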

Since this worked so well, I additionally randomized the noise strength in the NoisyNet layers.

Overall, this made the agent a lot more robust to deviating from what it considers to be the optimal trajectory, and allowed it to learn complex behaviors that previous iterations were never able to learn (e.g. taking a few steps back to gain momentum, waiting for good cycles, or dodging Hammer Bros).


Watch it learn

For anyone interested, I made a live stream of the training in progress with graphs and some more details on the experiments I'm running. The current training run was started ~2.5 days ago. The agent has finished and unlocked levels up to 5-1, and is currently learning 5-2.


A lot more details

Available actions: The agent does not have access to the up and down buttons; the available actions only use left, right, A and B.

Adding the down button would double the total number of actions (because down can be pressed on top of all available actions).

Reward function: It mainly consists of reward(t) = max(0, x(t) - previous_best_x) + a larger reward for beating a stage. I had to tweak the scaling of both components.

I initially had penalties for time and death, but the time penalty made the agent suicidal in front of hard-to-overcome obstacles, while the death penalty made it fear them too much and hug the left side of the screen. Removing both improved performance.

One trick that seems to help with most '*-3' levels (which have a lot of void to fall into) was to hold the reward while the vertical velocity of Mario is negative (meaning it is falling). Without this trick, the agent would sometimes get stuck learning to jump the farthest it can into the void.
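Here is a minimal sketch of such a reward function. It interprets "holding" the reward as paying no progress reward and leaving previous_best_x unchanged while Mario is falling, so the progress is only credited once he stops falling; this interpretation, the class name ProgressReward, and the progress_scale and stage_bonus values are assumptions, not the values used in the actual runs.

```python
class ProgressReward:
    """reward(t) = max(0, x(t) - previous_best_x) + bonus for beating a stage."""

    def __init__(self, progress_scale=1.0, stage_bonus=100.0):
        self.progress_scale = progress_scale  # hypothetical scaling
        self.stage_bonus = stage_bonus        # hypothetical scaling
        self.best_x = 0.0

    def reset(self, start_x=0.0):
        self.best_x = start_x

    def step(self, x, vertical_velocity, stage_cleared=False):
        reward = 0.0
        # Assumed convention from the post: negative velocity means falling.
        if vertical_velocity >= 0:
            reward += self.progress_scale * max(0.0, x - self.best_x)
            self.best_x = max(self.best_x, x)
        # While falling, progress is held: nothing is paid and best_x is
        # not advanced, so a jump into the void earns nothing.
        if stage_cleared:
            reward += self.stage_bonus
        return reward


reward_fn = ProgressReward()
r_run = reward_fn.step(x=5.0, vertical_velocity=0.0)    # 5.0
r_fall = reward_fn.step(x=8.0, vertical_velocity=-1.0)  # 0.0, held
r_land = reward_fn.step(x=9.0, vertical_velocity=0.0)   # 4.0, paid on landing
```

There are no time or death penalties in this sketch, matching the final design described above.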

Stage scheduling: Each episode is one attempt on one level. At the start of each episode, a stage is randomly picked with probability proportional to 1/(number of times the stage was beaten) among the unlocked stages. Each stage is unlocked after the previous one has been beaten 30 times, with only 1-1 unlocked at the start of the training.
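The scheduling rule can be sketched as follows. The function names are hypothetical, and the weight uses 1/(1 + times beaten) rather than 1/(times beaten) to avoid dividing by zero for a freshly unlocked stage; the post does not say how that case is handled, so this is an assumption.

```python
import random

UNLOCK_THRESHOLD = 30  # beats of the previous stage needed to unlock the next


def unlocked_stages(stage_order, times_beaten):
    """Return the prefix of stage_order that is currently unlocked."""
    unlocked = [stage_order[0]]  # the first stage is always available
    for previous, following in zip(stage_order, stage_order[1:]):
        if times_beaten.get(previous, 0) < UNLOCK_THRESHOLD:
            break
        unlocked.append(following)
    return unlocked


def pick_stage(stage_order, times_beaten, rng):
    """Sample an unlocked stage, favouring stages that were beaten less often."""
    stages = unlocked_stages(stage_order, times_beaten)
    # Assumption: +1 in the denominator so never-beaten stages get a finite weight.
    weights = [1.0 / (1 + times_beaten.get(s, 0)) for s in stages]
    return rng.choices(stages, weights=weights, k=1)[0]


stage_order = ["1-1", "1-2", "1-3", "1-4"]
times_beaten = {"1-1": 120, "1-2": 30, "1-3": 2}
stage = pick_stage(stage_order, times_beaten, random.Random(0))
```

In this example 1-4 stays locked because 1-3 has only been beaten twice, and 1-3 is by far the most likely pick.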

Available stages: The first iterations of the agent were unable to learn maze castles (4-3, 7-3 and 8-4), so I removed them all. In these stages, the reward function gives rewards for the first path the agent tries, then the agent is teleported back by the game and receives no reward until it finds the right path and gets past the point where the game teleported it back. I plan to test newer (better) versions of the agent on these stages only and see if mazes can be re-added to the pool.

I've also removed underwater stages (2-2 and 7-2). The agent can learn them fine, but the game dynamics are really different from all other stages and they're really boring to watch. Since I already removed a bunch of stages, I figured I could remove these as well but I may re-add them with mazes because beating every level is cooler than beating a cherry-picked selection.

Since 8-4 is the only stage that requires going down a pipe, I considered it was not worth it to add the down action and will likely never re-add it to the pool, which would unfortunately be really anti-climactic...

Replay buffer warm-up: After initially using the standard approach of filling the buffer with transitions sampled from a random policy before training the neural net, I came up with a "soft warm-up" scheme in which the first gradient updates happen after only 2000 transitions, but initially happen every few thousand transitions and gradually become more frequent until the replay buffer is full. Together with my custom exploration policy, this works very well: the agent very quickly starts behaving similarly to a "right + random button" policy before learning to actually jump and run.
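One way to express such a schedule is sketched below. Only the 2000-transition starting point comes from the description above; the buffer capacity, the first and final update intervals, the geometric interpolation and the function name update_interval are all assumptions chosen for illustration.

```python
def update_interval(buffer_fill, start=2000, capacity=1_000_000,
                    first_interval=4000, final_interval=4):
    """Number of environment transitions between two gradient updates.

    Returns None while no update should happen yet.
    """
    if buffer_fill < start:
        return None
    # Fraction of the warm-up completed, from 0.0 (start) to 1.0 (buffer full).
    frac = min(1.0, (buffer_fill - start) / (capacity - start))
    # Assumption: geometric interpolation from first_interval to final_interval.
    interval = first_interval * (final_interval / first_interval) ** frac
    return max(final_interval, round(interval))


early = update_interval(2_000)      # 4000: updates are rare at first
full = update_interval(1_000_000)   # 4: regular update rate once the buffer is full
```

A training loop would call update_interval with the current buffer size and run a gradient step whenever that many transitions have elapsed since the last one.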

Custom n-step bootstrapping: When I first implemented n-step bootstrap targets, I used n=3 from the Rainbow paper, noting the same instabilities as the paper did for higher values of n. I then found the Retrace(λ) paper, which seems to successfully address this by increasing n until the online network disagrees with the action choice from a stored transition. This makes n larger where the replay buffer data is on-policy, and smaller when it becomes off-policy. Since my GPU is already maxed out and the training is already slow (20.8t/s when real-time is 20t/s), I could not afford the additional computations: building a training sample (s(t), a(t), sum(r(t+0..n)), s(t+n)) needs up to n_max transitions to go through the online network.

I'm trying to achieve similar sample efficiency gains by using cheaper alternatives as proxies for "how off-policy is a given transition": I'm using the number of times a transition has been sampled, with n = int(max(n_min, n_max * k**times_sampled)) and 0 < k < 1. The currently running experiment uses n_max=14, n_min=1 and k=1/1.3. I'm pretty sure it helps early in the training, and it does not collapse like a constant n=14 does.
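To make the decay concrete, here is the formula above as a small Python function, evaluated with the parameters of the current experiment. The function name n_step_for is hypothetical.

```python
def n_step_for(times_sampled, n_max=14, n_min=1, k=1 / 1.3):
    """Bootstrap horizon for a transition sampled times_sampled times so far."""
    return int(max(n_min, n_max * k ** times_sampled))


schedule = [n_step_for(t) for t in range(9)]
print(schedule)  # [14, 10, 8, 6, 4, 3, 2, 2, 1]
```

A fresh transition gets the full n=14 horizon, and after being sampled 8 times it falls back to plain 1-step targets.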

Stream setup: As I said, this is something I do for my own fun, and I really wanted to be able to see the agent learn in real time. The code runs a separate process, to which frames from training episodes are sent in a queue. The process then sends the frames as raw RGB24 to a local UDP socket, to which GStreamer connects and encodes the stream. With a simple MediaMTX configuration, I can manage the GStreamer process and have the stream available through WebRTC on my LAN.

Then I figured someone else might have fun watching this, so I added a line to my MediaMTX config to send the stream to Twitch and YouTube. The overlay is a headless browser displaying custom HTML/JS (using d3.js for the graphs) piping raw frames to ffmpeg. GStreamer handles compositing the two streams together into the side-by-side view.
