Recall.ai reduced its AWS bill by >$1,000,000 / year

November 23, 2025

They reduced their bots' CPU usage by up to 50%. Here is how 👇

Problem

They identified that the majority of CPU time was spent in two "copy memory" functions:

1. __memmove_avx_unaligned_erms

2. __memcpy_avx_unaligned_erms

The biggest callers of these functions were:

1. Python WebSocket that was receiving the data

2. Chromium's WebSocket that was sending the data

Expensive sockets

- A single 1080p 30 fps video stream in uncompressed I420 format (1.5 bytes per pixel) is 1920 × 1080 × 1.5 × 30 = 93,312,000 bytes/s, or 93.312 MB/s

- Monitoring showed that, at scale, the p99 bot receives 150 MB/s of video data
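The 93.312 MB/s figure can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `i420_bytes_per_second` is made up for illustration. I420 stores a full-resolution luma (Y) plane plus two chroma planes (U and V), each subsampled by 2 in both directions, which works out to 1.5 bytes per pixel.

```python
def i420_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Raw I420 bandwidth: one full-res Y plane plus two quarter-res chroma planes."""
    luma = width * height                      # Y plane: 1 byte per pixel
    chroma = 2 * (width // 2) * (height // 2)  # U and V planes: each subsampled 2x2
    return (luma + chroma) * fps


rate = i420_bytes_per_second(1920, 1080, 30)
print(f"{rate / 1_000_000:.3f} MB/s")  # prints "93.312 MB/s"
```

At that rate, every extra copy of the stream moves another ~93 MB through memory each second, which is why the copy functions dominated the profile.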

Solution

- Shared memory was implemented: a region of memory that multiple processes can access at the same time

- Chromium writes to a block of memory, which is read directly by the video encoder with no copying

- A ring buffer was chosen as the high-level transport design
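To make the shared memory idea concrete, here is a rough sketch using Python's standard `multiprocessing.shared_memory` module. It is not Recall.ai's implementation. For brevity both the "producer" and the "consumer" live in one process and attach to the same block by name; in a real setup they would be separate processes. `FRAME_SIZE` and `demo` are illustrative names.

```python
from multiprocessing import shared_memory

FRAME_SIZE = 1920 * 1080 * 3 // 2  # one 1080p I420 frame, in bytes


def demo() -> bool:
    # "Producer" side: create the block and write into it in place.
    producer = shared_memory.SharedMemory(create=True, size=FRAME_SIZE)
    try:
        producer.buf[:4] = b"\x10\x20\x30\x40"  # stand-in for real pixel data

        # "Consumer" side: attach to the same block by name.
        # In a real setup this would happen in a different process.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            view = consumer.buf[:FRAME_SIZE]    # memoryview slice: no copy is made
            ok = bytes(view[:4]) == b"\x10\x20\x30\x40"
            producer.buf[0] = 0x99              # a later write shows up in the view
            ok = ok and view[0] == 0x99
            view.release()                      # views must be released before close()
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()                       # free the block once everyone is done
    return ok


print(demo())
```

The consumer never receives the frame over a socket. It reads the same bytes the producer wrote, so the per-frame copy disappears.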

Ring Buffer implementation

The ring buffer has three pointers:

- Write pointer: the next address to write to

- Peek pointer: the address of the next frame to read

- Read pointer: the address where data can be overwritten

Step by step

1. Frames from the peek pointer are fed into the media pipeline to support zero-copy reads

2. The read pointer is advanced only when the frame has been fully processed

3. The media pipeline safely holds a reference to the data inside the ring buffer

4. That reference is guaranteed to be valid until the data is fully processed and the read pointer is advanced

5. Atomic operations are used to update the pointers in a thread-safe manner
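The steps above can be sketched as a small single-process model. It rests on several assumptions that are not from the original design: frames have a fixed size, the buffer is a plain `bytearray` rather than shared memory, and the pointers are ordinary Python integers instead of atomics, so it shows the pointer logic only, not the cross-process synchronization. The names `FrameRing`, `try_write`, `peek_frame` and `release_frame` are hypothetical.

```python
from typing import Optional


class FrameRing:
    """Single-process model of a ring buffer with write, peek and read pointers."""

    def __init__(self, frame_size: int, slots: int) -> None:
        self.frame_size = frame_size
        self.capacity = frame_size * slots
        self.buf = bytearray(self.capacity)
        # Monotonic byte counters; the buffer offset is counter % capacity.
        self.write = 0  # next position to write to
        self.peek = 0   # next frame to hand to the consumer
        self.read = 0   # everything before this may be overwritten

    def try_write(self, frame: bytes) -> bool:
        """Store one frame, or return False if it would overwrite unreleased data."""
        if len(frame) != self.frame_size:
            raise ValueError("frame has the wrong size")
        if self.write - self.read + self.frame_size > self.capacity:
            return False
        start = self.write % self.capacity
        memoryview(self.buf)[start:start + self.frame_size] = frame
        self.write += self.frame_size
        return True

    def peek_frame(self) -> Optional[memoryview]:
        """Return a zero-copy view of the next frame and advance the peek pointer."""
        if self.peek == self.write:
            return None  # nothing new to read
        start = self.peek % self.capacity
        self.peek += self.frame_size
        return memoryview(self.buf)[start:start + self.frame_size]

    def release_frame(self) -> None:
        """Advance the read pointer once the oldest peeked frame is fully processed."""
        if self.read == self.peek:
            raise RuntimeError("no outstanding frame to release")
        self.read += self.frame_size


ring = FrameRing(frame_size=4, slots=2)
assert ring.try_write(b"AAAA") and ring.try_write(b"BBBB")
assert not ring.try_write(b"CCCC")  # full: nothing has been released yet
first = ring.peek_frame()           # view into the buffer, no copy
second = ring.peek_frame()
assert bytes(first) == b"AAAA"      # still valid: the read pointer has not moved
ring.release_frame()                # frame A is processed, its slot is free
assert ring.try_write(b"CCCC")      # reuses A's slot only
assert bytes(second) == b"BBBB"     # B stays intact until it is released too
```

Separating the peek and read pointers is what lets the consumer keep using a frame in place: the writer checks free space against the read pointer, so a peeked frame cannot be overwritten until it is released.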

🤔 Over to you: how would you signal that new data is available or that buffer space is free?

Radhakrishnan Perumal
