Recall.ai reduced its AWS bill by >$1,000,000 / year

They cut their bots' CPU usage by up to 50%. Here is how 👇

Problem

They identified that the majority of CPU time was spent in two "copy memory" functions:


1. __memmove_avx_unaligned_erms 

2. __memcpy_avx_unaligned_erms

The biggest callers of these functions were:

1. Python WebSocket that was receiving the data

2. Chromium's WebSocket that was sending the data

Expensive sockets

- A single 1080p 30 fps video stream in uncompressed I420 format comes to 93.312 MB/s

- Monitoring showed that, at scale, the p99 bot receives 150 MB/s of video data
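
To see where the 93.312 MB/s figure comes from: I420 stores a full-resolution luma plane plus two chroma planes, each subsampled by 2 in both dimensions, which works out to 1.5 bytes per pixel. A minimal sketch of the arithmetic, assuming 8-bit samples and MB meaning 10^6 bytes (the function name is illustrative):

```python
def i420_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Raw bandwidth of an uncompressed I420 (YUV 4:2:0, 8-bit) stream."""
    luma = width * height                      # Y plane: 1 byte per pixel
    chroma = 2 * (width // 2) * (height // 2)  # U and V planes, each a quarter of Y
    return (luma + chroma) * fps


rate = i420_bytes_per_second(1920, 1080, 30)
print(rate / 1_000_000)  # 93.312
```

Every extra copy of the stream between processes moves that many bytes again, which is why the copy functions dominated the profile.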

Solution

- Shared memory was introduced: a region of memory that multiple processes can access at the same time

- Chromium writes to a block of memory, which is read directly by the video encoder with no copying

- A ring buffer was chosen as the high-level transport design
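
A minimal sketch of the shared-memory idea using Python's standard `multiprocessing.shared_memory` module. This is not the implementation described above: both handles live in one process here to keep the example short, whereas in practice the writer and the reader would be separate processes attaching to the same block by name. `FRAME_SIZE` and the variable names are made up for illustration.

```python
from multiprocessing import shared_memory

FRAME_SIZE = 16  # hypothetical tiny "frame" so the demo stays small

# Writer side (stands in for Chromium): create a named shared block.
writer = shared_memory.SharedMemory(create=True, size=FRAME_SIZE)
try:
    # Reader side (stands in for the encoder): attach to the same block by name.
    reader = shared_memory.SharedMemory(name=writer.name)
    try:
        writer.buf[:4] = b"\x01\x02\x03\x04"  # producer writes in place
        view = reader.buf[:4]                  # memoryview slice: no bytes are copied
        first_bytes = bytes(view)              # copy made only to inspect the result
        writer.buf[0] = 0xFF                   # a later write...
        sees_update = view[0] == 0xFF          # ...is visible through the same view
        view.release()                         # views must be released before close()
    finally:
        reader.close()
finally:
    writer.close()
    writer.unlink()  # remove the block once every handle is closed
```

The reader never receives the bytes over a socket; it only learns where they are, which is what removes the `memcpy`/`memmove` work from the hot path.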

Ring Buffer implementation

The ring buffer has three pointers:

- Write pointer: the next address to write to

- Peek pointer: the address of the next frame to read


- Read pointer: the address where data can be overwritten
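
One way to picture how the three pointers divide the buffer is to treat them as ever-increasing byte counters and take the offset modulo the capacity. That representation is an assumption made for this sketch, not something the original design states, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RingPointers:
    capacity: int   # total bytes in the ring
    write: int = 0  # next position to write to
    peek: int = 0   # next frame to read
    read: int = 0   # everything before this may be overwritten

    def unread(self) -> int:
        """Bytes written but not yet handed to the reader."""
        return self.write - self.peek

    def in_flight(self) -> int:
        """Bytes handed to the reader but not yet released."""
        return self.peek - self.read

    def free(self) -> int:
        """Bytes the writer may still fill without clobbering live data."""
        return self.capacity - (self.write - self.read)

    def offset(self, counter: int) -> int:
        """Map a monotonic counter to a position inside the ring."""
        return counter % self.capacity


p = RingPointers(capacity=1024, write=900, peek=600, read=300)
print(p.unread(), p.in_flight(), p.free())  # 300 300 424
```

The invariant is `read <= peek <= write <= read + capacity`: the writer is limited by the read pointer, not the peek pointer, so data that is still being processed is never overwritten.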

Step By Step

1. Frames from the peek pointer are fed into the media pipeline to support zero-copy reads

2. The read pointer is advanced only once the frame has been fully processed

3. The media pipeline can therefore safely hold a reference to the data inside the ring buffer

4. That reference is guaranteed to be valid until the data is fully processed and the read pointer is advanced

5. Atomic operations are used to update the pointers in a thread-safe manner
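
The steps above can be sketched as a small single-producer, single-consumer ring. Several things here are assumptions for illustration: fixed-size frame slots, a `bytearray` standing in for the shared memory block, and a `threading.Lock` standing in for the atomic pointer updates, since Python has no user-level atomic integers. The class and method names are hypothetical.

```python
import threading


class FrameRing:
    """Illustrative ring of fixed-size frame slots with write/peek/read pointers."""

    def __init__(self, slots: int, frame_size: int):
        self.slots = slots
        self.frame_size = frame_size
        self._buf = bytearray(slots * frame_size)  # stand-in for shared memory
        self._mem = memoryview(self._buf)
        # Monotonic slot counters; slot index = counter % slots.
        self.write = 0  # next slot to write to
        self.peek = 0   # next frame to hand to the pipeline
        self.read = 0   # oldest slot still referenced; earlier slots are reusable
        self._lock = threading.Lock()  # stand-in for atomic pointer updates

    def _slot(self, counter: int) -> memoryview:
        start = (counter % self.slots) * self.frame_size
        return self._mem[start:start + self.frame_size]

    def write_frame(self, data: bytes) -> bool:
        """Store one frame; refuse if that would overwrite an unreleased slot."""
        if len(data) != self.frame_size:
            raise ValueError("frame has the wrong size")
        with self._lock:
            if self.write - self.read == self.slots:
                return False  # ring is full
            counter = self.write
        self._slot(counter)[:] = data
        with self._lock:
            self.write = counter + 1  # publish only after the bytes are in place
        return True

    def peek_frame(self):
        """Borrow the next frame as a zero-copy view, or None if nothing is new."""
        with self._lock:
            if self.peek == self.write:
                return None
            counter = self.peek
            self.peek += 1
        return self._slot(counter)

    def release_frame(self) -> None:
        """Mark the oldest borrowed frame as fully processed."""
        with self._lock:
            if self.read == self.peek:
                raise RuntimeError("no borrowed frame to release")
            self.read += 1


ring = FrameRing(slots=2, frame_size=4)
ring.write_frame(b"AAAA")
ring.write_frame(b"BBBB")
frame = ring.peek_frame()                      # step 1: zero-copy view of the first frame
still_blocked = not ring.write_frame(b"CCCC")  # steps 3-4: borrowed slot is protected
first = bytes(frame)                           # "process" the frame
ring.release_frame()                           # step 2: read pointer advances
accepted = ring.write_frame(b"CCCC")           # the slot can be reused now
```

Once `release_frame` has been called, the earlier view must no longer be used, because the writer is free to overwrite that slot.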

🤔 Over to you: how would you signal that new data is available or that buffer space is free?
