Recall.ai reduced its AWS bill by more than $1,000,000 per year
They cut their bots' CPU usage by up to 50%. Here is how 👇
Problem
They identified that the majority of CPU time was spent in two "copy memory" functions:
1. __memmove_avx_unaligned_erms
2. __memcpy_avx_unaligned_erms
The biggest callers of these functions were:
1. The Python WebSocket that was receiving the data
2. Chromium's WebSocket that was sending the data
Expensive sockets
- A single 1080p, 30 fps video stream in uncompressed I420 format is 93.312 MB/s
- Monitoring showed that, at scale, the p99 bot receives 150 MB/s of video data
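The 93.312 MB/s figure can be checked by hand. A minimal sketch, assuming 1080p means 1920x1080 and that MB means 10^6 bytes; the function name is only illustrative. I420 stores a full-resolution luma plane plus two chroma planes subsampled 2x2, which works out to 1.5 bytes per pixel.

```python
def i420_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Raw data rate of an uncompressed I420 stream, in bytes per second."""
    # Y plane: width * height bytes; U and V planes: a quarter of that each.
    frame_bytes = width * height * 3 // 2
    return frame_bytes * fps


rate = i420_bytes_per_second(1920, 1080, 30)
print(rate / 1_000_000)  # 93.312
```

Every one of those bytes was being copied by both the sending and the receiving WebSocket, which is why the copy functions dominated the profile.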
Solution
- Shared memory was implemented: a region of memory that multiple processes can access simultaneously
- Chromium writes to a block of memory, which is read directly by the video encoder with no copying
- A ring buffer was chosen as the high-level transport design
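To make the shared-memory idea concrete, here is a minimal sketch using Python's standard multiprocessing.shared_memory module. This is a stand-in for illustration, not the API the post says was used; the roundtrip helper is hypothetical, and both handles live in one process only to keep the example short. In a real setup the reader would be a separate process attaching by name.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload through one handle and read it back through another."""
    # The "producer" creates a named block of shared memory.
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        writer.buf[:len(payload)] = payload
        # The "consumer" attaches to the same block by name.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            view = reader.buf[:len(payload)]  # memoryview slice, no copy yet
            result = bytes(view)              # copied only to return a value
            view.release()                    # views must be released before close()
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # free the block once nobody needs it
    return result


print(roundtrip(b"frame"))  # b'frame'
```

The consumer works on a view of the same memory the producer wrote to, so no socket and no intermediate copy sits between them.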
Ring Buffer implementation
The ring buffer has three pointers:
- Write pointer: the next address to write to
- Peek pointer: the address of the next frame to read
- Read pointer: the address where data can be overwritten
Step By Step
1. Frames from the peek pointer are fed into the media pipeline to support zero-copy reads
2. The read pointer is advanced once the frame has been fully processed
3. Meanwhile, the media pipeline safely holds a reference to the data inside the ring buffer
4. That reference is guaranteed to be valid until the data is fully processed and the read pointer is advanced
5. Atomic operations are used to update the pointers in a thread-safe manner
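The pointer logic in the steps above can be sketched as follows. This is a simplified single-process model, not the post's implementation: it assumes fixed-size frame slots, one producer and one consumer, and it uses plain integers where a real cross-process version would need atomic loads and stores. The FrameRing class and its method names are hypothetical.

```python
from typing import Optional


class FrameRing:
    """Single-producer, single-consumer ring of fixed-size frame slots."""

    def __init__(self, slots: int, frame_bytes: int) -> None:
        self.slots = slots
        self.frame_bytes = frame_bytes
        self.buf = bytearray(slots * frame_bytes)
        # Pointers are ever-increasing frame counts; slot = count % slots.
        self.write = 0  # next slot to write to
        self.peek = 0   # next frame to hand to the consumer
        self.read = 0   # oldest frame still held; slots before it are free

    def _slot(self, count: int) -> memoryview:
        start = (count % self.slots) * self.frame_bytes
        return memoryview(self.buf)[start:start + self.frame_bytes]

    def try_write(self, frame: bytes) -> bool:
        if len(frame) != self.frame_bytes:
            raise ValueError("frame has the wrong size")
        if self.write - self.read == self.slots:
            return False  # full: every slot holds an unreleased frame
        self._slot(self.write)[:] = frame
        self.write += 1
        return True

    def try_peek(self) -> Optional[memoryview]:
        if self.peek == self.write:
            return None  # nothing new to read
        view = self._slot(self.peek)  # a view into the buffer, not a copy
        self.peek += 1
        return view

    def release(self) -> None:
        if self.read == self.peek:
            raise RuntimeError("no peeked frame to release")
        self.read += 1  # the oldest peeked slot may now be overwritten


ring = FrameRing(slots=2, frame_bytes=4)
ring.try_write(b"aaaa")
ring.try_write(b"bbbb")
frame = ring.try_peek()             # zero-copy view of the first frame
blocked = ring.try_write(b"cccc")   # False: the first frame is still held
ring.release()                      # the consumer is done with the first frame
accepted = ring.try_write(b"cccc")  # True: its slot can be reused
```

Separating peek from read is what makes the zero-copy read safe: the writer can never reclaim a slot while the pipeline still holds a view of it. Once release() has been called, the view must not be used again, because the slot may already contain a newer frame.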
🤔 Over to you: how would you signal that new data is available or that buffer space is free?