Pesto Blog

[Updated 2022-03-09] Performance and stability progress

Doug Safreno
Dec 14, 2021

We're not doing well enough on performance and stability. To create clarity between our users and our team on what we're doing to address this, I'll track our progress in this document until we've made significant headway.

[Please note that I'll put down the bigger projects here, but there are many other smaller pieces of work also in flight]


2022-03-09: last week, we discovered a major issue that was causing specific rooms in some teams to stop working with our v2 infrastructure. A fix has been deployed. Updated the project list; the big Electron / memory usage project will be landing soon.
2022-01-28: "eliminate A/V write collisions" was completed. We also migrated the rest of the web codebase to TypeScript to make it harder for bugs to get through to prod. Finally, we're now working on "Deploys can occasionally cause the app to crash." I also updated the significance ratings on stability issues.
2022-01-05: Happy New Year! 2 memory leaks were fixed today, alongside a big performance fix for >20 person teams. The pace of updates will be a bit slower as we take on bigger chunks of work, but lots more enhancements are in progress.
2021-12-22: we've found 4 memory leaks, and are working on fixes for all of them. Updated statuses under background to reflect recent developments. Probably the last update of the year before our end-of-year shutdown - happy holidays!
2021-12-20: over the weekend, we released 4 projects - "non-AV app lag", "unnecessary rendering", "partial updates", and "periodic timeouts on infrastructure." The app should be noticeably snappier and more stable as a result. We'll keep an eye on things. Also, we started work on memory usage.
2021-12-17: discovered new issue "unnecessary rendering" and added to web client section. Sean is starting on this work immediately so also added to work-in-progress. Many fixes going out this weekend; expect a lot of things to move to "done" soon.
2021-12-16: discovered primary web client issue and updated "Non-A/V app lag" with details. Added information about Anna's work under "Other".
2021-12-15: upgraded data infrastructure from "insignificant" to "medium" after discovering periodic instability (~20s, a couple of times per week). Added "Periodic timeouts on infrastructure" project to get to the bottom of it.
2021-12-14: published initial version of post

Background: Where is the instability / bad performance coming from?

A/V client (mobile and desktop/web): Known medium-large source

There's a lot of work to be done on the client side of the A/V equation, particularly when network connections get choppy. While we do a good job of downgrading on the receiver side of a given media stream, we aren't doing a great job of downgrading on the sender side, leading to an unreliable connection being broadcast. These issues are particularly pronounced when video is present, and seem to be worse on mobile. Fixing bandwidth issues in the data infrastructure has helped somewhat with this.
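
To make "downgrading on the sender side" concrete, here is a minimal sketch of how a client might pick what to send based on its uplink. It is only an illustration, not the production logic: the tier names, bitrates, the 10% loss cutoff, and the 80% headroom factor are all assumptions made up for this example.

```typescript
// Hypothetical send tiers, ordered from highest to lowest quality.
interface SendTier {
  name: string;
  video: boolean;
  maxBitrateKbps: number;
}

const HD_VIDEO: SendTier = { name: "hd-video", video: true, maxBitrateKbps: 1500 };
const LOW_VIDEO: SendTier = { name: "low-video", video: true, maxBitrateKbps: 300 };
const AUDIO_ONLY: SendTier = { name: "audio-only", video: false, maxBitrateKbps: 40 };
const TIERS: SendTier[] = [HD_VIDEO, LOW_VIDEO, AUDIO_ONLY];

interface UplinkStats {
  availableKbps: number; // estimated outgoing bandwidth
  packetLoss: number;    // fraction of packets lost, between 0 and 1
}

// Pick the best tier that fits the uplink. All thresholds are assumptions.
function chooseSendTier(stats: UplinkStats): SendTier {
  // Heavy loss: drop video regardless of the bandwidth estimate.
  if (stats.packetLoss > 0.1) {
    return AUDIO_ONLY;
  }
  // Leave headroom so the stream does not saturate the link.
  const budgetKbps = stats.availableKbps * 0.8;
  for (const tier of TIERS) {
    if (tier.maxBitrateKbps <= budgetKbps) {
      return tier;
    }
  }
  // Nothing fits: fall back to the cheapest tier rather than sending nothing.
  return AUDIO_ONLY;
}
```

The point of the sketch is that the sender, not just the receiver, reacts to a bad connection: a choppy uplink steps down to low video or audio only instead of broadcasting an unreliable stream.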

Electron usage: Known medium source

We're not using Electron optimally, and this results in some particularly bad performance in the desktop app compared with the browser. We are actively diagnosing.

Deploys: Known medium source

When we deploy new changes, our servers "roll" clients from the old servers to the new ones. There is some instability that can occur during this time, largely from too many clients handing off at the same time and overloading the new servers. We've largely fixed this now (see "Better sync rolls" below).
Update 2022-01-28: Sometimes deploys can cause client crashes. We're working on a fix.

Web client: Known small source

We've been primarily focused on our AV infrastructure and data infrastructure over the past 12 months, and as a result, our web/desktop client had not gotten a lot of love and became the #1 source of stability and performance issues. We fixed the most severe of these (see "non-A/V app lag", "unnecessary rendering" below) very recently and are evaluating what else can be improved.

A/V infrastructure: Known small source

We use WebRTC as part of our v2 A/V infrastructure. Our servers are based on the Janus open-source SFU with our own proprietary signaling layer. We operate seven clusters across the world to provide low latency. These are largely running stably and performing well, but occasionally have blips. We are doing work to better understand these.

Data infrastructure: Known small source

We released our new data infrastructure the first week of November. We've fixed many of its early issues by now, including one major enhancement in this area (see "Partial Updates" below). We also fixed a periodic issue with timeouts (see "Periodic timeouts on infrastructure" below).

Session logic: Known small source

Ever see the "Take session" screen? Our session logic is designed to prevent multiple clients for the same user from joining audio/video at the same time, but it operates at far too high a level: it blocks loading the app at all, rather than just the meeting. This is a known issue that has received many mitigations but ultimately needs a complete rework (coming this year).

Work in progress

🔨 March 2022 - Combo actions

  • Impact: lag when using the web, desktop, or mobile apps
  • Details: "combo" actions will perform complex, multi-write operations all at once, while maintaining realtime sync
  • Author: Doug

🔨 March 2022 - Electron usage issues

  • Impact: lag when using the desktop app
  • Details: TBD (investigation in progress)
  • Author: Sean

🔨 March 2022 - Memory usage

  • Impact: consuming too many resources
  • Details: 4 leaks found; more possibly lurking. 2 fixes are in, the rest will come in with the Electron usage fixes above.
  • Author: Sean

On deck

📝 Web A/V in low-bandwidth / bad connection situations

  • Impact: more reliable calls from web/desktop app
  • Details: make A/V client degrade gracefully when uplink is unreliable or low-bandwidth
  • Author: TBD

📝 Mobile A/V in low-bandwidth / bad connection situations

  • Impact: more reliable calls from mobile
  • Details: make A/V client degrade gracefully when uplink is unreliable or low-bandwidth
  • Author: TBD

📝 Session rework

  • Impact: no more "take session" screens
  • Details: Allow clients who don't have the session to still log in to the app, but prevent them from loading media. Design UI for this case.
  • Author: TBD

Recently Completed

✔️ Feb 2022 - Mobile loading taking a long time for some users

  • Impact: for some users, opening the mobile app can take seconds to even minutes
  • Details: Fixed a bug where the app sometimes wouldn't load at all
  • Author: Ryan

✔️ Jan 2022 - Web TypeScript conversion

  • Impact: bugs that can be easily caught by a typed language were leaking out to production
  • Details: Change the remaining JS to TypeScript to make this impossible
  • Author: Whole team

✔️ Jan 2022 - Eliminate A/V write collisions

  • Impact: when performing common meeting actions like muting/unmuting, getting errors or actions being silently undone
  • Details: Change the DB table layout, and the corresponding API, so that different user fields can be written at the same time without overwriting each other
  • Author: Ryan
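
A small in-memory sketch of why field-level writes avoid "silently undone" actions. This is an illustration under assumptions, not the real table layout or API: `ParticipantState`, `writeWholeRecord`, and `writeFields` are hypothetical names invented for the example.

```typescript
// Hypothetical per-user meeting state.
interface ParticipantState {
  muted: boolean;
  cameraOn: boolean;
  handRaised: boolean;
}

// Whole-record write: the client sends back its full (possibly stale) copy,
// so the last writer overwrites every field.
function writeWholeRecord(
  _current: ParticipantState,
  clientCopy: ParticipantState,
): ParticipantState {
  return { ...clientCopy };
}

// Field-level write: only the named fields change; the rest are untouched.
function writeFields(
  current: ParticipantState,
  changes: Partial<ParticipantState>,
): ParticipantState {
  return { ...current, ...changes };
}

const initial: ParticipantState = { muted: false, cameraOn: true, handRaised: false };

// Two clients both read `initial`, then act at nearly the same time.
const afterWhole = writeWholeRecord(
  writeWholeRecord(initial, { ...initial, muted: true }),
  { ...initial, handRaised: true }, // stale copy: the mute is silently undone
);

const afterFields = writeFields(
  writeFields(initial, { muted: true }),
  { handRaised: true }, // touches only handRaised, so the mute survives
);
```

With whole-record writes the second client's stale copy resets `muted` to false; with field-level writes both changes are kept.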

✔️ Dec 2021 - Partial Updates (data infrastructure enhancement)

  • Impact: further decreases bandwidth usage and associated call quality problems
  • Details: We are further optimizing our data infrastructure by sending only partial updates over the wire, so that when an object updates, we don't need to send its entire payload. This should further help with bandwidth and associated call quality issues.
  • Author: Sean
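
The idea of sending only what changed can be sketched as a top-level diff plus a patch step. This is a simplified illustration, not the actual wire format: the `diff` and `applyPatch` helpers are hypothetical, and using `null` to mean "key removed" is an assumed convention for this example (so a field cannot legitimately hold `null` here).

```typescript
type Doc = Record<string, unknown>;

// Return only the top-level keys whose values differ between prev and next.
// Values are compared via JSON.stringify, which is simple but sensitive to
// key order inside nested objects. Removed keys are reported as null.
function diff(prev: Doc, next: Doc): Doc {
  const patch: Doc = {};
  for (const key of Object.keys(next)) {
    if (JSON.stringify(prev[key]) !== JSON.stringify(next[key])) {
      patch[key] = next[key];
    }
  }
  for (const key of Object.keys(prev)) {
    if (!(key in next)) {
      patch[key] = null;
    }
  }
  return patch;
}

// Apply a patch produced by diff() without mutating the input document.
function applyPatch(doc: Doc, patch: Doc): Doc {
  const result: Doc = { ...doc };
  for (const key of Object.keys(patch)) {
    if (patch[key] === null) {
      delete result[key];
    } else {
      result[key] = patch[key];
    }
  }
  return result;
}

// Only the changed topic goes over the wire, not the whole room object.
const room = { name: "Standup", topic: "Sprint", members: ["a", "b"] };
const updated = { name: "Standup", topic: "Retro", members: ["a", "b"] };
const patch = diff(room, updated);
const rebuilt = applyPatch(room, patch);
```

For a large object with one changed field, the patch stays small no matter how big the rest of the object is, which is where the bandwidth saving comes from.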

✔️ Dec 2021 - Non-A/V app lag

  • Impact: common actions, like exiting a full screen share, joining a room, or opening a discussion, are taking too long
  • Details: we are rendering too many DOM nodes in our avatar asset system which is bogging down the whole app. Changed the way we render avatars from <svg> tags to <img> tags to fix the issue.
  • Author: Doug

✔️ Dec 2021 - Periodic timeouts on infrastructure

  • Impact: a couple of times per week, requests take longer for 20-30s and can time out altogether
  • Details: API nodes were locking up because of an issue during Open Graph queries for unfurling links. A fix has been released.
  • Author: Sean

✔️ Dec 2021 - unnecessary rendering

  • Impact: periodic client side lag
  • Details: we are re-rendering large chunks of the app too much on the basis of some object changes that don't matter for most of the app
  • Author: Sean
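
One common way to avoid this kind of over-rendering is to let each part of the UI subscribe only to the slice of state it cares about. The sketch below shows that general technique under assumptions; it is not necessarily how the fix was implemented, and the `Store` class and `AppState` shape are hypothetical.

```typescript
type Listener = () => void;

// A tiny store whose subscribers are only notified when their slice changes.
class Store<S> {
  private state: S;
  private listeners: Listener[] = [];

  constructor(initial: S) {
    this.state = initial;
  }

  getState(): S {
    return this.state;
  }

  setState(next: S): void {
    this.state = next;
    for (const listener of this.listeners) {
      listener();
    }
  }

  // Call onChange only when the selected slice changes (compared with !==).
  subscribe<T>(select: (state: S) => T, onChange: (slice: T) => void): void {
    let last = select(this.state);
    this.listeners.push(() => {
      const current = select(this.state);
      if (current !== last) {
        last = current;
        onChange(current);
      }
    });
  }
}

interface AppState {
  roomName: string;
  cursorX: number;
}

const store = new Store<AppState>({ roomName: "Lobby", cursorX: 0 });

// Stand-in for re-rendering a header that only depends on roomName.
let headerRenders = 0;
store.subscribe((s) => s.roomName, () => {
  headerRenders += 1;
});

store.setState({ ...store.getState(), cursorX: 10 });         // header untouched
store.setState({ ...store.getState(), roomName: "Standup" }); // header re-renders
```

The cursor update changes the state object but not the selected `roomName`, so the header's callback never fires for it.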

✔️ Dec 2021 - Better sync rolls

  • Impact: eliminate instability during deploys
  • Details: when we do deploys, drain our sync servers more slowly so that we don't overwhelm our infrastructure with connections
  • Author: Sean
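
Draining "more slowly" can be pictured as moving clients over in small, spaced-out batches instead of all at once. The sketch below is an assumption-laden illustration rather than the actual deploy code: `planDrain`, the batch size, and the interval are hypothetical.

```typescript
interface DrainStep {
  delayMs: number;     // when this batch starts, measured from the start of the roll
  clientIds: string[]; // clients asked to reconnect in this batch
}

// Split connected clients into fixed-size batches spaced intervalMs apart,
// so the new servers never see every client reconnect at the same moment.
function planDrain(
  clientIds: string[],
  batchSize: number,
  intervalMs: number,
): DrainStep[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const steps: DrainStep[] = [];
  for (let i = 0; i < clientIds.length; i += batchSize) {
    steps.push({
      delayMs: (i / batchSize) * intervalMs,
      clientIds: clientIds.slice(i, i + batchSize),
    });
  }
  return steps;
}

// Five clients, two at a time, one second apart.
const plan = planDrain(["a", "b", "c", "d", "e"], 2, 1000);
```

A real roll would also want some randomness in the timing and a cap based on what the new servers can absorb; the fixed schedule here just shows the shape of the idea.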

✔️ Nov 2021 - fixed major data infrastructure bandwidth bug

  • Impact: fixes a major performance / bandwidth issue that largely manifests in call problems (audio drops, lag)
  • Details: a bug caused certain objects to get extraordinarily large (> 1MB) and then be repeatedly sent over the network to clients. Partial Updates, a separate project, was planned to improve on this further.
  • Author: Sean

✔️ Nov 2021 - Real-time data engine (hereafter "data infrastructure")

  • Impact: Increases scale and capability relative to Firebase real-time database. The biggest project in company history.
  • Details: read more here
  • Author: Entire team worked on this
