Brussels / 4 & 5 February 2023


Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe

We manage a 200TB glusterfs cluster in production, and along the way we have learnt some key lessons. In this session, we will share:

  • What minimal health checks a glusterfs volume needs to ensure high availability and consistency.
  • The problems we experienced with glusterfs's current cluster expansion step (rebalance), how we avoided the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm.
  • How we schedule our maintenance activities so that we never have downtime, even when things go wrong.
  • How we reduced the time to replace a node from weeks to a day.
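As an illustration of the first point, a minimal health check for a replicated glusterfs volume can be sketched roughly as below. This is not the checks the talk presents, only a plausible starting point: `data` is a hypothetical volume name, and the heal-backlog parser assumes the standard "Number of entries:" lines printed by `gluster volume heal <vol> info`.

```shell
#!/bin/sh
# VOL is an assumed volume name; adjust for your deployment.
VOL="${VOL:-data}"

# 1. Are all peers connected? Counts peers reported as connected.
peers_ok() {
  gluster peer status | grep -c "Peer in Cluster (Connected)"
}

# 2. Are any bricks offline? Counts "Online ... : N" rows in the detail view.
bricks_down() {
  gluster volume status "$VOL" detail | grep -c "^Online.*: N"
}

# 3. Self-heal backlog: sums the per-brick "Number of entries:" counts from
#    `gluster volume heal $VOL info` output supplied on stdin. A backlog that
#    does not trend to zero indicates pending (or stuck) heals.
heal_backlog() {
  awk -F': ' '/^Number of entries:/ {sum += $2} END {print sum+0}'
}
```

A cron job can alert when `bricks_down` is nonzero or `heal_backlog` stays above zero across runs.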

As the number of clients grew, we had to scale the system to handle the increasing load. Here are our learnings from scaling glusterfs:

  • How to profile glusterfs to find performance bottlenecks.
  • Why the client-io-threads feature didn't work for us, and how we achieved 4x application throughput by scaling the number of mounts instead.
  • How we improved incremental heal speed, and the patches we contributed upstream.
  • A roadmap for glusterfs based on these findings.
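For context on the profiling point, glusterfs ships per-fop latency counters that can be enabled with `gluster volume profile`. The sketch below is an assumed workflow, not the one from the talk: `data` is a hypothetical volume name, and the `top_fops` helper assumes rows shaped like "%-latency avg min max calls FOP" in the `profile info` output.

```shell
#!/bin/sh
# Hypothetical volume name; adjust for your deployment.
VOL="${VOL:-data}"

# Enable per-fop latency counters on the volume (run on a server node).
start_profiling() { gluster volume profile "$VOL" start; }

# Dump cumulative per-brick fop statistics gathered since profiling started.
dump_profile()    { gluster volume profile "$VOL" info; }

# Pick the fops dominating latency from `profile info` output on stdin:
# keep rows whose last field is an upper-case fop name and whose %-latency
# (first field) is nonzero, then sort by %-latency descending.
top_fops() {
  awk '$NF ~ /^[A-Z]+$/ && $1+0 > 0 {print $1, $NF}' | sort -rn | head -3
}
```

Running `dump_profile | top_fops` periodically shows which operations (e.g. LOOKUP, WRITE) dominate latency and are worth optimizing first.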


Pranith Kumar Karampuri