Lessons learnt managing and scaling a 200TB glusterfs cluster @PhonePe
- Track: Software Defined Storage devroom
- Room: H.2214
- Day: Saturday
- Start: 10:30
- End: 11:10
We manage a 200TB glusterfs cluster in production. Along the way we have learnt some key lessons, and in this session we will share:
- The minimal health checks a glusterfs volume needs to ensure high availability and consistency (see the sketch after this list).
- The problems we ran into with glusterfs's current cluster expansion steps (rebalance), how we avoided the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm for the future.
- How we schedule our maintenance activities so that we never have downtime, even if things go wrong.
- How we reduced the time to replace a node from weeks to a day.
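As a taste of the first bullet, here is a minimal sketch of the kind of volume health checks one can build on standard gluster CLI commands (`gluster peer status`, `gluster volume status`, `gluster volume heal ... info`). It is illustrative only: the volume name `myvol` is a placeholder, the output parsing is deliberately naive, and the exact checks we run at PhonePe may differ.

```python
#!/usr/bin/env python3
# Illustrative glusterfs health-check sketch, not PhonePe's actual tooling.
# Assumes a replicated volume named "myvol" (placeholder) and the standard
# gluster CLI on PATH. Output parsing is deliberately simplistic.
import subprocess
import sys

VOLUME = "myvol"  # hypothetical volume name

def run(cmd):
    """Run a gluster CLI command, returning stdout or exiting on error."""
    r = subprocess.run(cmd, capture_output=True, text=True)
    if r.returncode != 0:
        sys.exit(f"command failed: {' '.join(cmd)}\n{r.stderr}")
    return r.stdout

# 1. Every peer must be connected, or the cluster risks losing quorum.
if "Disconnected" in run(["gluster", "peer", "status"]):
    sys.exit("unhealthy: one or more peers disconnected")

# 2. Every brick must be online ("Y" in the Online column of volume status).
status = run(["gluster", "volume", "status", VOLUME])
offline = [l for l in status.splitlines()
           if l.startswith("Brick") and " N " in l]
if offline:
    sys.exit("unhealthy: bricks offline:\n" + "\n".join(offline))

# 3. No entries pending heal, i.e. all replicas are consistent.
heal = run(["gluster", "volume", "heal", VOLUME, "info"])
counts = [l for l in heal.splitlines() if l.startswith("Number of entries:")]
if any(not l.endswith(" 0") for l in counts):
    sys.exit("unhealthy: self-heal entries pending")

print(f"{VOLUME} looks healthy")
```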
As the number of clients increased, we had to scale the system to handle the growing load. Here is what we learnt while scaling glusterfs:
- How to profile glusterfs to find performance bottlenecks (see the sketch after this list).
- Why the client-io-threads feature didn't work for us, and how we changed our applications to achieve 4x throughput by scaling the number of mounts instead.
- How to improve incremental heal speed, and the patches we contributed upstream.
- The road map for glusterfs based on these findings.
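For the profiling bullet, the sketch below drives glusterfs's built-in `gluster volume profile` facility, which reports per-brick call counts and latency statistics for each file operation (FOP). This is the generic workflow, not the exact procedure from the talk; the volume name `myvol` is again a placeholder.

```python
#!/usr/bin/env python3
# Generic sketch of the `gluster volume profile` workflow; the volume
# name "myvol" is a placeholder, not our actual production setup.
import subprocess

VOLUME = "myvol"  # hypothetical volume name

def gluster(*args):
    """Run a gluster CLI subcommand and return its stdout."""
    return subprocess.run(("gluster",) + args, capture_output=True,
                          text=True, check=True).stdout

# Start collecting per-brick FOP statistics (call counts, min/avg/max latency).
gluster("volume", "profile", VOLUME, "start")

input(f"Profiling {VOLUME}: run your workload now, then press Enter...")

# The info dump shows which FOPs dominate: operations with high average
# latency or call counts are the bottleneck candidates to investigate.
print(gluster("volume", "profile", VOLUME, "info"))

# Profiling adds some overhead on the bricks, so stop it when done.
gluster("volume", "profile", VOLUME, "stop")
```

As for "scaling mounts" in the second bullet: each FUSE mount of a volume is served by its own glusterfs client process, so mounting the same volume at several mount points lets an application spread its I/O across multiple client processes; the 4x figure is from our workload and will vary by use case.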
Speakers
Sanju Rakonde
Pranith Kumar Karampuri