Ruby Central’s Infrastructure Improvements for RubyGems.org
At Ruby Central, we know that RubyGems.org is more than just a tool: it’s a lifeline for Ruby developers worldwide. With millions of daily downloads, it provides secure, reliable access to Ruby gems. To maintain its resilience and scalability, we’ve embarked on a series of essential infrastructure improvements designed to optimize both performance and cost.
Here’s how we’re strengthening the foundation of RubyGems.org to serve the Ruby community better:
From Rackspace to AWS
RubyGems.org has evolved significantly since its early days. Originally hosted on platforms like Heroku and Rackspace, it was maintained by individuals who volunteered their time and resources. But as demand grew, so did the need for a more scalable, stable infrastructure.
This led us to AWS, where we now leverage a Kubernetes-based architecture that supports high traffic volumes. This flexible, scalable setup ensures RubyGems.org remains reliable as it continues to grow with the community’s needs.
Addressing Technical Debt for Long-Term Stability
When the new on-call team assumed responsibility for managing RubyGems.org, we inherited technical debt from years of incremental updates. Key infrastructure components required updates to ensure security, stability, and compatibility with modern systems. Here’s a breakdown of the critical upgrades:
Kubernetes
- The version of Kubernetes powering RubyGems.org was approaching end-of-life, necessitating an urgent upgrade to maintain security standards and continued support. We upgraded Kubernetes to the latest version, integrating essential security patches and performance optimizations.
- We also updated core plugins within Kubernetes, such as CoreDNS and kube-proxy, which are critical in facilitating reliable internal communication across services. These upgrades strengthened the platform’s stability and improved processing speed, benefiting the entire Ruby community.
Postgres Database on AWS RDS
- Our database layer, managed by Postgres on AWS’s Relational Database Service (RDS), was running on a version nearing end-of-life. This posed a security risk and limited our ability to utilize newer database features.
- By upgrading to a newer Postgres version, we enhanced the database’s security and performance while gaining access to advanced features supporting RubyGems.org’s long-term stability.
- The on-call team established a process to upgrade our database without any downtime, ensuring uninterrupted service for developers.
Datadog for Real-Time Monitoring
- Datadog is at the core of our monitoring strategy, providing real-time visibility into system health and application performance.
- With the enhanced Datadog setup, we now monitor comprehensive performance metrics, including system health, response times, error rates, and database performance. This enables us to set up automated alerts for critical metrics so our team can detect and address issues proactively—often before users are impacted.
- Thanks to the Datadog agent, the RubyGems.org team has gained real-time monitoring and alerting for security and vulnerability exposures. This allows us to triage issues, assign the work, and apply security patches quickly, keeping our systems secure.
- Additionally, Datadog’s new dashboards and reporting tools provide a unified view of the entire system, allowing us to analyze performance across multiple availability zones and quickly identify areas for improvement. This proactive approach ensures that RubyGems.org remains reliable and performant for millions of developers worldwide.
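To make the idea of automated alerts on critical metrics concrete, here is a minimal sketch of threshold-based alerting on error rate. It is plain Python and does not use Datadog's API; the `Window` type, the 5% threshold, and the "consecutive windows" rule are assumptions chosen for illustration, not the team's actual monitor configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total_requests: int
    error_responses: int


def error_rate(window: Window) -> float:
    """Fraction of requests in the window that failed; 0.0 for an idle window."""
    if window.total_requests == 0:
        return 0.0
    return window.error_responses / window.total_requests


def should_alert(windows: List[Window], threshold: float, consecutive: int) -> bool:
    """Alert only when the most recent `consecutive` windows all exceed the threshold.

    Requiring several bad windows in a row avoids paging on a single brief spike.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(windows) < consecutive:
        return False
    recent = windows[-consecutive:]
    return all(error_rate(w) > threshold for w in recent)


if __name__ == "__main__":
    # Illustrative numbers only: the last two windows have error rates of 8% and 9%.
    history = [Window(1000, 2), Window(1000, 80), Window(1000, 90)]
    print(should_alert(history, threshold=0.05, consecutive=2))
```

Evaluating several windows together is one common way to trade a little detection delay for fewer false alarms; a real monitor would tune both values to the service.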
Improving Availability and Redundancy Across AWS Availability Zones
Reliability has always been a top priority for RubyGems.org. Previously, certain critical services, such as our OpenSearch-powered search cluster, were housed within a single AWS availability zone. This setup introduced risks: if an issue occurred within that zone—such as a power outage or network problem—our search feature could experience downtime.
To improve resilience, we implemented clustering across multiple availability zones within the same AWS region. If one zone experiences an issue, nodes in the other zones continue serving requests, maintaining uptime. This redundancy reduces the risk of service interruptions and helps keep critical features like gem search consistently available to developers worldwide.
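The failover behavior described here can be sketched in a few lines. This is a simplified model, not how OpenSearch or AWS routes traffic internally; the `Node` type, the zone names, and the "prefer the local zone" rule are hypothetical and exist only to show why spreading nodes across zones keeps a service reachable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """One cluster member (hypothetical model for illustration)."""
    name: str
    zone: str
    healthy: bool


def zones_in_service(nodes: List[Node]) -> List[str]:
    """Zones that still have at least one healthy node, sorted by name."""
    return sorted({n.zone for n in nodes if n.healthy})


def pick_node(nodes: List[Node], preferred_zone: str) -> Optional[Node]:
    """Prefer a healthy node in the caller's zone; otherwise fail over to another zone.

    Returns None only when no healthy node exists in any zone.
    """
    healthy = [n for n in nodes if n.healthy]
    for node in healthy:
        if node.zone == preferred_zone:
            return node
    return healthy[0] if healthy else None


if __name__ == "__main__":
    # zone-a is down; requests that would have gone there fail over to zone-b.
    cluster = [
        Node("search-1", "zone-a", healthy=False),
        Node("search-2", "zone-b", healthy=True),
        Node("search-3", "zone-c", healthy=True),
    ]
    print(pick_node(cluster, preferred_zone="zone-a"))
    print(zones_in_service(cluster))
```

With every node in a single zone, one zone-level failure would leave `pick_node` with nothing to return; with nodes in three zones, any single zone can fail and a healthy node remains.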
New Tools and Technologies to Enhance Performance
To meet the evolving needs of RubyGems.org’s growing user base, we’re integrating new tools and technologies to improve performance and reliability:
- Data Management with Amazon S3: Gems are stored in Amazon S3, which provides scalable, durable storage for the thousands of gems available. S3’s high availability ensures that gems are always accessible and securely stored.
- Global Distribution with Fastly CDN: To optimize download speed and minimize latency, RubyGems.org uses Fastly as its content delivery network (CDN). Fastly caches gems across its global network, reducing the load on our primary servers and allowing for faster access. When a requested gem is not already cached, it is retrieved from S3, cached, and served; later requests for the same gem are served from the cache. This setup ensures developers can quickly access gems from any location.
- Reliable Caching with Redis: We’re migrating from Memcached to Redis for a more robust caching solution. Unlike our single-node Memcached setup, Redis allows for a distributed, multi-node configuration, providing resilience against node failures. This transition improves redundancy and enhances caching speed.
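The relationship between the CDN and S3 follows a pull-through caching pattern, which the sketch below illustrates. It is a toy in-memory model, not Fastly's or S3's API: `PullThroughCache`, the `fetch_from_origin` callback, and the gem file name are hypothetical, and real CDNs also handle expiry, purging, and many edge locations.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy pull-through cache: serve from memory, fall back to the origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin  # stands in for a request to origin storage
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        """Return the object for `key`, contacting the origin only on a cache miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        body = self._fetch(key)
        self._store[key] = body  # later requests for this key skip the origin
        return body


if __name__ == "__main__":
    origin_requests = []

    def fake_origin(key: str) -> bytes:
        # Hypothetical origin; records each request so origin load is visible.
        origin_requests.append(key)
        return b"contents of " + key.encode()

    cache = PullThroughCache(fake_origin)
    for _ in range(3):
        cache.get("example-1.0.0.gem")
    print(len(origin_requests), cache.hits, cache.misses)
```

Only the first request reaches the origin; the repeats are answered from the cache, which is the load reduction the CDN provides for popular gems.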
Cost Management and Optimization
Operating RubyGems.org on AWS provides the reliability the community expects, but it also requires careful budgeting. We conduct regular cost reviews, and a recent audit of our S3 storage led us to remove outdated logs, resulting in approximately $1,000 in monthly savings.
We are also exploring additional cost-saving strategies, such as tiered storage within S3. By analyzing gem access patterns, we can classify infrequently accessed gems for lower-cost storage tiers while keeping high-demand gems readily accessible.
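As a rough illustration of classifying gems by access pattern, the sketch below assigns each gem to a tier based on how long ago it was last downloaded. The 90-day cutoff, the tier labels `standard` and `infrequent`, and the gem names are all assumptions for the example; they are not S3 storage class names or a policy the team has chosen.

```python
from datetime import date
from typing import Dict


def choose_tier(last_download: date, today: date, cold_after_days: int = 90) -> str:
    """Label a gem 'infrequent' once it has gone `cold_after_days` without a download."""
    idle_days = (today - last_download).days
    return "infrequent" if idle_days >= cold_after_days else "standard"


def plan_tiers(last_downloads: Dict[str, date], today: date) -> Dict[str, str]:
    """Map each gem name to a proposed tier, using the default cutoff."""
    return {name: choose_tier(seen, today) for name, seen in last_downloads.items()}


if __name__ == "__main__":
    # Hypothetical access data: one recently downloaded gem, one long idle.
    today = date(2024, 12, 4)
    observed = {
        "popular-gem": date(2024, 12, 3),
        "dormant-gem": date(2023, 6, 1),
    }
    print(plan_tiers(observed, today))
```

A production policy would likely weigh retrieval costs and download volume as well as recency, since moving an object that is later requested often can cost more than it saves.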
In addition, we’re investigating reserved instances for all of our compute resources to reduce the cost of predictable workloads. By committing to longer-term AWS capacity, we can control compute costs while ensuring reliable service during peak loads.
Support from the Sovereign Tech Agency and AWS
These improvements have been made possible through support from the Sovereign Tech Agency (STA) and AWS. STA funding has enabled us to tackle essential technical debt, enhance reliability, and add durability across the infrastructure. Additionally, AWS provides annual credits, which help offset the cost of redundancy across multiple availability zones.
Managing a Community-Supported, 24/7 System
Operating RubyGems.org as a 24/7 service brings unique challenges. Although RubyGems.org serves a global community around the clock, much of its maintenance relies on a small team of volunteers. These dedicated contributors are essential to RubyGems.org’s operation, providing expertise and support to keep it running smoothly.
Looking Ahead: What’s Next for RubyGems.org?
Our commitment to RubyGems.org is ongoing. We’re continuously exploring advanced infrastructure solutions to optimize costs, enhance security, and ensure RubyGems.org remains responsive to the community’s evolving needs. We’re also looking at additional options to further improve scalability and redundancy, future-proofing RubyGems.org as the ecosystem grows.
Thanking Our Volunteers, Sponsors, and Ruby Community
With the support of our volunteers, sponsors, and community, we’re building a stronger, more resilient infrastructure that will sustain RubyGems.org for years to come.
Thank you for being part of this journey with us and for supporting our work to keep RubyGems.org dependable for developers worldwide.
December 04, 2024