Pinata Outage - June 12, 2025

Justin Hunter

On June 12, 2025, Pinata suffered a widespread outage of multiple core product functionalities. From approximately 1:04 PM CDT until 3:26 PM CDT, the following Pinata services were partially or completely unavailable:

  • Uploads
  • File retrieval
  • Public IPFS Gateway

These services are core to our customers’ day-to-day operation, so we want to walk through what happened, what steps we took to try to mitigate the problem, and how we will work to avoid similar problems in the future.

What Happened

Google Cloud suffered a large-scale outage that affected many service providers. Pinata relies on a multi-cloud strategy to ensure resiliency and redundancy, and does not use Google Cloud Platform (GCP) directly. Unfortunately, other providers that are critical to our infrastructure did rely on GCP.

The largest of those providers is Cloudflare. Cloudflare is a key partner in our infrastructure stack, and their reliance on GCP meant that much of the internet was affected when GCP suffered its outage. One of the impacted product areas for Cloudflare was their KV service. Pinata uses Cloudflare’s KV service to ensure maximum performance for our IPFS gateways. When this service was unavailable, we were unable to serve content through our gateways.

This KV service is also a necessary part of the upload process. When a file is uploaded to IPFS through Pinata, we store metadata about the file to help route, cache, and track it. With Cloudflare’s KV service down, we could not write this metadata and therefore could not process uploads.
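
To make the dependency concrete, here is a minimal sketch of how a gateway Worker can lean on KV on both the retrieval and upload paths. The binding name, route, and metadata shape below are illustrative assumptions, not our actual implementation.

```typescript
// Minimal sketch of a Worker that depends on KV on both paths.
// The binding name (FILE_META) and metadata shape are illustrative only.
export interface Env {
  FILE_META: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const cid = new URL(request.url).pathname.replace("/ipfs/", "");

    if (request.method === "POST") {
      // Upload path: record routing/caching/tracking metadata for the file.
      // If this write fails, the upload cannot complete.
      await env.FILE_META.put(
        cid,
        JSON.stringify({ pinned: true, uploadedAt: Date.now() })
      );
      return new Response("stored", { status: 201 });
    }

    // Retrieval path: consult the stored metadata before serving content.
    const meta = await env.FILE_META.get(cid, "json");
    if (meta === null) {
      return new Response("not found", { status: 404 });
    }
    // ...fetch the file bytes from the storage backend and return them...
    return new Response("file contents would be returned here");
  },
};
```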

How We Mitigated The Outage

When it became evident that Cloudflare’s KV store was at the heart of the outage impacting our IPFS Gateway, we focused immediately on restoring core functionality with minimal external dependencies.

1. Isolating the KV Dependency

We began by identifying the components of our gateway architecture that relied on Cloudflare KV. To mitigate this failure point, we leveraged polymorphism within our cache layer to substitute the KV backend with a temporary in-memory hash map. This drop-in replacement preserved the same interface, allowing our services to operate seamlessly without requiring broader architectural changes.
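The swap was possible because the cache layer is written against an interface rather than against a concrete KV client. A minimal sketch of that pattern, using hypothetical names rather than our real code, looks like this:

```typescript
// Hypothetical cache abstraction; names do not reflect the real codebase.
interface MetadataCache {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Normal backend: Cloudflare Workers KV.
class KvCache implements MetadataCache {
  constructor(private kv: KVNamespace) {}
  get(key: string) {
    return this.kv.get(key);
  }
  put(key: string, value: string) {
    return this.kv.put(key, value);
  }
}

// Emergency backend used during the outage: an in-memory map.
// It loses data on restart and is not shared across isolates,
// but it satisfies the same interface, so callers are unchanged.
class InMemoryCache implements MetadataCache {
  private store = new Map<string, string>();
  async get(key: string) {
    return this.store.get(key) ?? null;
  }
  async put(key: string, value: string) {
    this.store.set(key, value);
  }
}
```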

2. Hotfix Deployment via Local Credentials

With the Cloudflare dashboard inaccessible, we pivoted to using a secure, break-glass deployment token stored in a highly restricted password manager. This allowed us to bypass the dashboard entirely and deploy a patched version of our worker using wrangler deploy, first to a staging environment for validation, and later to production.
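
For illustration, the break-glass path boiled down to something like the following Node script. The environment names are simplified, and the token itself is supplied out of band via the CLOUDFLARE_API_TOKEN environment variable after being retrieved from the password manager.

```typescript
// Rough sketch of the break-glass deploy flow; environment names are illustrative.
// The API token is pulled manually from the restricted password manager and
// exported as CLOUDFLARE_API_TOKEN so wrangler can authenticate without the
// (unavailable) dashboard login flow.
import { execSync } from "node:child_process";

function deploy(environment: "staging" | "production"): void {
  execSync(`npx wrangler deploy --env ${environment}`, { stdio: "inherit" });
}

// Validate on staging first, then promote the same hotfix to production.
deploy("staging");
// ...manual verification against the staging gateway...
deploy("production");
```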

3. Adapting the Build Pipeline

The outage also impacted npmjs, resulting in instability during the build process. To overcome this, we modified our build script to bypass fresh dependency installations and instead reuse cached packages already present on the build host. This adjustment enabled us to compile and deploy a working bundle without introducing further points of failure.
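
The change to the build script was essentially “skip a fresh install when usable dependencies are already cached on the host.” A simplified sketch of that guard, not the exact script we run:

```typescript
// Simplified build-script guard: reuse cached dependencies when the
// registry is unstable instead of doing a fresh install on every build.
import { execSync } from "node:child_process";
import { existsSync } from "node:fs";

if (existsSync("node_modules")) {
  console.log("Reusing cached node_modules; skipping dependency installation.");
} else {
  // --prefer-offline tells npm to use its local cache whenever possible.
  execSync("npm install --prefer-offline", { stdio: "inherit" });
}

// Build the Worker bundle as usual.
execSync("npm run build", { stdio: "inherit" });
```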

4. Prioritizing Critical Services

Gateway retrieval is a critical part of our platform, and during the incident our team recognized an opportunity to mitigate gateway downtime by temporarily turning off certain KV checks that the gateways normally rely on.

Operating without access to KV came with tradeoffs. Features such as metrics collection and restricting gateway retrieval to pinned content were temporarily disabled. After internal discussion and alignment, we agreed that the impact of gateways being 100% down far outweighed the potential risks of removing these KV checks.

Because removing these KV checks opened the gateways up to more traffic than usual, we have also marked all traffic that occurred during this period as free traffic; it will not count toward customers’ monthly billable usage.
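
For illustration, “removing the KV checks” amounted to a fail-open guard along these lines; the flag and function names here are hypothetical:

```typescript
// Hypothetical fail-open guard around the KV-backed gateway checks.
// When KV_BYPASS is set, the gateway skips the pinned-content restriction
// (and the metrics writes) that normally require KV to be available.
interface GatewayEnv {
  FILE_META: KVNamespace;
  KV_BYPASS?: string; // set to "true" only during the incident
}

async function isServable(cid: string, env: GatewayEnv): Promise<boolean> {
  if (env.KV_BYPASS === "true") {
    // Fail open: accept extra traffic rather than keep gateways 100% down.
    return true;
  }
  const meta = (await env.FILE_META.get(cid, "json")) as { pinned?: boolean } | null;
  return meta?.pinned === true;
}
```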

5. Staging Validation and Full Restoration

Once the staging environment demonstrated stable behavior under the new configuration, we deployed the hotfix to production. This restored full gateway functionality, ensuring our users once again had reliable access to IPFS content.

How We Plan To Prevent This In The Future

Cloud infrastructure is often a maze of dependencies. We didn’t do a good enough job of predicting and managing lower-level dependencies. That’s something we can and will correct.

Going forward, we will audit our infrastructure stack with wider assumptions about underlying dependencies, and we will reduce the impact of any outage as much as possible.

We’ve already begun implementing these changes and will continue improving. This will be an ongoing initiative that requires constant checks on our assumptions, and we are fully committed to it.

The internet is a tangled web of interconnected parts, but we have one very simple job, and we failed at that. It’s something we take seriously, and we will get better.
