Jun 18, 2025

Design your own (mini) Cloudflare Images

Services like Cloudflare Images and Imgix offer a powerful, off-the-shelf solution for hosting and processing images. They abstract away the complexity, providing a simple API to solve a difficult problem. What architectural patterns and trade-offs are involved in designing a system that can ingest, process, and serve millions of images efficiently and reliably? I love systems design and I use Cloudflare often, so let's design a mini version of our own:

Core Functional Requirements:

Image Ingestion: A write-heavy endpoint capable of handling high-throughput uploads.
Persistent Storage: A durable, highly-available, and scalable object store for original, high-quality images.
On-the-Fly Transformation: Real-time image processing for resizing, cropping, watermarking, and format conversion (e.g., JPEG to WebP/AVIF).
Global Content Delivery: Low-latency delivery of processed images to users worldwide.
High Availability & Fault Tolerance: The system must be resilient to component failures.

Hopefully you'll learn basic systems design principles, such as handling object storage or caching strategies.

A Simple Implementation

For learning purposes, let's pretend we already have a mini prototype that handles this. We'll keep the stack for our prototype minimal:

Go: The choice for the API service. It's fast and simple, with built-in concurrency and a strong standard library.
FFmpeg: A powerhouse for image processing.

[User Request]
|
V
[Go API Server]
| |
| V
| [Image Storage] (e.g., local disk, S3-compatible storage)
| ^
| |
V |
[FFmpeg Processor]
|
V
[Cache Layer] (e.g., in-memory, Redis)
|
V
[CDN] --- Serves the cached or processed images

Project Structure

Basic project structure for our Go application:

└── mini-cloudflare-images/
    ├── cmd/
    │   └── main.go         # Entry point
    ├── internal/
    │   ├── handlers/       # API Handlers
    │   │   ├── upload.go
    │   │   ├── retrieve.go
    │   │   └── delete.go
    │   ├── storage/        # Interacting with storage
    │   │   └── local.go
    │   ├── ffmpeg/         # Processing images
    │   │   └── process.go
    │   └── cache/          # Cache
    │       └── cache.go
    └── images/             # Directory to store images
        ├── original/
        └── processed/

Storage Layer

We'll keep it simple and use the local filesystem for storage. In a production system, you might want to use S3 or another cloud storage solution.

// internal/storage/local.go
package storage

import (
 "fmt"
 "os"
 "path/filepath"
)

// Creates the necessary directories for storing images if they don't already exist
func EnsureStorageDirs() error {
 if err := os.MkdirAll(filepath.Join("images", "original"), 0755); err != nil {
  return fmt.Errorf("failed to create original images directory: %w", err)
 }
 if err := os.MkdirAll(filepath.Join("images", "processed"), 0755); err != nil {
  return fmt.Errorf("failed to create processed images directory: %w", err)
 }
 return nil
}

// Returns the full path for an image given its ID and variant
func GetImagePath(variant, imageID string) string {
 return filepath.Join("images", variant, imageID)
}

// Removes an image file from the specified variant directory
func DeleteImage(variant, imageID string) error {
 path := GetImagePath(variant, imageID)
 if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
  return fmt.Errorf("failed to delete image %s: %w", path, err)
 }
 return nil
}

Memory Cache

To avoid reprocessing images every time they are requested, we'll add a simple in-memory cache. Note the sync.RWMutex used for safe concurrent reads and writes.

// internal/cache/cache.go
package cache

import "sync"

// A simple thread-safe in-memory key-value store
type MemoryCache struct {
 mu    sync.RWMutex
 items map[string]interface{}
}

func New() *MemoryCache {
 return &MemoryCache{
  items: make(map[string]interface{}),
 }
}

// Stores value under key, overwriting any existing item
func (c *MemoryCache) Set(key string, value interface{}) {
 c.mu.Lock()
 defer c.mu.Unlock()
 c.items[key] = value
}

// Returns the item or nil, and a boolean indicating whether the key was found
func (c *MemoryCache) Get(key string) (interface{}, bool) {
 c.mu.RLock()
 defer c.mu.RUnlock()
 item, found := c.items[key]
 return item, found
}

API Handlers

Uploading:

// internal/handlers/upload.go
package handlers

import (
 "fmt"
 "io"
 "log"
 "mini-cloudflare-images/internal/storage"
 "net/http"
 "os"

 "github.com/google/uuid"
)

func UploadImage(w http.ResponseWriter, r *http.Request) {
 file, _, err := r.FormFile("image")
 if err != nil {
  log.Printf("Error getting form file: %v", err)
  http.Error(w, "Failed to get image from form", http.StatusBadRequest)
  return
 }
 defer file.Close()

 imageID := uuid.New().String()
 // We'll store originals under a .jpg name for consistency.
 // Note: the upload is not converted; the bytes are saved exactly as received.
 originalFileName := imageID + ".jpg"
 originalPath := storage.GetImagePath("original", originalFileName)

 dst, err := os.Create(originalPath)
 if err != nil {
  log.Printf("Error creating destination file: %v", err)
  http.Error(w, "Failed to save image", http.StatusInternalServerError)
  return
 }
 defer dst.Close()

 if _, err := io.Copy(dst, file); err != nil {
  log.Printf("Error copying image data: %v", err)
  http.Error(w, "Failed to write image data", http.StatusInternalServerError)
  return
 }

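 // Headers must be set before WriteHeader is called
 w.Header().Set("Content-Type", "application/json")
 w.WriteHeader(http.StatusCreated)
 fmt.Fprintf(w, `{"status": "success", "imageId": "%s"}`, imageID)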
}

Retrieval and on-the-fly processing:

// internal/handlers/retrieve.go
package handlers

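import (
 "fmt"
 "log"
 "mini-cloudflare-images/internal/cache"
 "mini-cloudflare-images/internal/ffmpeg"
 "mini-cloudflare-images/internal/storage"
 "net/http"
 "os"
 "strconv"

 "github.com/google/uuid"
)

var imgCache = cache.New()

func RetrieveImage(w http.ResponseWriter, r *http.Request) {
 imageID := r.PathValue("id")
 // The ID ends up in a file path, so reject anything that isn't a UUID
 if _, err := uuid.Parse(imageID); err != nil {
  http.NotFound(w, r)
  return
 }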
 widthStr := r.URL.Query().Get("width")

 // If no width is specified, serve the original
 if widthStr == "" {
  originalPath := storage.GetImagePath("original", imageID+".jpg")
  if _, err := os.Stat(originalPath); os.IsNotExist(err) {
   http.NotFound(w, r)
   return
  }
  http.ServeFile(w, r, originalPath)
  return
 }

 width, err := strconv.Atoi(widthStr)
 // Reject non-numeric, zero, and negative widths before they reach ffmpeg
 if err != nil || width <= 0 {
  http.Error(w, "Invalid width parameter", http.StatusBadRequest)
  return
 }

 cacheKey := fmt.Sprintf("%s_w%d", imageID, width)
 if cachedPath, found := imgCache.Get(cacheKey); found {
  http.ServeFile(w, r, cachedPath.(string))
  return
 }

 originalPath := storage.GetImagePath("original", imageID+".jpg")
 processedFileName := fmt.Sprintf("%s_w%d.jpg", imageID, width)
 processedPath := storage.GetImagePath("processed", processedFileName)

 if err := ffmpeg.ResizeImage(originalPath, processedPath, width); err != nil {
  log.Printf("Error processing image %s: %v", imageID, err)
  http.Error(w, "Failed to process image", http.StatusInternalServerError)
  return
 }

 imgCache.Set(cacheKey, processedPath)
 http.ServeFile(w, r, processedPath)
}
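
The handler above calls ffmpeg.ResizeImage, which lives in internal/ffmpeg/process.go and isn't shown in this post. Here's a minimal sketch of what it could look like, assuming the ffmpeg binary is installed and on the PATH. The resizeArgs helper is a hypothetical name used here so the argument list can be checked without running ffmpeg; in the project this would be package ffmpeg rather than package main.

```go
package main

import (
	"fmt"
	"os/exec"
)

// resizeArgs builds the ffmpeg arguments for scaling an image to a target width.
// A height of -1 tells the scale filter to preserve the aspect ratio.
func resizeArgs(inputPath, outputPath string, width int) []string {
	return []string{
		"-y",            // overwrite the output file if it already exists
		"-i", inputPath, // input file
		"-vf", fmt.Sprintf("scale=%d:-1", width),
		outputPath,
	}
}

// ResizeImage shells out to ffmpeg (assumed to be on PATH) and waits for it to finish.
func ResizeImage(inputPath, outputPath string, width int) error {
	if width <= 0 {
		return fmt.Errorf("width must be positive, got %d", width)
	}
	cmd := exec.Command("ffmpeg", resizeArgs(inputPath, outputPath, width)...)
	// ffmpeg writes its diagnostics to stderr, so include them in the error.
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("ffmpeg failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Print the command line we would run for a 300px-wide variant.
	fmt.Println(resizeArgs("in.jpg", "out.jpg", 300))
}
```

Because CombinedOutput waits for the process to exit, this is exactly the blocking call discussed later in the synchronous vs asynchronous section.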

Deleting:

// internal/handlers/delete.go
package handlers

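import (
 "fmt"
 "log"
 "mini-cloudflare-images/internal/storage"
 "net/http"

 "github.com/google/uuid"
)

func DeleteImage(w http.ResponseWriter, r *http.Request) {
 imageID := r.PathValue("id")
 // The ID ends up in a file path, so reject anything that isn't a UUID
 if _, err := uuid.Parse(imageID); err != nil {
  http.NotFound(w, r)
  return
 }
 originalFileName := imageID + ".jpg"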

 if err := storage.DeleteImage("original", originalFileName); err != nil {
  log.Printf("Error deleting image %s: %v", imageID, err)
  http.Error(w, "Failed to delete image", http.StatusInternalServerError)
  return
 }

 // Optional: You could also iterate and delete all processed variants
 // For simplicity, we are only deleting the original here.

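 // Headers must be set before WriteHeader is called
 w.Header().Set("Content-Type", "application/json")
 w.WriteHeader(http.StatusOK)
 fmt.Fprintf(w, `{"status": "success", "message": "Image %s deleted"}`, imageID)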
}

Tying It All Together

Finally, we'll set up our routes, initialize the necessary directories, and start the web server. We'll use Go's built-in http.ServeMux for routing.

// cmd/main.go
package main

import (
 "log"
 "mini-cloudflare-images/internal/handlers"
 "mini-cloudflare-images/internal/storage"
 "net/http"
)

func main() {
 if err := storage.EnsureStorageDirs(); err != nil {
  log.Fatalf("Could not create storage directories: %v", err)
 }

 mux := http.NewServeMux()

 // The 'images' dir is our simulated CDN/File Server
 // This serves files directly from the 'images/processed' directory
 fs := http.FileServer(http.Dir("./images/processed"))
 mux.Handle("/cdn/", http.StripPrefix("/cdn/", fs))

 // API Routes
 mux.HandleFunc("POST /upload", handlers.UploadImage)
 mux.HandleFunc("GET /images/{id}", handlers.RetrieveImage)
 mux.HandleFunc("DELETE /delete/{id}", handlers.DeleteImage)

 log.Println("Starting server on :8080")
 if err := http.ListenAndServe(":8080", mux); err != nil {
  log.Fatalf("Could not start server: %s\n", err)
 }
}

IF YOU WANTED TO SKIP THE CODE, YOU CAN PICK BACK UP HERE

Systems Design Considerations

This simple prototype is cool and all, but building a system that can handle millions of requests requires us to think about how each component will scale. We'll explore the architectural decisions and trade-offs involved:

1. Storage

Local Disk Doesn't Scale

Our simple implementation writes files directly to a local images/ directory. This is a critical flaw for any system that needs to run on more than one server.

If we run two instances of our Go API server behind a load balancer, which server gets the uploaded image? If Server A handles the upload, the image is stored on its local disk. When a retrieval request for that same image comes in but gets routed to Server B, Server B won't find the file, leading to a 404 error. This makes our application "stateful," meaning each server holds unique data that others don't.

A Distributed Solution

The industry-standard solution is to use a distributed object storage service like AWS S3, Google Cloud Storage, or DigitalOcean Spaces.

Benefits:

Stateless: By offloading storage to an external, shared service, our Go API servers become stateless. Any server can handle any request because the source of truth for images is now a central, highly-available, and durable location.
Decoupling: Storage scales independently of our compute services, so each can grow according to our needs.
Durability & Availability: These services are designed for extreme durability (e.g., S3's eleven 9s, or 99.999999999%) by automatically replicating data across multiple physical locations.

2. Synchronous vs Asynchronous Processing

Synchronous

When a user uploads an image, we need to process it to create different sizes or formats. We're doing this using ffmpeg. Our current design processes images "on-the-fly". When a request for a new image size arrives, the handler for that request blocks until ffmpeg finishes its work, and every running ffmpeg process competes for the same server's CPU.

If we get a burst of requests for new image variants, the server's load will spike, and response times for all requests will skyrocket. This is synchronous processing, and it makes our system vulnerable to performance degradation and poor user experience for that first-time load.

Asynchronous: Worker Queues

A much more resilient and scalable system uses a message queue and a pool of dedicated workers.

Enqueue Job: When the API server receives an upload, instead of processing it immediately, it simply publishes a "job" message to a message queue like RabbitMQ, AWS SQS, or Google Pub/Sub. The message contains details like the image ID and the location of the original in object storage.
Asynchronous Workers: We run a separate fleet of services (our "workers"). Their only job is to pull messages from the queue and process them.
Process and Store: When a worker gets a job, it downloads the original image from object storage, performs all the necessary processing (e.g., creates several standard sizes), and uploads the processed variants back to object storage.

This approach gives us:

Responsiveness: The user's upload request completes almost instantly because the API server's only job is to accept the file and create a job message.
Decoupling: The processing workload is completely decoupled from the API. If image uploads spike, we can simply scale up our worker fleet independently of the API servers to handle the load.
Resilience: If a worker fails mid-process, the message can be returned to the queue and picked up by another worker, ensuring the image eventually gets processed.

3. Cache

Our code uses a simple in-memory map with a mutex as a cache. This shares the exact same weakness as local disk storage: it's tied to a single server instance. If Server A processes image1_w300.jpg and caches the result, Server B knows nothing about it and will needlessly re-process the same image.

Multi-Layer Distributed Cache

A multi-layered caching strategy is used to absorb traffic and reduce load.

Layer 1: CDN Edge Cache

A global CDN like Cloudflare or AWS CloudFront can cache images at the edge, close to users. This reduces latency and offloads traffic from our servers.

Layer 2: Distributed Metadata Cache

If our CDN cache misses, we can fall back to Redis or Memcached to cache image metadata. Before attempting to process an image, the API server checks this cache to see if the processed variant already exists. If it does, it can skip the processing step entirely.

A distributed cache gives us:

Shared State: For example, a Redis cluster becomes the single source of truth for our cache. Every server checks Redis before attempting to process an image.
Features: Built-in capabilities like Time-To-Live (TTL), which can be used to automatically evict stale cache entries.

4. All Together: Scalable System Architecture

Considering these distributed principles, our production architecture would look vastly different from our initial prototype.

[User Request]-------------> |       Real CDN         |
                             |(Cloudflare, CloudFront)|
                             +-----------+------------+
                                         | (Cache MISS)
                                         |
                       +-----------------v-----------------+
                       |           Load Balancer           |
                       +-----------------+-----------------+
                                         |
                  +----------------v----------------v----------------+
                  |  Go API Server 1 |  Go API Server 2 |  Go API Server N | (Stateless)
                  +----------------+----------------+----------------+
                    | (Upload)       | (Read Metadata) | (Delete)
                    |                |                 |
(Job Message) +-----v----------------+-----------------v----------+
              |     [Message Queue (e.g., SQS, RabbitMQ)]      |
              +--------------------+----------------------------+
                                   | (New Job)
+----------------------------------v----------------------------------+
|               Auto-Scaling Group of Processor Workers               |
| +----------------+   +----------------+   +----------------+        |
| |   Worker 1     |   |   Worker 2     |...|   Worker N     |        |
| +-------+--------+   +-------+--------+   +----------------+        |
|         | (Process)          | (Process)                            |
+---------+--------------------+--------------------------------------+
          |                    |
          |  (Read Original)   | (Write Processed)
          |                    |
+---------v--------------------v----------+   +---------------------------+
|   Distributed Object Storage (S3)       |   | Distributed Cache (Redis) |
| - /originals                            |---| - /processed_variants     |
| - /processed                            |   +---------------------------+
+-----------------------------------------+

This is better. It's horizontally scalable at every layer, resilient to individual component failures, and designed for high performance.

Should You Build Your Own?

For most of us, probably not. But at least it was fun to learn! Understanding the architecture behind services like Cloudflare Images empowers us as engineers to build better, more scalable systems of our own. Jokes aside, there ARE some reasons you might want to:

Cost: For high-volume operations, rolling your own can be more cost-effective.
Privacy/Compliance: Keep your data entirely within your control.

There's also much to improve on, such as observability. For a distributed system, comprehensive logging, monitoring, and tracing are critical! But that's for a different day.

That's all from me!

-Caleb

P.S. This article is not AI generated, only the bash diagram!