A systems design walkthrough of a mini Cloudflare Images service
Services like Cloudflare Images and Imgix offer a powerful, off-the-shelf solution for hosting and processing images. They abstract away the complexity, providing a simple API to solve a difficult problem. What architectural patterns and trade-offs are involved in designing a system that can ingest, process, and serve millions of images efficiently and reliably? I love systems design and I use Cloudflare often, so let's design a mini version of our own:
Core Functional Requirements:
Hopefully you'll learn basic systems design principles, such as handling object storage or caching strategies.
For learning purposes, let's pretend we already have a mini prototype that handles this. We'll keep the stack for our prototype minimal:
[User Request]
       |
       V
[Go API Server]
   |       |
   |       V
   |   [Image Storage] (e.g., local disk, S3-compatible storage)
   |       ^
   |       |
   V       |
[FFmpeg Processor]
       |
       V
[Cache Layer] (e.g., in-memory, Redis)
       |
       V
[CDN] --- Serves the cached or processed images
Basic project structure for our Go application:
└── mini-cloudflare-images/
    ├── cmd/
    │   └── main.go          # Entry point
    ├── internal/
    │   ├── handlers/        # API Handlers
    │   │   ├── upload.go
    │   │   ├── retrieve.go
    │   │   └── delete.go
    │   ├── storage/         # Interacting with storage
    │   │   └── local.go
    │   ├── ffmpeg/          # Processing images
    │   │   └── process.go
    │   └── cache/           # Cache
    │       └── cache.go
    └── images/              # Directory to store images
        ├── original/
        └── processed/
We'll keep it simple and use the local filesystem for storage. In a production system, you might want to use S3 or another cloud storage solution.
// internal/storage/local.go
package storage

import (
	"fmt"
	"os"
	"path/filepath"
)

// Creates the necessary directories for storing images if they don't already exist
func EnsureStorageDirs() error {
	if err := os.MkdirAll(filepath.Join("images", "original"), 0755); err != nil {
		return fmt.Errorf("failed to create original images directory: %w", err)
	}
	if err := os.MkdirAll(filepath.Join("images", "processed"), 0755); err != nil {
		return fmt.Errorf("failed to create processed images directory: %w", err)
	}
	return nil
}

// Returns the full path for an image given its ID and variant
func GetImagePath(variant, imageID string) string {
	return filepath.Join("images", variant, imageID)
}

// Removes an image file from the specified variant directory
func DeleteImage(variant, imageID string) error {
	path := GetImagePath(variant, imageID)
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to delete image %s: %w", path, err)
	}
	return nil
}
To avoid reprocessing images every time they are requested, we'll add a simple in-memory cache. Note the sync.RWMutex, which makes concurrent reads and writes safe.
// internal/cache/cache.go
package cache

import "sync"

// A simple thread-safe in-memory key-value store
type MemoryCache struct {
	mu    sync.RWMutex
	items map[string]interface{}
}

func New() *MemoryCache {
	return &MemoryCache{
		items: make(map[string]interface{}),
	}
}

// Stores a value under the given key, overwriting any existing item
func (c *MemoryCache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = value
}

// Returns the item or nil, and a boolean indicating whether the key was found
func (c *MemoryCache) Get(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	item, found := c.items[key]
	return item, found
}
Uploading:
// internal/handlers/upload.go
package handlers

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"

	"github.com/google/uuid"

	"mini-cloudflare-images/internal/storage"
)

func UploadImage(w http.ResponseWriter, r *http.Request) {
	file, _, err := r.FormFile("image")
	if err != nil {
		log.Printf("Error getting form file: %v", err)
		http.Error(w, "Failed to get image from form", http.StatusBadRequest)
		return
	}
	defer file.Close()

	imageID := uuid.New().String()
	// We'll store originals as .jpg for consistency
	originalFileName := imageID + ".jpg"
	originalPath := storage.GetImagePath("original", originalFileName)

	dst, err := os.Create(originalPath)
	if err != nil {
		log.Printf("Error creating destination file: %v", err)
		http.Error(w, "Failed to save image", http.StatusInternalServerError)
		return
	}
	defer dst.Close()

	if _, err := io.Copy(dst, file); err != nil {
		log.Printf("Error copying image data: %v", err)
		http.Error(w, "Failed to write image data", http.StatusInternalServerError)
		return
	}

	// The body is JSON, so say so before writing the status line
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusCreated)
	fmt.Fprintf(w, `{"status": "success", "imageId": "%s"}`, imageID)
}
Retrieval and on-the-fly processing:
// internal/handlers/retrieve.go
package handlers

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"

	"mini-cloudflare-images/internal/cache"
	"mini-cloudflare-images/internal/ffmpeg"
	"mini-cloudflare-images/internal/storage"
)

var imgCache = cache.New()

func RetrieveImage(w http.ResponseWriter, r *http.Request) {
	imageID := r.PathValue("id")
	widthStr := r.URL.Query().Get("width")

	// Return 404 up front if the original is missing, so an unknown ID
	// never reaches ffmpeg (which would otherwise surface as a 500)
	originalPath := storage.GetImagePath("original", imageID+".jpg")
	if _, err := os.Stat(originalPath); os.IsNotExist(err) {
		http.NotFound(w, r)
		return
	}

	// If no width is specified, serve the original
	if widthStr == "" {
		http.ServeFile(w, r, originalPath)
		return
	}

	// Reject non-numeric, zero, and negative widths
	width, err := strconv.Atoi(widthStr)
	if err != nil || width <= 0 {
		http.Error(w, "Invalid width parameter", http.StatusBadRequest)
		return
	}

	cacheKey := fmt.Sprintf("%s_w%d", imageID, width)
	if cachedPath, found := imgCache.Get(cacheKey); found {
		http.ServeFile(w, r, cachedPath.(string))
		return
	}

	processedFileName := fmt.Sprintf("%s_w%d.jpg", imageID, width)
	processedPath := storage.GetImagePath("processed", processedFileName)
	if err := ffmpeg.ResizeImage(originalPath, processedPath, width); err != nil {
		log.Printf("Error processing image %s: %v", imageID, err)
		http.Error(w, "Failed to process image", http.StatusInternalServerError)
		return
	}

	imgCache.Set(cacheKey, processedPath)
	http.ServeFile(w, r, processedPath)
}
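The handler calls ffmpeg.ResizeImage, which lives in internal/ffmpeg/process.go and isn't shown in this walkthrough. Here is a minimal sketch of what it could look like, assuming the ffmpeg binary is on the PATH and that a plain scale filter is all we need. The helper buildResizeArgs is a hypothetical name of mine, and the snippet is written as package main only so it runs on its own; in the project it would be package ffmpeg.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildResizeArgs returns the ffmpeg arguments for scaling src to the given
// width. A height of -1 tells the scale filter to preserve the aspect ratio.
// (Hypothetical helper, split out so the arguments can be inspected.)
func buildResizeArgs(src, dst string, width int) []string {
	return []string{
		"-y",      // overwrite dst if it already exists
		"-i", src, // input file
		"-vf", fmt.Sprintf("scale=%d:-1", width),
		dst,
	}
}

// ResizeImage shells out to ffmpeg and includes its output in any error,
// since ffmpeg's stderr is usually the only useful diagnostic.
func ResizeImage(src, dst string, width int) error {
	if width <= 0 {
		return fmt.Errorf("invalid width %d", width)
	}
	out, err := exec.Command("ffmpeg", buildResizeArgs(src, dst, width)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ffmpeg failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: resize <src.jpg> <dst.jpg>")
		return
	}
	if err := ResizeImage(os.Args[1], os.Args[2], 300); err != nil {
		fmt.Println("resize failed:", err)
	}
}
```

Spawning a process per request is the expensive part of this design, which is worth keeping in mind for the scaling discussion later on.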
Deleting:
// internal/handlers/delete.go
package handlers

import (
	"fmt"
	"log"
	"net/http"

	"mini-cloudflare-images/internal/storage"
)

func DeleteImage(w http.ResponseWriter, r *http.Request) {
	imageID := r.PathValue("id")
	originalFileName := imageID + ".jpg"

	if err := storage.DeleteImage("original", originalFileName); err != nil {
		log.Printf("Error deleting image %s: %v", imageID, err)
		http.Error(w, "Failed to delete image", http.StatusInternalServerError)
		return
	}

	// Optional: You could also iterate and delete all processed variants
	// For simplicity, we are only deleting the original here.

	// The body is JSON, so say so before writing the status line
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	fmt.Fprintf(w, `{"status": "success", "message": "Image %s deleted"}`, imageID)
}
Finally, we'll set up our routes, initialize the necessary directories, and start the web server. We'll use Go's built-in http.ServeMux for routing.
// cmd/main.go
package main

import (
	"log"
	"net/http"

	"mini-cloudflare-images/internal/handlers"
	"mini-cloudflare-images/internal/storage"
)

func main() {
	if err := storage.EnsureStorageDirs(); err != nil {
		log.Fatalf("Could not create storage directories: %v", err)
	}

	mux := http.NewServeMux()

	// The 'images' dir is our simulated CDN/File Server
	// This serves files directly from the 'images/processed' directory
	fs := http.FileServer(http.Dir("./images/processed"))
	mux.Handle("/cdn/", http.StripPrefix("/cdn/", fs))

	// API Routes
	mux.HandleFunc("POST /upload", handlers.UploadImage)
	mux.HandleFunc("GET /images/{id}", handlers.RetrieveImage)
	mux.HandleFunc("DELETE /delete/{id}", handlers.DeleteImage)

	log.Println("Starting server on :8080")
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Fatalf("Could not start server: %s\n", err)
	}
}
This simple prototype is cool and all, but building a system that can handle millions of requests requires us to think about how each component will scale. We'll explore the architectural decisions and trade-offs involved:
Our simple implementation writes files directly to a local images/ directory.
This is a critical flaw for any system that needs to run on more than one server.
If we run two instances of our Go API server behind a load balancer, which server gets the uploaded image? If Server A handles the upload, the image is stored on its local disk. When a retrieval request for that same image comes in but gets routed to Server B, Server B won't find the file, leading to a 404 error. This makes our application "stateful," meaning each server holds unique data that others don't.
The industry-standard solution is to use a distributed object storage service like AWS S3, Google Cloud Storage, or DigitalOcean Spaces.
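One way to prepare the prototype for that move is to hide storage behind a small interface, so the handlers never touch the filesystem directly. The sketch below is only an illustration: BlobStore, MemStore, and saveOriginal are hypothetical names, the in-memory implementation is a stand-in, and a real one would wrap an S3-compatible client and most likely stream with io.Reader instead of passing byte slices.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotFound is returned when a key has no blob stored under it.
var ErrNotFound = errors.New("blob not found")

// BlobStore is the only storage API the handlers would see. Any backend
// (local disk, S3, GCS) can sit behind it.
type BlobStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
	Delete(key string) error
}

// MemStore is an in-memory stand-in used here so the sketch is runnable.
type MemStore struct {
	mu    sync.RWMutex
	blobs map[string][]byte
}

func NewMemStore() *MemStore {
	return &MemStore{blobs: make(map[string][]byte)}
}

func (s *MemStore) Put(key string, data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Copy so later changes to the caller's slice don't affect the store.
	s.blobs[key] = append([]byte(nil), data...)
	return nil
}

func (s *MemStore) Get(key string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	data, ok := s.blobs[key]
	if !ok {
		return nil, ErrNotFound
	}
	return data, nil
}

func (s *MemStore) Delete(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.blobs, key)
	return nil
}

// saveOriginal depends only on the interface, so swapping the backend
// requires no handler changes.
func saveOriginal(store BlobStore, imageID string, data []byte) (string, error) {
	key := "originals/" + imageID + ".jpg"
	return key, store.Put(key, data)
}

func main() {
	var store BlobStore = NewMemStore()
	key, err := saveOriginal(store, "abc123", []byte("fake jpeg bytes"))
	if err != nil {
		fmt.Println("save failed:", err)
		return
	}
	data, err := store.Get(key)
	if err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Printf("stored %d bytes under %s\n", len(data), key)
}
```

With this in place, every API server reads and writes the same keys, so it no longer matters which instance handled the upload.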
Benefits:
When a user uploads an image, we need to process it to create different sizes or formats. We're doing this using ffmpeg. Our current design processes images "on-the-fly". When a request for a new image size arrives, the API handler blocks—doing nothing else—until ffmpeg finishes its work.
If we get a burst of requests for new image variants, the server's load will spike, and response times for all requests will skyrocket. This is synchronous processing, and it makes our system vulnerable to performance degradation and poor user experience for that first-time load.
A much more resilient and scalable system uses a message queue and a pool of dedicated workers.
Enqueue Job: When the API server receives an upload, instead of processing it immediately, it simply publishes a "job" message to a message queue like RabbitMQ, AWS SQS, or Google Pub/Sub. The message contains details like the image ID and the location of the original in object storage.
Asynchronous Workers: We run a separate fleet of services (our "workers"). Their only job is to pull messages from the queue and process them.
Process and Store: When a worker gets a job, it downloads the original image from object storage, performs all the necessary processing (e.g., creates several standard sizes), and uploads the processed variants back to object storage.
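The three steps above can be sketched inside a single process by letting a buffered Go channel stand in for the message queue. This is an illustration only: ResizeJob, runWorkers, and the process callback are hypothetical names, and a real deployment would replace the channel with a queue client (SQS, RabbitMQ, Pub/Sub) and run the workers as separate services.

```go
package main

import (
	"fmt"
	"sync"
)

// ResizeJob is the message the API server would publish after an upload.
type ResizeJob struct {
	ImageID     string
	OriginalKey string // location of the original in object storage
	Widths      []int  // standard sizes to generate
}

// runWorkers starts n workers that drain jobs until the channel is closed,
// then returns how many variants were produced in total.
func runWorkers(n int, jobs <-chan ResizeJob, process func(ResizeJob, int) error) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				for _, width := range job.Widths {
					if err := process(job, width); err != nil {
						// A real worker would retry or dead-letter the job.
						fmt.Printf("job %s width %d failed: %v\n", job.ImageID, width, err)
						continue
					}
					mu.Lock()
					count++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return count
}

func main() {
	// Enqueue: the API server only publishes a small message and returns.
	jobs := make(chan ResizeJob, 16)
	for _, id := range []string{"a", "b", "c"} {
		jobs <- ResizeJob{
			ImageID:     id,
			OriginalKey: "originals/" + id + ".jpg",
			Widths:      []int{150, 300, 600},
		}
	}
	close(jobs)

	// Process and store: placeholder for download, resize, upload.
	produced := runWorkers(4, jobs, func(job ResizeJob, width int) error {
		return nil
	})
	fmt.Println("variants produced:", produced)
}
```

The number of workers is independent of the number of API servers, which is the property that lets each tier scale on its own.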
This approach gives us:
Our code uses a simple in-memory map with a mutex as a cache. This shares the exact same weakness as local disk storage: it's tied to a single server instance. If Server A processes image1_w300.jpg and caches the result, Server B knows nothing about it and will needlessly re-process the same image.
A multi-layered caching strategy is used to absorb traffic and reduce load.
A global CDN like Cloudflare or AWS CloudFront can cache images at the edge, close to users. This reduces latency and offloads traffic from our servers.
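A CDN generally decides what to keep at the edge based on the response headers our origin sends. Here is a small sketch of a handler wrapper that marks image responses as cacheable. The names withCacheControl and cacheControlValue are hypothetical, the one-year max-age is an arbitrary choice, and which directives a given CDN honors should be checked against its documentation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// cacheControlValue builds a Cache-Control header for content that never
// changes under a given URL (an image ID plus width always maps to the
// same bytes).
func cacheControlValue(maxAgeSeconds int) string {
	return fmt.Sprintf("public, max-age=%d, immutable", maxAgeSeconds)
}

// withCacheControl sets the header before handing off to the wrapped handler.
// Note that this sketch applies it to every response, errors included; a
// production version would likely skip it for non-2xx responses.
func withCacheControl(maxAgeSeconds int, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", cacheControlValue(maxAgeSeconds))
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for the real image handler.
	images := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "image bytes")
	})
	h := withCacheControl(365*24*60*60, images)

	// Exercise the handler without opening a socket.
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/images/abc?width=300", nil))
	fmt.Println(rec.Header().Get("Cache-Control"))
}
```

Long-lived edge caching has a cost on deletes: removing an image from storage doesn't remove it from the CDN, so the delete path would also need to trigger a purge.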
If the CDN misses, we can fall back on Redis or Memcached to cache image metadata. Before attempting to process an image, the API server checks whether the processed variant already exists. If it does, it can skip the processing step entirely.
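That check-before-process step might look like the sketch below. VariantIndex, MemIndex, and ensureVariant are hypothetical names; the in-memory index is a stand-in for Redis or Memcached, and the key format reuses the prototype's "%s_w%d" cache key.

```go
package main

import (
	"fmt"
	"sync"
)

// VariantIndex records where processed variants live. In production this
// would be backed by a shared cache so every API server sees the same data.
type VariantIndex interface {
	Lookup(key string) (location string, ok bool)
	Record(key, location string)
}

// MemIndex is an in-memory stand-in so the sketch is runnable.
type MemIndex struct {
	mu sync.RWMutex
	m  map[string]string
}

func NewMemIndex() *MemIndex {
	return &MemIndex{m: make(map[string]string)}
}

func (i *MemIndex) Lookup(key string) (string, bool) {
	i.mu.RLock()
	defer i.mu.RUnlock()
	loc, ok := i.m[key]
	return loc, ok
}

func (i *MemIndex) Record(key, location string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.m[key] = location
}

// ensureVariant returns the location of the requested variant and calls
// process only when the index has no record of it. The bool reports
// whether the lookup was a hit.
func ensureVariant(idx VariantIndex, imageID string, width int, process func() (string, error)) (string, bool, error) {
	key := fmt.Sprintf("%s_w%d", imageID, width)
	if loc, ok := idx.Lookup(key); ok {
		return loc, true, nil
	}
	loc, err := process()
	if err != nil {
		return "", false, err
	}
	idx.Record(key, loc)
	return loc, false, nil
}

func main() {
	idx := NewMemIndex()
	process := func() (string, error) {
		fmt.Println("processing image1 at width 300")
		return "processed/image1_w300.jpg", nil
	}
	for n := 0; n < 2; n++ {
		loc, hit, err := ensureVariant(idx, "image1", 300, process)
		fmt.Println(loc, hit, err)
	}
}
```

Two servers can still miss at the same moment and both process the image; the result is wasted work, not incorrect output, and a lock or a queue-side dedupe could close that gap if it matters.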
A distributed cache gives us:
Considering these distributed principles, our production architecture would look vastly different from our initial prototype.
[User Request] -----------> |             Real CDN             |
                            |  (e.g., Cloudflare, CloudFront)  |
                            +----------------+-----------------+
                                             | (Cache MISS)
                                             |
                            +----------------v-----------------+
                            |          Load Balancer           |
                            +----------------+-----------------+
                                             |
                  +-----------------+--------v--------+-----------------+
                  | Go API Server 1 | Go API Server 2 | Go API Server N |  (Stateless)
                  +--------+--------+--------+--------+--------+--------+
                           | (Upload)        | (Read Metadata) | (Delete)
                           |                 |                 |
    (Job Message) +--------v-----------------+-----------------v--------+
                  |        [Message Queue (e.g., SQS, RabbitMQ)]        |
                  +--------------------------+--------------------------+
                                             | (New Job)
          +----------------------------------v----------------------------------+
          |               Auto-Scaling Group of Processor Workers               |
          |    +----------------+   +----------------+   +----------------+     |
          |    |    Worker 1    |   |    Worker 2    |...|    Worker N    |     |
          |    +-------+--------+   +-------+--------+   +----------------+     |
          |            | (Process)          | (Process)                         |
          +------------+--------------------+-----------------------------------+
                       |                    |
                       | (Read Original)    | (Write Processed)
                       |                    |
             +---------v--------------------v----------+   +---------------------------+
             |     Distributed Object Storage (S3)     |   | Distributed Cache (Redis) |
             | - /originals                            |---| - /processed_variants     |
             | - /processed                            |   +---------------------------+
             +-----------------------------------------+
This is better. It's horizontally scalable at every layer, resilient to individual component failures, and designed for high performance.
For most of us, probably not. But at least it was fun to learn! Understanding the intricate architecture behind it empowers us as engineers to build better, more scalable systems of our own. Jokes aside, there ARE some reasons you might want to:
There's also much to improve on, such as observability. For a distributed system, comprehensive logging, monitoring, and tracing are critical! But that's for a different day.
That's all from me!
-Caleb
P.S. This article is not AI generated, only the bash diagram!