Jun 18, 2025
Design your own (mini) Cloudflare Images
Services like Cloudflare Images and Imgix offer a powerful, off-the-shelf solution for hosting and processing images. They abstract away the complexity, providing a simple API to solve a difficult problem. What architectural patterns and trade-offs are involved in designing a system that can ingest, process, and serve millions of images efficiently and reliably? I love systems design and I use Cloudflare often, so let's design a mini version of our own:
Core Functional Requirements:
- Image Ingestion: A write-heavy endpoint capable of handling high-throughput uploads.
- Persistent Storage: A durable, highly-available, and scalable object store for original, high-quality images.
- On-the-Fly Transformation: Real-time image processing for resizing, cropping, watermarking, and format conversion (e.g., JPEG to WebP/AVIF).
- Global Content Delivery: Low-latency delivery of processed images to users worldwide.
- High Availability & Fault Tolerance: The system must be resilient to component failures.
Hopefully you'll pick up some basic systems design principles along the way, such as working with object storage and choosing caching strategies.
A Simple Implementation
For learning purposes, let's pretend we already have a mini prototype that handles this. We'll keep the stack for our prototype minimal:
- Go: The choice for the API service. It's fast and simple, has built-in concurrency, and ships a strong standard library.
- FFmpeg: A powerhouse for image processing.
[User Request]
|
V
[Go API Server]
| |
| V
| [Image Storage] (e.g., local disk, S3-compatible storage)
| ^
| |
V |
[FFmpeg Processor]
|
V
[Cache Layer] (e.g., in-memory, Redis)
|
V
[CDN] --- Serves the cached or processed images
Project Structure
Basic project structure for our Go application:
└── mini-cloudflare-images/
├── cmd/
│ └── main.go # Entry point
├── internal/
│ ├── handlers/ # API Handlers
│ │ ├── upload.go
│ │ ├── retrieve.go
│ │ └── delete.go
│ ├── storage/ # Interacting with storage
│ │ └── local.go
│ ├── ffmpeg/ # Processing images
│ │ └── process.go
│ └── cache/ # Cache
│ └── cache.go
└── images/ # Directory to store images
├── original/
└── processed/
Storage Layer
We'll keep it simple and use the local filesystem for storage. In a production system, you might want to use S3 or another cloud storage solution.
// internal/storage/local.go
package storage
import (
"fmt"
"os"
"path/filepath"
)
// Creates the necessary directories for storing images if they don't already exist
func EnsureStorageDirs() error {
if err := os.MkdirAll(filepath.Join("images", "original"), 0755); err != nil {
return fmt.Errorf("failed to create original images directory: %w", err)
}
if err := os.MkdirAll(filepath.Join("images", "processed"), 0755); err != nil {
return fmt.Errorf("failed to create processed images directory: %w", err)
}
return nil
}
// Returns the full path for an image given its ID and variant
func GetImagePath(variant, imageID string) string {
return filepath.Join("images", variant, imageID)
}
// Removes an image file from the specified variant directory
func DeleteImage(variant, imageID string) error {
path := GetImagePath(variant, imageID)
if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
return fmt.Errorf("failed to delete image %s: %w", path, err)
}
return nil
}
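These helpers talk to the local disk directly. To make a later move to S3 less painful, one option (not part of the prototype, just a hypothetical sketch) is to put a small interface in front of the backend so handlers never touch the filesystem directly:
// internal/storage/store.go (hypothetical)
package storage
import "io"
// ObjectStore abstracts where image bytes live, so a local-disk
// implementation and an S3-backed one are interchangeable.
type ObjectStore interface {
	// Put writes an object under the given variant and ID.
	Put(variant, imageID string, r io.Reader) error
	// Get opens an object for reading; the caller must close it.
	Get(variant, imageID string) (io.ReadCloser, error)
	// Delete removes the object, treating not-found as success.
	Delete(variant, imageID string) error
}
With this in place, local.go and a future s3.go become two implementations of the same contract.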
Memory Cache
To avoid reprocessing images every time they are requested, we'll add a simple in-memory cache. Note the sync.RWMutex used for safe concurrent reads and writes.
// internal/cache/cache.go
package cache
import "sync"
// A simple thread-safe in-memory key-value store
type MemoryCache struct {
mu sync.RWMutex
items map[string]interface{}
}
func New() *MemoryCache {
return &MemoryCache{
items: make(map[string]interface{}),
}
}
// Note overwrites any existing item
func (c *MemoryCache) Set(key string, value interface{}) {
c.mu.Lock()
defer c.mu.Unlock()
c.items[key] = value
}
// Returns the item or nil, and a boolean indicating whether the key was found
func (c *MemoryCache) Get(key string) (interface{}, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
item, found := c.items[key]
return item, found
}
API Handlers
Uploading:
// internal/handlers/upload.go
package handlers
import (
"fmt"
"io"
"log"
"mini-cloudflare-images/internal/storage"
"net/http"
"os"
"github.com/google/uuid"
)
func UploadImage(w http.ResponseWriter, r *http.Request) {
file, _, err := r.FormFile("image")
if err != nil {
log.Printf("Error getting form file: %v", err)
http.Error(w, "Failed to get image from form", http.StatusBadRequest)
return
}
defer file.Close()
imageID := uuid.New().String()
// We'll store originals as .jpg for consistency
originalFileName := imageID + ".jpg"
originalPath := storage.GetImagePath("original", originalFileName)
dst, err := os.Create(originalPath)
if err != nil {
log.Printf("Error creating destination file: %v", err)
http.Error(w, "Failed to save image", http.StatusInternalServerError)
return
}
defer dst.Close()
if _, err := io.Copy(dst, file); err != nil {
log.Printf("Error copying image data: %v", err)
http.Error(w, "Failed to write image data", http.StatusInternalServerError)
return
}
w.WriteHeader(http.StatusCreated)
fmt.Fprintf(w, `{"status": "success", "imageId": "%s"}`, imageID)
}
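One thing this handler doesn't do is bound the upload size. A small, hypothetical middleware using the standard library's http.MaxBytesReader could cap it before the form is parsed:
// internal/handlers/middleware.go (hypothetical sketch)
package handlers
import "net/http"
// MaxBody caps the request body size so a single oversized upload can't
// exhaust memory or disk. The 10 MB limit is an arbitrary choice.
func MaxBody(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, 10<<20) // 10 MB
		next(w, r)
	}
}
Wiring it up would then look like mux.HandleFunc("POST /upload", handlers.MaxBody(handlers.UploadImage)).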
Retrieval and on-the-fly processing:
// internal/handlers/retrieve.go
package handlers
import (
"fmt"
"log"
"mini-cloudflare-images/internal/cache"
"mini-cloudflare-images/internal/ffmpeg"
"mini-cloudflare-images/internal/storage"
"net/http"
"os"
"strconv"
)
var imgCache = cache.New()
func RetrieveImage(w http.ResponseWriter, r *http.Request) {
imageID := r.PathValue("id")
widthStr := r.URL.Query().Get("width")
// If no width is specified, serve the original
if widthStr == "" {
originalPath := storage.GetImagePath("original", imageID+".jpg")
if _, err := os.Stat(originalPath); os.IsNotExist(err) {
http.NotFound(w, r)
return
}
http.ServeFile(w, r, originalPath)
return
}
	width, err := strconv.Atoi(widthStr)
	if err != nil || width <= 0 {
		http.Error(w, "Invalid width parameter", http.StatusBadRequest)
		return
	}
cacheKey := fmt.Sprintf("%s_w%d", imageID, width)
if cachedPath, found := imgCache.Get(cacheKey); found {
http.ServeFile(w, r, cachedPath.(string))
return
}
	originalPath := storage.GetImagePath("original", imageID+".jpg")
	// Don't hand a missing original to ffmpeg; return 404 instead of a 500
	if _, err := os.Stat(originalPath); os.IsNotExist(err) {
		http.NotFound(w, r)
		return
	}
	processedFileName := fmt.Sprintf("%s_w%d.jpg", imageID, width)
	processedPath := storage.GetImagePath("processed", processedFileName)
if err := ffmpeg.ResizeImage(originalPath, processedPath, width); err != nil {
log.Printf("Error processing image %s: %v", imageID, err)
http.Error(w, "Failed to process image", http.StatusInternalServerError)
return
}
imgCache.Set(cacheKey, processedPath)
http.ServeFile(w, r, processedPath)
}
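One gap: the handler above calls ffmpeg.ResizeImage, which our project structure lists but we haven't shown. Here's a minimal sketch of internal/ffmpeg/process.go that shells out to the ffmpeg binary (assumed to be on your PATH), using the scale filter to resize to the requested width while preserving aspect ratio:
// internal/ffmpeg/process.go
package ffmpeg
import (
	"fmt"
	"os/exec"
)
// ResizeImage resizes inputPath to the given width, keeping the aspect
// ratio (scale=WIDTH:-1), and writes the result to outputPath.
func ResizeImage(inputPath, outputPath string, width int) error {
	cmd := exec.Command(
		"ffmpeg",
		"-y", // overwrite the output file if it already exists
		"-i", inputPath,
		"-vf", fmt.Sprintf("scale=%d:-1", width),
		outputPath,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("ffmpeg failed: %w (output: %s)", err, out)
	}
	return nil
}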
Deleting:
// internal/handlers/delete.go
package handlers
import (
"fmt"
"log"
"mini-cloudflare-images/internal/storage"
"net/http"
)
func DeleteImage(w http.ResponseWriter, r *http.Request) {
imageID := r.PathValue("id")
originalFileName := imageID + ".jpg"
if err := storage.DeleteImage("original", originalFileName); err != nil {
log.Printf("Error deleting image %s: %v", imageID, err)
http.Error(w, "Failed to delete image", http.StatusInternalServerError)
return
}
// Optional: You could also iterate and delete all processed variants
// For simplicity, we are only deleting the original here.
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, `{"status": "success", "message": "Image %s deleted"}`, imageID)
}
Tying It All Together
Finally, we'll set up our routes, initialize the necessary directories, and start the web server. We'll use Go's built-in http.ServeMux for routing.
// cmd/main.go
package main
import (
"log"
"mini-cloudflare-images/internal/handlers"
"mini-cloudflare-images/internal/storage"
"net/http"
)
func main() {
if err := storage.EnsureStorageDirs(); err != nil {
log.Fatalf("Could not create storage directories: %v", err)
}
mux := http.NewServeMux()
// The 'images' dir is our simulated CDN/File Server
// This serves files directly from the 'images/processed' directory
fs := http.FileServer(http.Dir("./images/processed"))
mux.Handle("/cdn/", http.StripPrefix("/cdn/", fs))
// API Routes
mux.HandleFunc("POST /upload", handlers.UploadImage)
mux.HandleFunc("GET /images/{id}", handlers.RetrieveImage)
mux.HandleFunc("DELETE /delete/{id}", handlers.DeleteImage)
log.Println("Starting server on :8080")
if err := http.ListenAndServe(":8080", mux); err != nil {
log.Fatalf("Could not start server: %s\n", err)
}
}
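With the server running, you can exercise the endpoints with curl: curl -F "image=@photo.jpg" http://localhost:8080/upload uploads a file (the file name here is just an example), curl "http://localhost:8080/images/<imageId>?width=300" -o out.jpg fetches a 300px-wide variant (substituting the imageId returned by the upload), and curl -X DELETE http://localhost:8080/delete/<imageId> removes the original.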
IF YOU WANTED TO SKIP THE CODE, YOU CAN PICK BACK UP HERE
Systems Design Considerations
This simple prototype is cool and all, but building a system that can handle millions of requests requires us to think about how each component will scale. We'll explore the architectural decisions and trade-offs involved:
1. Storage
Local Disk Doesn't Scale
Our simple implementation writes files directly to a local images/ directory. This is a critical flaw for any system that needs to run on more than one server.
If we run two instances of our Go API server behind a load balancer, which server gets the uploaded image? If Server A handles the upload, the image is stored on its local disk. When a retrieval request for that same image comes in but gets routed to Server B, Server B won't find the file, leading to a 404 error. This makes our application "stateful," meaning each server holds unique data that others don't.
A Distributed Solution
The industry-standard solution is to use a distributed object storage service like AWS S3, Google Cloud Storage, or DigitalOcean Spaces.
Benefits:
- Stateless: By offloading storage to an external, shared service, our Go API servers become stateless. Any server can handle any request because the source of truth for images is now a central, highly-available, and durable location.
- Decoupling: Storage scales independently of our compute services, so each can grow to match demand.
- Durability & Availability: These services are designed for extreme durability (e.g., S3's eleven 9's) by automatically replicating data across multiple physical locations.
2. Synchronous vs Asynchronous Processing
Synchronous
When a user uploads an image, we need to process it to create different sizes or formats. We're doing this using ffmpeg. Our current design processes images "on-the-fly". When a request for a new image size arrives, the API handler blocks—doing nothing else—until ffmpeg finishes its work.
If we get a burst of requests for new image variants, the server's load will spike, and response times for all requests will skyrocket. This is synchronous processing, and it makes our system vulnerable to performance degradation and poor user experience for that first-time load.
Asynchronous: Worker Queues
A much more resilient and scalable system uses a message queue and a pool of dedicated workers.
- Enqueue Job: When the API server receives an upload, instead of processing it immediately, it simply publishes a "job" message to a message queue like RabbitMQ, AWS SQS, or Google Pub/Sub. The message contains details like the image ID and the location of the original in object storage.
- Asynchronous Workers: We run a separate fleet of services (our "workers"). Their only job is to pull messages from the queue and process them.
- Process and Store: When a worker gets a job, it downloads the original image from object storage, performs all the necessary processing (e.g., creates several standard sizes), and uploads the processed variants back to object storage. A rough sketch of this pattern follows the list.
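Here's that sketch: a minimal, in-process version of the enqueue/worker pattern. A buffered Go channel stands in for the real message queue, a log line stands in for the download/resize/upload work, and the Job fields are hypothetical:
// Sketch: the enqueue/worker pattern, collapsed into one process.
// In production the producer (API) and workers are separate services.
package main
import (
	"log"
	"sync"
)
// Job carries what a worker needs: the image ID and where the
// original lives in object storage.
type Job struct {
	ImageID     string
	OriginalKey string
}
func main() {
	jobs := make(chan Job, 100) // stand-in for the message queue
	var wg sync.WaitGroup
	// Start a small pool of workers that pull jobs and process them.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := range jobs {
				// Real worker: download the original, create the
				// standard sizes, upload variants back to storage.
				log.Printf("worker %d: processing %s", id, j.ImageID)
			}
		}(i)
	}
	// What the upload handler would do instead of processing inline.
	jobs <- Job{ImageID: "abc123", OriginalKey: "originals/abc123.jpg"}
	close(jobs)
	wg.Wait()
}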
This approach gives us:
- Responsiveness: The user's upload request completes almost instantly because the API server's only job is to accept the file and create a job message.
- Decoupling: The processing workload is completely decoupled from the API. If image uploads spike, we can simply scale up our worker fleet independently of the API servers to handle the load.
- Resilience: If a worker fails mid-process, the message can be returned to the queue and picked up by another worker, ensuring the image eventually gets processed.
3. Cache
Our code uses a simple in-memory map with a mutex as a cache. This shares the exact same weakness as local disk storage: it's tied to a single server instance. If Server A processes image1_w300.jpg and caches the result, Server B knows nothing about it and will needlessly re-process the same image.
Multi-Layer Distributed Cache
A multi-layered caching strategy absorbs traffic close to the user and reduces load on our origin.
Layer 1: CDN Edge Cache
A global CDN like Cloudflare or AWS CloudFront can cache images at the edge, close to users. This reduces latency and offloads traffic from our servers.
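The CDN will only cache what the origin tells it to, via response headers. Here's a small sketch of a Go middleware that marks processed images as long-lived; the max-age value is an arbitrary choice, and it's safe here because each processed URL encodes its width and never changes in place:
// Sketch: a middleware that makes responses cacheable at the edge.
package handlers
import "net/http"
// withCDNCaching marks responses as long-lived so edge caches can
// serve them without re-contacting the origin.
func withCDNCaching(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
		next.ServeHTTP(w, r)
	})
}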
Layer 2: Distributed Metadata Cache
If the CDN cache misses, we can fall back to Redis or Memcached to cache image metadata. Before attempting to process an image, the API server checks whether the processed variant already exists. If it does, it can skip the processing step entirely.
A distributed cache gives us:
- Shared State: A Redis cluster, for example, becomes the single source of truth for our cache. Every server checks Redis before attempting to process an image.
- Features: Built-in features like Time-To-Live (TTL) automatically evict stale cache entries.
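Here's a minimal sketch of that metadata check, assuming the go-redis v9 client (github.com/redis/go-redis/v9); the key scheme, address, and 24-hour TTL are all illustrative choices:
// Sketch: checking Redis for a processed variant before doing any work.
package cache
import (
	"context"
	"fmt"
	"time"
	"github.com/redis/go-redis/v9"
)
var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})
// LookupVariant returns the object-storage key of a processed variant,
// or ok=false if no server has produced it yet.
func LookupVariant(ctx context.Context, imageID string, width int) (string, bool) {
	val, err := rdb.Get(ctx, fmt.Sprintf("variant:%s:w%d", imageID, width)).Result()
	if err != nil { // redis.Nil (a miss) or a real error: reprocess either way
		return "", false
	}
	return val, true
}
// RecordVariant stores the variant's location with a TTL so stale
// entries eventually evict themselves.
func RecordVariant(ctx context.Context, imageID string, width int, objectKey string) error {
	return rdb.Set(ctx, fmt.Sprintf("variant:%s:w%d", imageID, width), objectKey, 24*time.Hour).Err()
}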
4. All Together: A Scalable System Architecture
Considering these distributed principles, our production architecture would look vastly different from our initial prototype.
[User Request] ----------> +------------------------+
                           |        Real CDN        |
                           | (Cloudflare/CloudFront)|
                           +-----------+------------+
| (Cache MISS)
|
+-----------------v-----------------+
| Load Balancer |
+-----------------+-----------------+
|
+----------------v----------------v----------------+
| Go API Server 1 | Go API Server 2 | Go API Server N | (Stateless)
+----------------+----------------+----------------+
| (Upload) | (Read Metadata) | (Delete)
| | |
(Job Message) +-----v----------------+-----------------v----------+
| [Message Queue (e.g., SQS, RabbitMQ)] |
+--------------------+----------------------------+
| (New Job)
+----------------------------------v----------------------------------+
| Auto-Scaling Group of Processor Workers |
| +----------------+ +----------------+ +----------------+ |
| | Worker 1 | | Worker 2 |...| Worker N | |
| +-------+--------+ +-------+--------+ +----------------+ |
| | (Process) | (Process) |
+---------+--------------------+--------------------------------------+
| |
| (Read Original) | (Write Processed)
| |
+---------v--------------------v----------+ +------------------------+
| Distributed Object Storage (S3) | | Distributed Cache (Redis)|
| - /originals |---| - /processed_variants |
| - /processed | +------------------------+
+-----------------------------------------+
This is better. It's horizontally scalable at every layer, resilient to individual component failures, and designed for high performance.
Should You Build Your Own?
For most of us, probably not. But at least it was fun to learn! Understanding the intricate architecture behind it empowers us as engineers to build better, more scalable systems of our own. Jokes aside, there ARE some reasons you might want to:
- Cost: For high-volume operations, rolling your own can be more cost-effective
- Privacy/Compliance: Keep your data entirely within your control
There's also much to improve on, such as observability. For a distributed system, comprehensive logging, monitoring, and tracing are critical! But that's for a different day.
That's all from me!
-Caleb
P.S. This article is not AI generated, only the ASCII diagrams!