
Scaling Django for AI-Powered Applications: Lessons from Production

Huzaifa Athar · January 25, 2026 · 7 min read

Django Gets a Bad Rap for AI Workloads

I keep seeing people online say Django can't handle AI applications. "It's synchronous." "It's too slow." "Just use FastAPI." I've heard it all. And honestly, some of those concerns are valid if you're using Django the same way you'd build a basic CRUD app. But with the right architecture, Django handles AI workloads just fine. I know because I've been running it in production at DigitLabs for our chatbot platform.

Here's what actually works.

The Core Problem: LLM Calls Are Slow

The fundamental challenge with AI backends is that LLM API calls take anywhere from 2 to 30 seconds. If you make those calls inside a Django view, you're blocking a worker process the entire time. With a typical Gunicorn setup running 4-8 workers, it only takes a handful of concurrent users to exhaust your capacity.

The answer isn't to abandon Django. The answer is to never make LLM calls in your request/response cycle.

Celery and Redis: The Backbone of Everything

Every AI operation in our system goes through Celery. When a user sends a message to our chatbot, here's what actually happens:

1. The Django view receives the message and validates it. Fast, under 50ms.
2. It creates a database record for the conversation turn with status "processing."
3. It dispatches a Celery task to handle the LLM call.
4. It returns a 202 Accepted response immediately.
5. The frontend polls (or listens via WebSocket) for the result.

The Celery task does the heavy lifting: calling the LLM API, processing the response, running any tool calls, and updating the database with the final answer.
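
Here's a minimal sketch of that flow. `ChatTurn`, `process_chat_turn`, and `call_llm` are illustrative names, not our exact code:

```python
# views.py -- sketch of the dispatch pattern
import json

from django.http import JsonResponse
from django.views.decorators.http import require_POST

from .models import ChatTurn          # hypothetical conversation-turn model
from .tasks import process_chat_turn  # hypothetical Celery task (sketched below)


@require_POST
def send_message(request, chatbot_id):
    message = json.loads(request.body).get("message", "").strip()
    if not message:
        return JsonResponse({"error": "empty message"}, status=400)

    # Steps 1-2: validate and record the turn -- fast, no LLM work here.
    turn = ChatTurn.objects.create(
        chatbot_id=chatbot_id, user_message=message, status="processing"
    )

    # Steps 3-4: hand the slow part to Celery and return 202 immediately.
    process_chat_turn.delay(turn.id)
    return JsonResponse({"turn_id": turn.id, "status": "processing"}, status=202)


# tasks.py -- the worker does the slow part and writes the result back;
# the frontend's poll (step 5) just reads this row until status flips to "done".
from celery import shared_task


@shared_task
def process_chat_turn(turn_id):
    turn = ChatTurn.objects.get(pk=turn_id)
    turn.assistant_message = call_llm(turn.user_message)  # hypothetical LLM client wrapper
    turn.status = "done"
    turn.save(update_fields=["assistant_message", "status"])
```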

This pattern keeps Django responsive. Our p95 response time on the API is under 100ms because the views never block on AI operations.

A few Celery configuration lessons I learned the painful way (a config sketch follows the list):

  • Set `task_time_limit` and `task_soft_time_limit`. LLM APIs hang sometimes. Without timeouts, you'll have zombie workers consuming resources forever. I use 120 seconds soft limit and 180 seconds hard limit.
  • Use separate queues for different task priorities. We have a "chat" queue for user-facing messages (high priority) and a "batch" queue for background analytics and embedding generation (low priority). Different worker pools, different concurrency settings.
  • Set `task_acks_late=True` with `worker_prefetch_multiplier=1`. This ensures tasks aren't lost if a worker crashes mid-execution. The task stays in Redis until it's actually completed.
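
Put together, those settings look roughly like this. The app name, broker URL, and task path are placeholders:

```python
# celery.py -- roughly the configuration described above
from celery import Celery
from kombu import Queue

app = Celery("chatbot", broker="redis://localhost:6379/0")

app.conf.update(
    task_soft_time_limit=120,       # SoftTimeLimitExceeded raised inside the task
    task_time_limit=180,            # hard kill if it's still running after this
    task_acks_late=True,            # ack only after the task actually finishes
    worker_prefetch_multiplier=1,   # a worker holds at most one unacked task
    task_queues=(
        Queue("chat"),              # user-facing messages
        Queue("batch"),             # analytics and embedding generation
    ),
    task_default_queue="batch",     # anything not routed explicitly goes to batch
    task_routes={
        "chatbot.tasks.process_chat_turn": {"queue": "chat"},
    },
)
```

Each queue then gets its own worker pool, something like `celery -A chatbot worker -Q chat --concurrency=8` for chat and a smaller pool for batch.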

Database Optimization: The Stuff That Actually Matters

Our platform stores conversation histories, analytics data, and embedding metadata. The database (PostgreSQL) became a bottleneck faster than I expected.

Indexing conversations correctly. Our most common query pattern is "get recent conversations for a specific chatbot." Simple, right? But without the right composite index, this query was doing sequential scans on a table with millions of rows. A composite index on `(chatbot_id, created_at DESC)` brought query time from 800ms to 3ms.
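
In Django model terms, that index looks something like this (the model is a simplified stand-in for ours):

```python
# models.py -- simplified stand-in for our conversation model
from django.db import models


class Conversation(models.Model):
    chatbot = models.ForeignKey("Chatbot", on_delete=models.CASCADE)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # Matches "recent conversations for a chatbot":
            # WHERE chatbot_id = ... ORDER BY created_at DESC
            models.Index(fields=["chatbot", "-created_at"], name="conv_chatbot_recent_idx"),
        ]
```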

Partitioning analytics tables. We store event data for analytics dashboards. After about 6 months, the analytics table hit 50 million rows and even indexed queries were getting slow. I partitioned the table by month using PostgreSQL's native declarative partitioning. Queries that filter by date range now only scan the relevant partitions.
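
If you haven't done this before, the DDL itself is simple. Here's a rough sketch of what it looks like wrapped in a Django migration; table, column, and app names are made up, and a real migration also has to move the existing rows:

```python
# A sketch of declarative partitioning via RunSQL, not our actual migration.
from django.db import migrations

SQL = """
CREATE TABLE analytics_event_partitioned (
    id          bigserial,
    chatbot_id  bigint      NOT NULL,
    event_type  text        NOT NULL,
    created_at  timestamptz NOT NULL,
    PRIMARY KEY (id, created_at)   -- the partition key must be part of the PK
) PARTITION BY RANGE (created_at);

-- One partition per month; future partitions have to be created ahead of time.
CREATE TABLE analytics_event_2026_01 PARTITION OF analytics_event_partitioned
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
"""


class Migration(migrations.Migration):
    dependencies = [("analytics", "0001_initial")]
    operations = [migrations.RunSQL(SQL)]
```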

Avoiding N+1 queries in the API. Django's ORM makes it really easy to accidentally generate hundreds of queries. I use `select_related` and `prefetch_related` aggressively, and I added Django Debug Toolbar in development to catch N+1 issues before they reach production. We also use `.only()` and `.defer()` to avoid loading large text fields (like full conversation transcripts) when we only need metadata.
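
A typical list query ends up looking something like this (model and field names are illustrative):

```python
# Related turns come back in one extra query instead of N,
# and the large transcript field never leaves the database.
conversations = (
    Conversation.objects
    .filter(chatbot_id=chatbot_id)
    .prefetch_related("turns")            # one extra query for all related turns
    .only("id", "title", "created_at")    # skip full_transcript and other big text fields
    .order_by("-created_at")[:50]
)
```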

Connection Pooling: pgBouncer Saved Us

Django creates a new database connection for every request by default. Under load, we were hitting PostgreSQL's connection limit constantly. The database would reject new connections, and the whole app would grind to a halt.

I added pgBouncer in transaction pooling mode between Django and PostgreSQL. Configuration that works well for us:

  • `default_pool_size = 25`
  • `max_client_conn = 200`
  • `pool_mode = transaction`

This reduced our active database connections from 100+ (one per Gunicorn worker plus Celery workers) to about 25, while actually improving throughput. The connection overhead was a bigger deal than I realized.

One gotcha: Django's persistent connections (`CONN_MAX_AGE`) don't play well with pgBouncer in transaction mode. Set `CONN_MAX_AGE = 0` in your Django settings when using pgBouncer, or you'll get stale connection errors that are really annoying to debug.
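
For reference, here's roughly what the database settings look like with pgBouncer in front; host, port, and credentials are placeholders:

```python
# settings.py -- Django talks to pgBouncer, not PostgreSQL directly.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "chatbot",
        "USER": "chatbot",
        "PASSWORD": os.environ["DB_PASSWORD"],
        "HOST": "127.0.0.1",                  # pgBouncer listens here
        "PORT": "6432",                       # pgBouncer's default port
        "CONN_MAX_AGE": 0,                    # no persistent connections; pgBouncer owns pooling
        "DISABLE_SERVER_SIDE_CURSORS": True,  # server-side cursors misbehave in transaction mode
    }
}
```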

Caching Strategies That Work

We cache at multiple layers:

  • Redis for session and conversation state. Active conversation contexts are stored in Redis with a 30-minute TTL. This avoids hitting the database on every message in an ongoing conversation.
  • Django's cache framework for API responses. Analytics dashboard endpoints that aggregate data are cached for 5 minutes. The data doesn't change that fast, and regenerating those aggregations is expensive.
  • LLM response caching. For identical prompts (which happen more often than you'd think, especially with system prompts), we cache the LLM response with a content-based hash key. This saves real money on API costs. A minimal version is sketched right after this list.
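
The LLM cache is just a hash of the prompt used as a cache key. Something like this, where `call_llm` stands in for whatever client wrapper you use:

```python
# Content-hash cache for LLM responses
import hashlib

from django.core.cache import cache


def cached_llm_call(system_prompt: str, user_prompt: str, ttl: int = 3600) -> str:
    # Identical (system, user) prompt pairs map to the same cache key.
    digest = hashlib.sha256(f"{system_prompt}\n{user_prompt}".encode()).hexdigest()
    key = f"llm:{digest}"

    response = cache.get(key)
    if response is None:
        response = call_llm(system_prompt, user_prompt)  # hypothetical LLM client
        cache.set(key, response, timeout=ttl)
    return response
```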

One thing I stopped doing: caching at the template level. Our frontend is a separate React app, so Django template caching is irrelevant. If you're still rendering templates, it's worth looking into, but for API-first Django apps, focus your caching on the data layer.

Monitoring: What to Watch

I use a combination of Sentry, Prometheus, and Grafana. The metrics I actually look at daily:

  • Celery queue depth. If the chat queue has more than 10 pending tasks, something is wrong. Either the LLM API is slow or we need more workers. (A small exporter sketch for this metric follows the list.)
  • Task failure rate. We aim for under 1% failure rate on chat tasks. Most failures are LLM API timeouts or rate limits.
  • Database query time (p95). If this creeps above 50ms, I start investigating. Usually it's a missing index or a query that needs optimization.
  • Memory usage per worker. Python processes can leak memory, especially when processing large conversation contexts. I set `max_requests = 1000` in Gunicorn to recycle workers periodically.
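
For the queue-depth number, a tiny exporter like this does the job when Redis is the broker, since each Celery queue is just a Redis list. Port, broker location, and metric name here are illustrative:

```python
# queue_depth.py -- export Celery queue depth to Prometheus (Redis broker assumed)
import time

import redis
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("celery_queue_depth", "Pending tasks per Celery queue", ["queue"])


def main():
    broker = redis.Redis(host="localhost", port=6379, db=0)
    start_http_server(9200)                       # Prometheus scrapes this port
    while True:
        for queue in ("chat", "batch"):
            QUEUE_DEPTH.labels(queue=queue).set(broker.llen(queue))
        time.sleep(15)


if __name__ == "__main__":
    main()
```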

Sentry catches exceptions and gives us stack traces with context. Prometheus collects time-series metrics. Grafana dashboards make it visible. Nothing fancy, but it works.

The "Just Use FastAPI" Crowd

Look, FastAPI is great. I use it for some microservices where async I/O is the core requirement. But Django gives you the admin panel, the ORM, the migration system, the authentication framework, the middleware ecosystem. For a full product with user management, analytics, and complex business logic, Django saves an enormous amount of time.

The key insight is that Django doesn't need to be async to handle AI workloads. It just needs to delegate the slow parts to background workers. That's what Celery is for. This pattern has been battle-tested for over a decade. It works.

Our platform handles thousands of conversations per day on a pretty modest setup: 4 Gunicorn workers, 8 Celery workers, one PostgreSQL instance, one Redis instance. Total infrastructure cost is under $200/month. Try telling me Django doesn't scale.
