DECISION NODE

May 9, 2024

Back to index

Open in Eraser

May 9, 2024

Demystifying NoSQL vs SQL (It's not what you might think)

Ken Kohlmann

,ㅤ

Founder @ Code to the Moon

Ken is the founder of Code to the Moon, a YouTube channel discussing software development. He is also a LinkedIn Learning Instructor. Prior to all of this, he was a Software Development Engineer at Amazon.com. He is also on Twitter.

SQL still stands today, over 50 years after its inception, as one of the most popular ways of interacting with databases. But in those 50 years, the software development landscape has shifted in ways that have catalyzed the popularity of other approaches too.

Still reveling in the power and beauty of the language, I personally prefer SQL over any other way of interacting with a database, and it's not even close. But I try to avoid SQL databases unless I absolutely have to use them.

The generic, nuance-dismissing "NoSQL" has become a popular umbrella term for many of these approaches. While accurate in that these systems do not leverage the SQL language, the distinguishing factors go far deeper than the language used to interact with the database.

In fact, I'd argue that the degree to which the term "NoSQL" is misleading is on par with the travesty of "JavaScript" being nominally chained to another language with which it has no underlying relation. Despite that, I will resign to the colloquial use of the term for the remainder of this discussion.

Let's unpack all of this - including how I personally decide which type of database to use for an application, which is often at odds with my undying love for SQL. Some things this exploration will touch on are

Query Flexibility
Scaling Concerns
Read performance
Query Consistency
Schema Design

What Is A Normalized Schema?

Back in the early days of SQL, when memory was scarce and traffic volume was low (at least relative to today's behemoths) a normalized schema became the default approach.

A normalized schema involves having separate tables for distinct but related entities.

Consider a database that is used to track employee tasks and who those tasks are assigned to. Each task is assigned to one person, but a single person may be assigned more than one task. We could simply store the assigned employee's name with each task:

But there are some issues with this approach:

It duplicates data - the employee's name is duplicated across all tasks to which the employee is assigned
Since employee names are not necessarily static, a change to an employee name would require an update to all records assigned to that employee

A normalized schema would separate tasks and employees into separate tables. We'd have an assigned_to field in the task table which refers to a record in the employee table:

This normalized approach resolves both of the aforementioned issues - because the employee's name is stored in the employee table, it is never duplicated even if there are multiple tasks assigned to that employee. If the employee's last name changes, accommodating the change only requires a single update operation. In this way, storage efficiency and write performance are optimized.

Normalization is great for ensuring the data is as small as possible on disk. It's also great for minimizing the number of writes that have to be performed to update a piece of data, even if that data is referenced by other data. But it is not necessarily optimal for read performance.

It is quite possible to store denormalized data in a SQL database. But the ecology of features, development effort, and educational resources around SQL databases typically assumes the use of a normalized schema. For that reason, "using a SQL database" and "using a normalized schema" have become inexorably linked.

Changing Tides

By the late 2000s, several factors began influencing what developers wanted out of their databases

Extremely cheap storage costs (relative to decades earlier)
A desire to maximize read performance, and a willingness to sacrifice other things to do so
Data volumes that exceed what can be stored on a single host

The key assumptions of normalization is that storage efficiency and write performance are to be prioritized above all else. Decades ago, when database storage capacity was measured in kilobytes and megabytes, those assumptions were reasonable. With such storage constraints, a denormalized schema may not have been feasible.

For many of today's applications, a normalized schema may still be the best approach. But application developers that need the maximum possible read performance and/or an ability to accommodate a massive volume of data should think twice about using a normalized schema.

Duplication of data in exchange for faster query responses and the ability to scale horizontally (add more nodes to an existing database system) has become a sensible approach for many types of applications.

Downsides of Normalization

Read Performance

For queries which reference a single table using an indexed column, there isn't a problem. But when data is normalized, it's all but inevitable that an application will, in a single query, need to merge data sourced from separate tables. Enter the infamous JOIN operation, which does exactly that. While there are always exceptions, JOIN is generally regarded as a potential detriment to query performance. Given incentives to make a particular read query faster, SQL developers will instinctively look into whether it's possible to prune out any JOIN operations.

Horizontal Scalability

Horizontal scaling is the concept of increasing the capacity of a system simply by adding more nodes. This is in contrast to "vertical" scaling, where capacity is increased by upgrading the capability of existing nodes, typically via moving to better hardware.

There are two main approaches to horizontally scaling a database system, each of which has its own pros and cons.

Sharding

Sharding is the concept of splitting data across multiple database nodes, which can potentially

Allow read and write requests to be load balanced across nodes
Allow data to grow larger than the capacity of a single node

The case could be made that sharding is the only holistic approach to horizontal scaling, in that it simultaneously addresses both increasing storage capacity and traffic volume requirements.

There can be some challenges incorporating sharding into a normalized schema.

Say we have normalized data spread across 3 sharded tables that are all inter-related, and we have a heavily used read query that performs a join across all of them. In the worst case scenario, a read would first have to be done on 3 distinct database nodes, followed by a subsequent merge of the results. The problem here is not dissimilar to the performance impact a JOIN may have in a single node scenario - the point here is that it can wind up negating some of the load balancing benefits of sharding.

So while it is possible to shard normalized data, doing so requires careful design by developers such that the desired benefits are realized and performance chasms avoided.

Replication

In contrast to sharding, replication aims to duplicate an entire dataset across multiple nodes. Obviously this doesn't enable handling larger volumes of data, but it can help with the load balancing of requests and fault tolerance.

A common pattern seen in the SQL world (especially for read-heavy applications) is the use of read-only replica nodes. This enables the application to load balance read queries across the read-only replicas, but all write queries must be sent to a "master" node. Fortunately, having normalized data doesn't introduce any unique obstacles to scaling via replication.

Increased reliance on atomicity and consistency

Atomicity and query consistency are some of the hallmark features of SQL databases, and are often viewed as big selling points. But they come at a performance cost, and the use of a normalized schema can increase reliance on them.

- Consistency

In this context, "consistency" refers to data updates being reflected in all read queries that occur after a write query. That might sound like something that can be taken for granted, but these consistency guarantees are not "free" - they incur a performance penalty.

When queries extract data from disparate sources (tables), it becomes more important for the database to guarantee consistency. After a data update, it is critical that all subsequent read operations reflect the new state of the data. Otherwise, the result of subsequent queries may only partially reflect the new state of the data, leading to nondeterministic behavior.

- Atomicity

SQL databases provide the ability to wrap multiple operations into an atomic "transaction". A guarantee is made that the entire transaction will either execute successfully and the data updated to reflect all operations, or it will fail and data will remain in its original state. A state where some of the operations in the transaction succeed but others fail is simply not possible. Again - this comes at a cost to performance.

As with the need for consistency, the reliance on atomicity can be exacerbated by the use of a normalized schema. Because application state can be spread across multiple tables, updating said state safely may depend on lockstep updates to those tables.

Advantages of Denormalized Data

Most "NoSQL" databases - MongoDB, DynamoDB, Redis, Cassandra, etc - either require or strongly encourage a denormalized schema. While some of them do have support for operations akin to an SQL JOIN, relying heavily on such features is generally considered to be "going against the grain".

In the SQL world, schema design is driven by the nature of the relationships between the entities that need to be stored. In the NoSQL world, developers are encouraged to let the schema design be driven by their application's access patterns. That can require a bit of a mindset shift. In other words, your schema should cater directly to your application's access patterns, accommodating them with as few queries as possible. Because we can't rely on JOIN, that usually means consolidating all data consumed by an access pattern into one table. Continuing with the aforementioned Employee/Task example, that might mean storing the same employee name along with each task assigned to that employee.

Then there's the part that SQL aficionados can get downright offended by. It's perfectly acceptable to store different types of records in the same table. What?? I have traumatic memories from the first time I was introduced to this idea. We'll get to this.

So what do all of these extra constraints get us? Predictable, ultra fast query performance. Hands-free horizontal scaling and fault tolerance. In the case of DynamoDB, you can theoretically scale from 0 users to 100 million users without any database schema or configuration changes. Contrast that to the surgical database choreography that the Figma team had to perform to get their PostgreSQL based stack to accommodate a growing user base.

It's a seductive prospect even if you have no reason to believe you'll have that many users.

We mentioned horizontal scaling via sharding earlier. Sharding is the only conduit through which we can build systems that aren't riddled with "as long as table X doesn't grow larger than Y, we'll be fine" time bombs.

Some NoSQL databases can perform this sharding automatically and in real time, with no configuration from the developer or interruption in service. Also, when data is in a form that negates the need to merge from separate sources, it becomes more feasible to guarantee that a given read query will only need to pull data from one shard. That, combined with automatic sharding, can be a beautiful thing.

My Process For Choosing Between SQL and NoSQL

I have a very high level process for choosing between database paradigms. This decision tree deliberately overlooks some of the nuances of the decision, in the interest of distilling it down to the most important components. Of course, there will be situations where criteria other than these become deciding factors.

Notice what isn't here. Nowhere on the flow is "Strict vs dynamic schema". The aspect that many might consider to be the hallmark distinction between SQL and NoSQL databases. A strict schema can be a handy guardrail in situations where it makes sense, but I find it hard to think of scenarios where it is a "must have". That's in part because those constraints can be implemented at the application layer. Similarly, it is possible to store freeform data in databases even if they assume a strict schema.

Let's look more closely at these decision nodes.

Guarantee that data will fit on a single host forever

There are some situations where you might have reasonable guarantees that your user base will never grow such that your data is too large to fit on a single database node. Maybe you're building an internal application for a small-medium sized company. Or maybe you have a large user base, but you can prove that your data will still be small. In most cases it probably makes sense to just use a SQL database. Enjoy the beautiful language. Seriously, I'm jealous.

But I find that this rationale can be challenging to justify, partly because it is difficult to be absolutely certain that scalability will never be required. It feels like the analog of neglecting to handle a corner case in your code because "that will never happen!". As developers, we strive to handle the "unlikely but possible" scenarios in our code. The prospect of "having to scale up", no matter how remote, feels like another "unlikely but possible" scenario that most systems should account for.

Denormalization would yield a combinatorial explosion of duplicate data

Probably the most common reason to use a SQL database. The nature of some access patterns makes it infeasible to denormalize the underlying data due to a combinatorial explosion.

Consider an application where the user is given a rich set of tuning knobs and configuration to control a visualization of the underlying data. If the flexibility to view the data from many different perspectives is needed, there's a good chance that pre-computing and storing all of these perspectives in a denormalized form may not be feasible.

In other words, denormalizing the data to accommodate the access patterns would result in a combinatorial explosion of duplicated data. This may be impossible both from a storage requirement perspective and because of the obscene number of write requests that would be necessary whenever a piece of duplicated data needs to be updated.

NoSQL databases encourage you to design your schema around your access patterns. Even if your access patterns are modest in the beginning, the possibility of requiring unforeseen access patterns in the future may be problematic. In the case of DynamoDB, there is the option of adding more Global Secondary Indexes to a table to accommodate these new access patterns. But doing so can increase costs.

Secondary Considerations

Strict vs Dynamic Schema

Yes, it's here. But it's under "secondary considerations". It's something to consider, but rarely have I seen cases where it should be the focal point of the decision. If your data has a consistent, required set of fields, a strict schema may be preferable. If each record of a given entity is diverse in the fields that are actually populated, a dynamic schema might make more sense.

Exploratory ad-hoc queries

There are many situations where one might want to poke around in the data to see what's going on. This can be for insights into the business - what percentage of users do X, what is the most popular Y, etc. They might be conducive to technical troubleshooting - "do all users experiencing bug X have configuration Y enabled?". Queries that merge data from separate tables and/or perform aggregation functions can be really handy to have in these situations. SQL has a huge advantage here. In the NoSQL world, you could technically pipe all of your data to some kind of data warehouse for exploratory queries. But there's something to be said for the simplicity of being able to run ad-hoc queries on your data to answer these sorts of questions.

Aggregation Functions

Aggregate functions are one of the most powerful features of SQL. Instead of having your application retrieve thousands of records just to compute some values, it often makes sense to just have the database do the arithmetic locally and send back the end result. These aggregate functions - such as SUM, AVG, etc are often times not supported by NoSQL databases.

I list this as a secondary consideration because

Many NoSQL databases do support aggregation functions
Even if you don't have them, in many cases it's reasonable to have the same functionality in the application code

That said, it's still conceivable that an application may rely so heavily on aggregations that one may want to steer toward SQL because of it.

Example Scenarios

Let's take a look at some example applications and grapple with which database paradigm we should go with for each.

Example: Video Streaming Service Analytics

Consider a consumer facing movie streaming service. The developer of the service wishes to have an analytics platform to visualize (via charts) what people are watching, for how long, on what platform, etc.

Some access patterns that emerge from this requirement are:

For each hour-long interval of the past 24 hours, what was the number of people actively watching a movie?
Same as the above, but only for movie X
Same as above, but only the subset of users watching on a mobile device
Same as above, but for over each day long interval over the past 365 days

It's easy to see that unless we significantly constrain the available parameters, pre-computing this data is not feasible due to a combinatorial explosion of possibilities.

To handle all of this, one reasonable approach is to use a "ledger" of movie views:

We'd use GROUP BY and WHERE on this ledger data to get at what we need for a particular query. I consider SQL-like functionality to be non negotiable for this use case, so I'd choose a SQL database for this application.

A side note would be that even though we went the SQL route, the data in this ledger could potentially be denormalized to reduce the number of JOIN operations required. The decision to do so would have to carefully weigh any query performance improvement achieved against the extra storage capacity required.

Example: Journaling Platform

Let's consider a journaling application that allows users to create personal, text-based journal entries that are annotated with the date and time on which they were created.

The access patterns might be

Get a all journal entry titles and dates for user A between datetime X and datetime Y
Get the contents of journal entry X for user A
Create/update/delete journal entry at datetime X for user A

If we went the SQL route, we may just need one table with 4 fields:

But we don't really need any of SQL's superpowers for this application's access patterns. No complex data relationships, no aggregation functions, etc. This is the sort of scenario where it can make a lot of sense to use a NoSQL database.

If we used DynamoDB, we'd make a table with roughly the same fields as above, just with slightly different names:

pK refers to "partition key" and sK to "sort key".

With DynamoDB, tables have a special field called a "partition key". The value of this field determines which database partition the item will be placed in, such that all items with the same partition key are guaranteed to be co-located on the same shard. Choosing the right partition key can be challenging because queries must specify a partition key value that is to be matched exactly.

Sort keys are more powerful than their name would lead you to believe - they do allow you to sort the results of a query, but they also allow you to make your queries more specific. Queries can specify things like "greater than", "less than", "between", "begins with" and others with respect to the value in the sort key field. That gives us the ability to query for journal entries in a specific time range, and also sort the results as we see fit.

This schema and the constraints imposed on queries may seem downright barbaric. But what we get in exchange for these things is beautiful. In this example we're using DynamoDB specifically, but other NoSQL databases offer similar tradeoffs. As our data size and traffic volume fluctuates, DynamoDB can shard and scale this table automatically to meet demand. That means it can add and remove shards in real time as they are needed, automatically shuttling data between database nodes. No intervention is required by the developer and no interruption to the application occurs. It's hard to fully articulate how valuable that is.

Example: Blogging Platform

In the first two examples, it wasn't too hard to see which database paradigm was the best fit. In this one, things won't be quite as clear.

Consider a blogging platform like Medium. It gives authors a way to publish their articles, follow other authors that they like, and add comments to articles.

There's a very natural, arguably obvious normalized schema that can be built around this:

But let's consider some access patterns

Get a list of articles written by a specific author, sorted from newest to oldest
Get an article along with all of the comments made on it
Create/update/delete an article
Create/update/delete a comment on an article
Change author's name
Get a list of authors followed by an author
Get a list of authors that follow an author

After considering these, it turns out that we can forego SQL and a normalized schema. Let's consider what it would look like if we used DynamoDB for this application. Earlier I warned that SQL aficionados would have a visceral reaction to this approach - so if that's you, brace yourself.

There would be one table. It will hold articles, comments, author follows, and author profiles. You read that right. Why store all of these entities in a single DynamoDB table? Generally speaking, storing as many different types of entities in a single table is an idiom that is encouraged in the DynamoDB world. The reasons for that are outside the scope of this article. But specifically in this case, it affords us the ability to retrieve all 4 entities in a single query if we need to.

For example, we can get an article and all comments on that article with one query. To get article 2024-01-01T00:04.00 by author id abc in addition to all comments on that article, we can query for pK = abc and sK starts with 2024-01-01T00:04.00 . To facilitate this access pattern, we deliberately prefixed comment sort keys with the id of the associated article.

No JOIN between disparate tables required.

With this schema, we can actually accommodate all of the access patterns enumerated above, except one.

To get a list of authors that follow a specific author, we will have to create something called a Global Secondary Index on the target_id field. This will allow us to query using exact equality on that field, something by default we could only do on the pK field.

This example is a bit more ambiguous than the others in terms of which route to go for the database. I could be talked into going the SQL route, but I'd feel much better about the scalability narrative with the aforementioned approach.

Conclusion

One of my goals is to convey that the decision between database paradigms is nuanced and deserving of a bit more attention than it sometimes gets. It's easy to fall into the trap of becoming attached to a specific paradigm and using it on every new project. The reality is that every database paradigm has some kind of superpower, and we shouldn't overlook the ones that make the most sense for a given project.

The unfortunate reality is that the needs of some applications are sometimes at odds with many of the core features and idioms of the SQL world, so sometimes we have to go a different route. SQL is a timeless language that provides incredible query flexibility, making it well suited for working with normalized data. But sometimes it makes sense to forfeit that power in situations where it would be under-utilized, and grab a more appropriate superpower off the shelf instead.

What superpower are you going to choose for your next project?