
February 22, 2024
Decision Nodes

Beyond Normalization

Raul Junco, Software Engineering Manager

Raul has been developing backend systems for almost 20 years. He is a Software Engineering Manager by day and a writer by night, sharing his experiences from the field. Raul simplifies software engineering and systems design to help other software engineers. You can find his writing weekly on LinkedIn and X.

Databases are at the core of 90% of the software we build. We need to store our customers, our orders, and our transactions.

I spent years learning:

  • How to name tables.
  • How to decide which columns belong in each table.
  • How to make sure I don’t repeat my data.
  • How to store my data in the most efficient way.

This is what we call Normalization, a foundational principle in database design.


Over the years, normalization has been the bedrock of designing efficient, reliable databases.

Normalization aims to reduce redundancy and improve data integrity.

The idea is to organize data into tables that minimize duplication and to define relationships between those tables.

But, the digital landscape has changed.

The volume of data we handle now is staggering. Look around your house: how many devices do you have that produce data you can access?

The speed has also changed. You want insights from that data, you want to make decisions based on those insights, and you want them within a click.

This has pushed the boundaries of traditional database design.

Normalization is no longer the be-all and end-all of database design. It is still critical, don’t get me wrong, but it is no longer enough on its own.

Here are three cases where normalization can fall short.

Real-time processing needs

If you need real-time data processing, you need low latency, and normalization doesn't play well with low latency.

Some duplication makes read operations faster, even at the cost of more complex writes or possible inconsistency.

Microservices architecture

With microservices architectures, databases are now often distributed and decentralized. Each service might have its own database optimized for its specific needs.

What’s more, different data models and storage mechanisms can coexist across services.

Data warehousing and analytics

Normalization works against analytical workloads. Joins are not your best friends when you analyze large volumes of historical data. In this context, faster read times matter more than storage costs.

Normalization can introduce performance overhead.

Since normalization splits data into many tables to avoid redundancy, it creates a side effect: now you have to join multiple tables to read the data back.

This works fine for small and medium datasets, but in large datasets, all those joins degrade performance.

What's the solution?

We have to unlearn a little and pick up new techniques. And because it is the opposite of normalization, we call it Denormalization.

Denormalization introduces redundancy to improve performance, particularly for read-heavy operations. It is often used as a counterbalance to normalization.

The goal is to reduce the joins that degrade performance. As an additional result, you reduce the query complexity, too. By denormalizing a database, you can make certain operations more efficient.

How to denormalize?

I will divide it into 5 steps to help you:

1. Choose the strategy:

  • If the data doesn’t change often, you can duplicate it across tables to avoid joins.
  • If two tables are often joined, you can combine them into a single table. But make sure the merged table maintains all the necessary information.
  • If you run reporting and analytical queries, look for opportunities to pre-calculate aggregates. Pre-computed aggregates will save a lot of processing time (see the sketch after this list).
  • Use Materialized Views to store the result of a complex query. This form of logical denormalization doesn’t require changing the underlying table structures.
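
The pre-computed aggregates strategy is the easiest one to sketch. Below is a minimal example using Python’s built-in sqlite3; SQLite has no materialized views, so a summary table refreshed by a job stands in for one, and the table and column names are hypothetical.

```python
import sqlite3

# A sketch of the "pre-computed aggregates" strategy. The order_details and
# order_totals tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_details (
        order_id   INTEGER,
        product_id INTEGER,
        quantity   INTEGER,
        unit_price REAL
    );

    -- Denormalized, pre-computed aggregate: one row per order.
    CREATE TABLE order_totals (
        order_id     INTEGER PRIMARY KEY,
        total_amount REAL
    );

    INSERT INTO order_details VALUES
        (1, 10, 2, 10.0),
        (1, 11, 1, 4.5),
        (2, 10, 3, 10.0);
""")

# Refresh the aggregate in one pass (for example, from a scheduled job), so
# reporting queries read one small table instead of re-aggregating the
# detail rows on every request.
conn.execute("DELETE FROM order_totals")
conn.execute("""
    INSERT INTO order_totals (order_id, total_amount)
    SELECT order_id, SUM(quantity * unit_price)
    FROM order_details
    GROUP BY order_id
""")

print(conn.execute("SELECT * FROM order_totals ORDER BY order_id").fetchall())
# [(1, 24.5), (2, 30.0)]
```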

2. Identify the tables and columns:

  • Use profiling tools to pinpoint where database performance is lagging (see the sketch after this list).
  • Look for tables often accessed for reading but less often for writing.
  • If certain tables are always joined in queries, consider merging them or duplicating some of their data.
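
Profiling tooling varies by engine (pg_stat_statements in PostgreSQL, the slow query log in MySQL, and so on). As a rough, engine-agnostic sketch of the idea, SQLite’s EXPLAIN QUERY PLAN shows which tables a frequently run query scans and joins; the schema below is hypothetical.

```python
import sqlite3

# Profiling tools differ per engine (pg_stat_statements in PostgreSQL, the
# slow query log in MySQL, ...). As a rough stand-in, SQLite's EXPLAIN QUERY
# PLAN shows which tables a frequently run query scans and joins.
# The schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders        (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER, quantity INTEGER);
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT);
""")

frequent_query = """
    SELECT o.id, p.name, d.quantity
    FROM orders o
    JOIN order_details d ON d.order_id = o.id
    JOIN products p      ON p.id = d.product_id
"""

# Each row of the plan describes a scan or an index search. A query that runs
# constantly and touches three tables is a denormalization candidate.
for row in conn.execute("EXPLAIN QUERY PLAN " + frequent_query):
    print(row)
```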

3. Implement the changes:

  • Change the database schema to introduce the denormalization changes. This might involve adding columns, changing table structures, or creating new tables.
  • Adjust the application's data access layer to leverage the denormalized structures. Make sure that any write operations maintain the consistency of redundant data.
  • Migrate existing data to fit the new schema. This might involve populating new columns, merging tables, or recalculating aggregates (see the sketch after this list).
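
Here is a minimal sketch of those three moves in sqlite3. The categories/products pair is hypothetical; the point is the pattern: change the schema, backfill existing rows, then let reads skip the join.

```python
import sqlite3

# A sketch of the three moves: schema change, data migration, and a read path
# that no longer needs the join. The categories/products schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products   (id INTEGER PRIMARY KEY, category_id INTEGER, name TEXT);

    INSERT INTO categories VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO products   VALUES (10, 1, 'SQL 101'), (11, 2, 'Chess Set');
""")

# 1. Schema change: add the redundant column.
conn.execute("ALTER TABLE products ADD COLUMN category_name TEXT")

# 2. Data migration: backfill the new column from the normalized source.
conn.execute("""
    UPDATE products
    SET category_name = (SELECT name FROM categories WHERE id = products.category_id)
""")

# 3. Data access layer: reads no longer need the join.
print(conn.execute("SELECT name, category_name FROM products ORDER BY id").fetchall())
# [('SQL 101', 'Books'), ('Chess Set', 'Games')]
```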

4. Maintain data integrity:

Maintaining data integrity is your tradeoff (yes, there is no free lunch in denormalization).

  • Set up application logic or database triggers to maintain consistency across denormalized data. This is one of the simpler mechanisms, but it only works within the same database; if your data is spread across many databases, you have to use a more robust technique (see the sketch after this list).
  • Set up monitoring to detect anomalies that signal inconsistencies. This can work globally: you can set up jobs (or serverless functions) to run consistency checks that compare counts, sums, and other aggregates between tables (in different data sources) that should be in sync. Take into account that some of these anomalies can’t be auto-healed, so the other component of this approach is alerting; you need humans in the loop.
  • Batch processes are also a good strategy to maintain consistency. This approach separates the denormalization process from the application workflow. Schedule reconciliation jobs to run during off-peak hours and correct discrepancies in denormalized data. This type of update runs at intervals rather than in real time, which is less resource-intensive and less disruptive.
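
A minimal sketch of the trigger mechanism plus a basic consistency check, assuming everything lives in one SQLite database. Table and column names are hypothetical.

```python
import sqlite3

# A sketch of the trigger mechanism plus a basic consistency check, assuming
# everything lives in one SQLite database. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_details (
        order_id INTEGER, product_id INTEGER, product_name TEXT
    );

    INSERT INTO products      VALUES (10, 'Keyboard');
    INSERT INTO order_details VALUES (1, 10, 'Keyboard');

    -- Trigger: propagate product renames to the denormalized copy.
    CREATE TRIGGER sync_product_name
    AFTER UPDATE OF name ON products
    BEGIN
        UPDATE order_details
        SET product_name = NEW.name
        WHERE product_id = NEW.id;
    END;
""")

conn.execute("UPDATE products SET name = 'Mechanical Keyboard' WHERE id = 10")
print(conn.execute("SELECT product_name FROM order_details").fetchall())
# [('Mechanical Keyboard',)]

# Consistency check: rows whose denormalized copy drifted from the source of
# truth. Across databases, a scheduled job would run a check like this and
# alert a human when it finds anything.
drifted = conn.execute("""
    SELECT d.order_id, d.product_name, p.name
    FROM order_details d
    JOIN products p ON p.id = d.product_id
    WHERE d.product_name <> p.name
""").fetchall()
print(drifted)  # [] when the data is consistent
```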

There is no single “best” approach here because it depends on your context, and I have found that a combination works best.

5. What gets measured gets improved:

  • Before and after denormalizing, run performance tests. Measure the impact on query execution times, system throughput, and general application performance (see the sketch after this list).
  • Denormalization is not a one-size-fits-all solution. In the real world, you will have to revisit the decisions on many occasions.
  • If system usage increases (a good problem), you will discover new bottlenecks; this is why it is so critical to have monitoring in place.
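
A rough sketch of that before/after measurement: time the same report over the normalized (join) path and the denormalized path with time.perf_counter. The schema and row counts below are made up; real tests belong on production-sized data with your engine’s own tooling.

```python
import sqlite3
import time

# Time the same report over the normalized (join) path and the denormalized
# path. Schema and row counts are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER, product_name TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, f"product-{i}") for i in range(1_000)])
conn.executemany("INSERT INTO order_details VALUES (?, ?, ?)",
                 [(i, i % 1_000, f"product-{i % 1_000}") for i in range(200_000)])

def timed(label: str, sql: str) -> None:
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("normalized (join)",
      "SELECT d.order_id, p.name FROM order_details d "
      "JOIN products p ON p.id = d.product_id")
timed("denormalized",
      "SELECT order_id, product_name FROM order_details")
```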

When do you need denormalization?

  1. In data warehousing and business intelligence scenarios. When the focus is on complex queries and data aggregation for reporting.
  2. For applications where read performance is critical and writes are relatively infrequent.
  3. In a microservices architecture. Individual services may opt for denormalization to optimize performance and reduce data aggregation.
  4. In cases where you have hierarchical data. Hierarchical data is hard to query.

A practical example: an e-commerce application

Every e-commerce application needs at least four tables.

  • Products
  • Orders
  • OrderDetails
  • Customers

School taught us to normalize the schema and store product and customer information in their own tables: products and customers.

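In sqlite3 terms, that normalized schema might look like the sketch below; the exact columns are illustrative, not a prescription.

```python
import sqlite3

# A minimal sketch of the normalized e-commerce schema; the exact columns
# are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );

    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        price REAL
    );

    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        created_at  TEXT
    );

    CREATE TABLE order_details (
        order_id   INTEGER REFERENCES orders(id),
        product_id INTEGER REFERENCES products(id),
        quantity   INTEGER,
        unit_price REAL
    );
""")
```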

The request

The sales department wants to check whether the new marketing campaign is working. They ask us to print a summary of orders, including product names and the total amount for each order.

The problem

To get this info, you must join the orders, orderDetails, and products tables.

The current normalized schema forces us to do an expensive join with the products table.

Once you reach millions of records, that will be slow and computationally costly.
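
Here is a sketch of that query against the normalized schema (illustrative column names, a couple of made-up rows): every order summary read touches three tables.

```python
import sqlite3

# The order summary under the normalized schema: every read joins three
# tables. Self-contained sketch with a couple of made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders        (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER,
                                quantity INTEGER, unit_price REAL);

    INSERT INTO products      VALUES (10, 'Keyboard'), (11, 'Mouse');
    INSERT INTO orders        VALUES (1, 100);
    INSERT INTO order_details VALUES (1, 10, 1, 50.0), (1, 11, 2, 20.0);
""")

summary = conn.execute("""
    SELECT o.id,
           p.name,
           d.quantity * d.unit_price AS line_total
    FROM orders o
    JOIN order_details d ON d.order_id = o.id
    JOIN products p      ON p.id = d.product_id
    ORDER BY o.id, p.name
""").fetchall()
print(summary)
# [(1, 'Keyboard', 50.0), (1, 'Mouse', 40.0)]
```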

The solution

The name of a product doesn’t change often. We can modify the orderDetails table to include ProductName. This saves us that expensive join because we no longer need to touch the products table.

The same approach applies to the customer’s name, but in the orders table.

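A sketch of that change (illustrative column names): add the two redundant columns, then backfill them once from the source tables.

```python
import sqlite3

# Denormalize: add the redundant columns and backfill them once from the
# normalized source tables. Column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders        (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER,
                                quantity INTEGER, unit_price REAL);

    INSERT INTO customers     VALUES (100, 'Ada');
    INSERT INTO products      VALUES (10, 'Keyboard');
    INSERT INTO orders        VALUES (1, 100);
    INSERT INTO order_details VALUES (1, 10, 1, 50.0);
""")

# Schema change: add the redundant columns.
conn.executescript("""
    ALTER TABLE order_details ADD COLUMN product_name  TEXT;
    ALTER TABLE orders        ADD COLUMN customer_name TEXT;
""")

# One-off backfill from the normalized source tables.
conn.executescript("""
    UPDATE order_details
    SET product_name = (SELECT name FROM products
                        WHERE products.id = order_details.product_id);

    UPDATE orders
    SET customer_name = (SELECT name FROM customers
                         WHERE customers.id = orders.customer_id);
""")

print(conn.execute("SELECT order_id, product_name FROM order_details").fetchall())
# [(1, 'Keyboard')]
print(conn.execute("SELECT id, customer_name FROM orders").fetchall())
# [(1, 'Ada')]
```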

Pros

  • Printing an order summary now requires fewer joins and is faster.
  • The application code for querying order summaries becomes simpler.

Cons

  • If a product name changes, you must propagate the change to every relevant row in the orderDetails table.
  • If a customer name changes, you must propagate the change to every relevant row in the orders table.
  • Both increase the complexity of update operations and the risk of data inconsistency.

What are the benefits?

Performance optimization and throughput

  • Denormalization will reduce the complexity of queries by putting data together and avoiding joins.
  • Denormalization enables faster reads. This is critical in systems where read operations outnumber write operations, reducing the response time.

Simplified queries

  • Storing data in a less normalized form produces queries that are easier to write and understand.
  • This saves development time and reduces the likelihood of errors in query logic.

Following the practical example, notice how just two extra columns simplify the query below.

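A sketch of that simplified query (illustrative names): the summary now reads the denormalized columns directly, with no join to the products or customers tables.

```python
import sqlite3

# The same order summary against the denormalized tables: no join to
# products or customers. Column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders        (id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE order_details (order_id INTEGER, product_name TEXT,
                                quantity INTEGER, unit_price REAL);

    INSERT INTO orders        VALUES (1, 'Ada');
    INSERT INTO order_details VALUES (1, 'Keyboard', 1, 50.0), (1, 'Mouse', 2, 20.0);
""")

summary = conn.execute("""
    SELECT o.id, o.customer_name, d.product_name,
           d.quantity * d.unit_price AS line_total
    FROM orders o
    JOIN order_details d ON d.order_id = o.id
    ORDER BY o.id, d.product_name
""").fetchall()
print(summary)
# [(1, 'Ada', 'Keyboard', 50.0), (1, 'Ada', 'Mouse', 40.0)]
```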

Real-world databases need smart designs.

One common misconception is that the DB should be one way or the other, and that’s not true.

Sometimes, you find the perfect schema by combining the two approaches.

Most applications have different workloads. Some parts are read-heavy (such as user dashboards or reports). Other parts are write-heavy (such as transaction processing systems).

A mixed approach allows for:

  • Optimized read performance where necessary, by denormalizing the parts of the schema that are often accessed together.
  • Maintained data integrity and minimized storage for the parts of the application where transactional consistency is critical.

TL;DR

Denormalization optimizes database read performance. However, it introduces redundancy, increases storage needs, and complicates data maintenance.

Denormalization speeds up data retrieval by reducing the need for complex joins and aggregations at query time.

However, it requires careful handling to avoid data inconsistencies. In a denormalized schema, every insert, update, or delete operation potentially affects multiple copies of the data.

Denormalization is a strategic decision. This decision should be a balance between performance improvements and data integrity.

Denormalization is especially beneficial for read-heavy applications where performance is critical. Storing aggregated data, such as counts, sums, or averages, eliminates the need to recalculate them at query time.

A combination of normalized and denormalized schemas can be your sweet spot.

You have the tool now; what are you waiting for to make that report faster?