Denormalization pattern
Denormalize the data and store it as column families in Cassandra. This is a very common practice in NoSQL data stores, and there are several reasons to do it in Cassandra. The most important one is that Cassandra does not support joins between column families. Redundancy is acceptable because storage is cheap, and this argument is even more compelling for Cassandra: it runs on commodity hardware, whereas many RDBMS deployments need far better hardware specifications to perform optimally in production. Moreover, read and write operations remain highly efficient even when column families are huge in terms of the number of columns or rows. In a traditional RDBMS, you can create multiple indexes on various columns of a single table, but in Cassandra, secondary indexes are very costly and they affect the performance of reads and writes.
Motivations/solutions
In many situations, when a new requirement comes in, thinking in the traditional RDBMS way leads to problems such as poor read/write performance, long-running processes, overly complex queries, and so on. In these situations, one of the best approaches is to apply denormalization principles and design column families in Cassandra accordingly.
In the traditional RDBMS, operational tables contain data related to the current state of the entities and objects involved, so maintaining lookups to preserve data integrity is perfectly sensible. But when you have to maintain history, the concept of lookups does not work. For example, when you generate a monthly bank account statement, it should reflect the account holder's current address. After the statement is generated, if the address changes during the next reporting period, the previous statements must not reflect the new address; they must keep the old address, which was correct on the date the statement was generated. In such situations, it does not make sense to keep a set of normalized tables for the historical data. The best thing to do is to denormalize the data and maintain it in separate column families in Cassandra.
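As an illustration, the following is a minimal sketch of such a statement column family; the column family name and the columns are assumptions made for this example and are not part of the schema used later in this chapter:

-- A sketch of a denormalized statement column family. The address
-- columns are copied into every statement record at generation time,
-- so later address changes do not alter the historical statements.
CREATE TABLE AccountStatement (
    AccountId bigint,
    StatementDate timestamp,
    AccountHolderName text,
    MailingAddress text,
    OpeningBalance float,
    ClosingBalance float,
    PRIMARY KEY (AccountId, StatementDate)
) WITH CLUSTERING ORDER BY (StatementDate DESC);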
Complex queries are part of any system. In the RDBMS world, views or stored procedures are typically created to shield the end users who design data entry forms or generate reports from the complexity of the underlying queries. They are useful for running ad hoc queries, retrieving special sets of records, and so on. Even though this solves the complexity problem from the end user's perspective, the real problem remains unsolved: when those queries or stored procedures run, the complex joins of the tables still happen under the hood and on the fly. Because of this, the running time is longer and the processing requirements are huge. In the NoSQL world, it is better to denormalize the data and maintain it as big column families in Cassandra.
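For instance, a reporting view that joins customer, order, and line-item tables in an RDBMS can be maintained instead as a single pre-joined column family designed around the query. The following is only a sketch with assumed names, not a schema from this chapter:

-- A sketch of a query-oriented, pre-joined column family that replaces
-- a multi-table reporting view; all names here are illustrative.
CREATE TABLE CustomerOrderReport (
    CustomerId bigint,
    OrderId bigint,
    ProductName text,
    CustomerName text,
    OrderDate timestamp,
    LineTotal float,
    PRIMARY KEY (CustomerId, OrderId, ProductName)
);
-- The report becomes a single-partition read instead of a join:
-- SELECT * FROM CustomerOrderReport WHERE CustomerId = 42;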
Immutable transactions are good candidates for denormalization because they capture the present state; the references they make to other table records can be carried with the transaction record even if those references change in the future. The only use those transactions will have in the future is for read use cases. An immutable transaction means that once a transaction record is written to the system, nothing about it will change in the future. There are many real-life examples that conform to this type: banking transaction records, weather station readings, utility meter readings, system monitoring records, service monitoring records, and countless others in real-life applications. Event records originating from event management systems are possible candidates for denormalization, but caution has to be exercised if the event status changes and the same record is updated. If the event management system generates a new event record for each state change of the same event, denormalization is a good fit. Capture these denormalized records in Cassandra.
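The following is a minimal sketch of an immutable, append-only column family for weather station readings; the column family and column names are assumptions used purely for illustration:

-- A sketch of an append-only column family for weather station readings.
-- Records are written once and never updated, so denormalizing the
-- station details into each reading record is safe.
CREATE TABLE StationReading (
    StationId bigint,
    ReadingTime timestamp,
    StationName text,
    Latitude float,
    Longitude float,
    Temperature float,
    Humidity float,
    PRIMARY KEY (StationId, ReadingTime)
) WITH CLUSTERING ORDER BY (ReadingTime DESC);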
Note
Performance-boosting requirements are good situations where denormalization may be applied. Many applications perform poorly when data is being written to the RDBMS, and there is a strong possibility that this happens because a single transaction writes data into multiple tables and many indexes are being used on those tables. Careful analysis and proper characterization of the performance problems often identify data spread across multiple tables as the root cause. In such cases, denormalization of the data is the obvious option, and Cassandra is a good fit there.
Data modeling experts in the RDBMS world are typically not comfortable with denormalization because of the general belief that data integrity is maintained by the RDBMS table design itself, along with other RDBMS features such as triggers, stored procedures, and so on. Data modeling in Cassandra is different. Here, along with the data model, all the application use cases that involve data manipulation are also taken into consideration, so maintaining data integrity becomes the responsibility of the applications as well. Denormalization is the norm, and the applications using the column families are expected to handle data integrity to make sure that the data is good.
Best practices
Denormalization must be done with utmost care. Normalization avoids redundancy and promotes principles that maintain data integrity. When you denormalize, the only rule that is relaxed is redundancy; data integrity must still be maintained for a successful application. With normalized tables in the RDBMS, the primary key constraints, foreign key constraints, unique key constraints, and so on serve as watchdogs that maintain data integrity even if the applications don't care about it. Verification and validation happen at the RDBMS level as well. When moving to a NoSQL store such as Cassandra, many of these RDBMS goodies are lost, so it is the responsibility of the application designers to prevent insert anomalies, update anomalies, and delete anomalies. Even though Cassandra comes with lightweight transactions, most of the data integrity control measures have to be taken on the application's side. Cassandra security has to be used heavily to make sure that only the proper applications, with the right credentials, write data to the column families.
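For example, a lightweight transaction can guard against an insert anomaly when two application instances try to create the same record. This is only a sketch against the CustomerOrder column family defined later in this section; the values shown here are illustrative:

-- A sketch using a lightweight transaction (compare-and-set) so that a
-- record is written only if it does not already exist; the remaining
-- integrity checks stay in the application code.
INSERT INTO CustomerOrder (CustomerId, OrderId, CustomerName, Email, OrderDate, OrderTotal)
VALUES (1, 2, 'Mark Thomas', 'mt@example.com', 1433970556, 25.00)
IF NOT EXISTS;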
Example
Let's take the case of a very simple normalized relation from the RDBMS world, as shown in the following screenshot. There are two tables in the relationship: one stores the customer details and the other stores the order details of the customers. This is a one-to-many relation where every customer record in the Customer table may have zero or more order records in the Order table. These two tables are joined by the CustomerId primary key. In the Order table, CustomerId is a foreign key referring to the CustomerId of the Customer table.
When you denormalize this into a Cassandra column family, it will look like the one given in the following figure. In Cassandra, the primary key is a combination of CustomerId and OrderId in the CustomerOrder column family. The CustomerId column becomes the partition key for this column family.
The denormalized Cassandra column family has all the fields of its normalized counterpart. The following script can be used to create the column family in Cassandra. The sequence of activities in the script is as follows:
- Create the key space.
- Create the column family.
- Insert one record into the column family.
One record is inserted into the column family to demonstrate the difference between the physical layout of the rows stored in Cassandra and how queries return the same records.
Tip
It is very important to make sure that the physical layout of the column family is as expected and to see how the columns are getting stored. To view the physical layout of the records, the old Cassandra CLI (cassandra-cli) must be used. The Cassandra CLI is used throughout this book wherever there is a need to view the physical layout of the data in the column families.
Whenever a new column family is defined in Cassandra, it is very important to have an understanding of its physical and logical views; this helps in characterizing the column family's growth and other behaviors. The following script is to be executed in cqlsh:
CREATE KEYSPACE PacktCDP1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
USE PacktCDP1;
CREATE TABLE CustomerOrder (
    CustomerId bigint,
    OrderId bigint,
    CustomerName text static,
    Email text static,
    OrderDate timestamp,
    OrderTotal float,
    PRIMARY KEY (CustomerId, OrderId)
) WITH CLUSTERING ORDER BY (OrderId DESC);
INSERT INTO CustomerOrder (CustomerId, OrderId, CustomerName, Email, OrderDate, OrderTotal)
VALUES (1, 1, 'Mark Thomas', 'mt@example.com', 1433970556, 112.50);
A great deal of attention needs to be given to choosing the primary key for a given column family. Conceptually, this is totally different from the RDBMS world. It is true that a primary key uniquely identifies a row, and it may be an individual column or a combination of multiple columns. The differentiation in Cassandra is the way in which a row is stored in the physical nodes of Cassandra. The first column of the primary key combination is known as the partition key, and all the rows in a given column family with the same partition key are stored in the same physical Cassandra node. The commands in the following script are to be executed in the Cassandra CLI interface:
USE PacktCDP1;
list CustomerOrder;
Using default limit of 100
Using default cell limit of 100
RowKey: 1
=> (name=1:, value=, timestamp=1433970886092681)
=> (name=1:customername, value=4d61726b2054686f6d6173, timestamp=1433970886092681)
=> (name=1:email, value=6d74406578616d706c652e636f6d, timestamp=1433970886092681)
=> (name=1:orderdate, value=000000005578a77c, timestamp=1433970886092681)
=> (name=1:ordertotal, value=42e10000, timestamp=1433970886092681)
1 Row Returned.
In the output, take a look at the row key. In the preceding example, as per the primary key used, the CustomerId field is the row key. In other words, for every CustomerId, there will be one wide row. It is termed wide because all the records of a given CustomerId, which is the partition key, are stored in one row; one physical row of the Cassandra column family contains many records. For the use cases for which the column family is designed, it is important to make sure that, if the row keeps growing, it will not run out of the prescribed number of columns. If it will, then the design has to be revisited and a more appropriate partition key has to be identified. At the same time, it is not economical to have column families with only a very few rows and very few columns.
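Because all the orders of a customer live in one partition, a query restricted to the partition key is served from a single wide row. The following query against the CustomerOrder column family illustrates this; the WHERE clause is an added example and is not part of the original script:

-- Returns all the order records stored in the single wide row whose
-- partition key (row key) is 1.
SELECT * FROM CustomerOrder WHERE CustomerId = 1;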
In a typical RDBMS, the customer details will be in one or more tables, and the transaction records, such as order details, will be in another table. But here, denormalization is applied, and the data coming from those two logical entities is captured in one Cassandra column family.
The following CQL SELECT command gives the output in a human-readable format:
SELECT * FROM CustomerOrder;
The output can be seen in this screenshot: