Setting up a project involves creating a keyspace and tables. This can be done via the CQL shell or from your favorite programming language.
Here are the statements to create the schema:
Note
Universally unique identifiers: uuid
and timeuuid
In the preceding CQL statements, there are two interesting data types—uuid
and timeuuid
. uuid
stands for universally unique identifier. There are five types of them. One of these uuid
types is timeuuid
, which is essentially uuid
type 1 that takes timestamp
as its first component. This means it can be used to sort things by time. This is what we wanted to do in this example: sort posts by the time they were published.
On the other hand, uuid
accepts any of these five types of uuid
as long as the format follows the standard uuid
format.
In Cassandra, if you have chosen the uuid
type for a column, you will need to pass uuid
while inserting the data. With timeuuid
, just passing timestamp
is enough.
The first statement requests Cassandra to create a keyspace named weblog
with replication factor 1 because we are running a single node Cassandra on a local machine. Here are a couple of things to notice:
- The
column
tags in the posts
table is a set of strings. - The primary key for
posts
, categories
, and comments
has more than one component. The first of these components is a partition key. Data with the same primary key in a table resides on the same machine. This means, all the posts' records that belong to one blog stays on one machine (not really; if the replication factor is more than one, the records get replicated to as many machines). This is true for all the tables with composite keys. - Categories have three components in its primary key. One is the category name, which is the partition key, another is the blog ID, and then the post ID. One can argue that inclusion of the post ID in the primary key was unnecessary. You could just use the category name and blog ID. The reason to include the post ID in the primary key was to enable sorting by the post ID.
- Note that some of the IDs in the table definition are
timeuuid
. The timeuuid
data type is an interesting ID generation mechanism. It generates a timestamp-based (provided by you) uuid
, which is unique and you can use it in applications where you want things to be ordered by chronology.
This section demonstrates inserting the records in the schema. Unlike RDBMS, you will find that there are some redundancies in the system. You may notice that you cannot have a lot of rules enforced by Cassandra. It is up to the developer to make sure the records are inserted, updated, and deleted from appropriate places.
Note
Note that the CQL code is just for instruction purposes and is just a snippet. Your output may vary.
We will see a simple INSERT
example now:
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The application would generate uuid
or you will get uuid
from an existing record in the blogs
table based on a user's e-mail address or some other criteria. Here, just to be concise, the uuid
generation is left to Cassandra, and it is retrieved by running the SELECT
statement. Let's insert some posts to this blog:
# First post
cqlsh:weblog> INSERT INTO posts (id, blog_id, title, content, tags, posted_on) VALUES (now(), 83cec740-22b1-11e4-a4f0-7f1a8b30f852, 'first post', 'hey howdy!', {'random','welcome'}, 1407822921000);
cqlsh:weblog> SELECT * FROM posts;
blog_id | id | content | posted_on | tags | title
------------+-----------+------------+--------------------------+-----------------------+------------
83cec... | 04722... | hey howdy! | 2014-08-12 11:25:21+0530 | {'random', 'welcome'} | first post
(1 rows)
cqlsh:weblog> INSERT INTO categories (cat_name, blog_id, post_id, post_title) VALUES ( 'random', 83cec740-22b1-11e4-a4f0-7f1a8b30f852, 047224f0-22b2-11e4-a4f0-7f1a8b30f852, 'first post');
cqlsh:weblog> INSERT INTO categories (cat_name, blog_id, post_id, post_title) VALUES ( 'welcome', 83cec740-22b1-11e4-a4f0-7f1a8b30f852, 047224f0-22b2-11e4-a4f0-7f1a8b30f852, 'first post');
# Second post
cqlsh:weblog> INSERT INTO posts (id, blog_id, title, content, tags, posted_on) VALUES (now(), 83cec740-22b1-11e4-a4f0-7f1a8b30f852, 'Fooled by randomness...', 'posterior=(prior*likelihood)/evidence', {'random','maths'}, 1407823189000);
cqlsh:weblog> select * from posts;
blog_id | id | content | posted_on | tags | title
------------+-----------+---------------------------------------+--------------------------+-----------------------+-------------------------
83cec.... | 04722... | hey howdy! | 2014-08-12 11:25:21+0530 | {'random', 'welcome'} | first post
83cec... | c06a4... | posterior=(prior*likelihood)/evidence | 2014-08-12 11:29:49+0530 | {'maths', 'random'} | Fooled by randomness...
(2 rows)
cqlsh:weblog> INSERT INTO categories (cat_name, blog_id, post_id, post_title) VALUES ( 'random', 83cec740-22b1-11e4-a4f0-7f1a8b30f852, c06a42f0-22b2-11e4-a4f0-7f1a8b30f852, 'Fooled by randomness...');
cqlsh:weblog> INSERT INTO categories (cat_name, blog_id, post_id, post_title) VALUES ( 'maths', 83cec740-22b1-11e4-a4f0-7f1a8b30f852, c06a42f0-22b2-11e4-a4f0-7f1a8b30f852, 'Fooled by randomness...');
Note
You may want to insert more rows so that we can experiment with pagination in the upcoming sections.
You may notice that the primary key, which is of type timeuuid
, is created using Cassandra's built-in now()
function, and we repeated the title in the categories
table. The rationale behind repetition is that we may want to display the title of all the posts that match a tag that a user clicked. These titles will have URLs to redirect us to the posts (a post can be retrieved by the blog ID and post ID). Alternatively, Cassandra does not support a relational connect between two tables, so you cannot join categories and posts to display the title. The other option is to use the blog ID and post ID to retrieve the post's title. However, that's more work, and somewhat inefficient.
Let's insert some comments and upvote and downvote some posts and comments:
Counters are always inserted or updated using the UPDATE
statement.
Now that we have data inserted for our application, we need to retrieve it. To blog applications, usually the blog name serves as the primary key in their database. So, when you request cold-caffein.blogspot.com, a blog metadata table with the blog ID as cold-caffein
exists. We, on the other hand, can use the blog uuid
to request to serve the contents. So, we assume that having the blog ID is handy.
Let's display posts. We should not load all the posts for the user upfront. It is not a good idea from the usability point of view. It demands more bandwidth, and it is probably a lot of reads for Cassandra. So first, let's pull two posts at a time from ones posted earlier:
This was the first page. For the next page, we can use an anchor. We can use the last post's ID as an anchor, as its timeuuid
increases monotonically with time. So, posts older than that will have the post ID with smaller values, and this will work as our anchor:
You can retrieve the posts on the next page as follows:
Now for each post, we need to perform the following tasks:
- Pull a list of comments
- Up and downvotes
- Load comments as follows:
- Individually fetch counters for each post and comment as follows:
Now, we want to facilitate the users of our blogging website with the ability to click on a tag and see a list of all the posts with that tag. Here is what we do:
We can obviously use the pagination and sorting here. I think you have got the idea.
Sometimes, it is nice to see what people generally comment. It would be great if we could find all the comments by a user. To make a nonprimary key field searchable in Cassandra, you need to create an index on that column. So, let's do that:
This completes all the requirements we stated. We did not cover the update
and delete
operations. They follow the same pattern as the insertion of records. The developer needs to make sure that the data is updated or deleted from all the places. So, if you want to update a post's title, it needs to be done in the posts
and category
tables.