Data Modeling — Designing Facebook style “comments” with SQL

Neha Jirafe
5 min readJul 9, 2020

--

Let’s agree , we all have used the “comments” section of some or the other social networking site.

Now lets see how we can model this in a SQL database.

Level 1 comments :

Taking the example of Facebook , In a very simplistic terms a user “comments” on a post .The simplest form of comments are level 1 comments

  • A user creates a post.
  • A user comments on a post.

Pretty straight forward isn’t it , now lets add some complexity to it

Level 2 comments :

Now the user wants to respond to a comment on his post, for simplicity lets keep this to only one level down.

A user creates a post.

A user comments on a post.

  • A user comments(sub comment) on a comment.

Now let see how we can query the comments on a post “1111” by each level

SELECT 
post_id,
parent.comment as level1_comment,
child.comment as level2_comment
FROM
Comment parent , Comment child
WHERE
post_id = '1111' and
child.parent_comment_id = parent.comment_id

Things are still good until we have only 2 levels

Level “n” comments :

Lets imagine what if we had “n” levels nested comments like in “whatsapp” , How would the query look like ? To query level 6 comment , we will have to self join the data 6 times.

Approach 1 — Repeated Self-joins :

Repeated self joins can get quite complicated, these are also one of the most inefficient queries to execute, slowing down the performance

SELECT
post_id,
p1.comment as level1_comment,
p2.comment as level2_comment,
p3.comment as level3_comment,
p4.comment as level4_comment,
p5.comment as level5_comment,
p6.comment as level6_comment
FROM
Comment p1
LEFT JOIN
Comment p2 on p2.parent_comment_id = p1.comment_id
LEFT JOIN
Comment p3 on p3.parent_comment_id = p2.comment_id
LEFT JOIN
Comment p4 on p4.parent_comment_id = p3.comment_id
LEFT JOIN
Comment p5 on p5.parent_comment_id = p4.comment_id
LEFT JOIN
Comment p6 on p6.parent_comment_id = p5.comment_id
WHERE
post_id = '1111' and

This model can quickly go out of control once your database starts scaling.

Approach 2— Recursive CTE :

MySQL support Recursive CTE (Common Table Expression) , the recursive CTE can be used to create a Path-style Identifiers , to query the database. However this just simplifies the query syntax and data representation , it by no means reduces the complexity of the underlying query engine.

WITH RECURSIVE cte AS 
( SELECT
comment,
comment_id AS path,
user_id,
post_id
FROM
Comment
WHERE parent_comment_id IS NULL
UNION ALL
SELECT
comment,
CONCAT(parent.path,'/',child.name)comment_id AS comment_id,
user_id,
post_id
FROM
Comment parent , Comment child
WHERE
child.parent_comment_id = parent.comment_id )
SELECT * FROM cte;

The data can now looks like , denormalized and flattened.

path             | comment                | user_id     | post_id
------------------------------------------------------------------
1 | I am comment 1 | .... | 1111
1/7 | I am comment 7 | .... | 1111
2 | I am comment 2 | .... | 1111
2/3 | I am comment 3 | .... | 1111
2/3/4 | I am comment 4 | .... | 1111
2/3/4/5 | I am comment 5 | .... | 1111
2/3/4/5/6 | I am comment 6 | .... | 1111

Alternative Data Model — Path-style Identifiers :

What if we create a data model to store the data in Path-style Identifier? Here we remove the self reference in “comment” table (rabbit ear) and add a new column “path”

You can now query parent comments by comparing the current row’s path to a pattern formed from the path of another row. For example, to find ancestors of comment #5 , whose path is 2/3/4/5 , do this

SELECT *
FROM
Comment AS c
WHERE (SELECT
path
FROM
Comment
WHERE
comment_id = 5) LIKE c.path || '%';

This matches the patterns formed from paths of ancestor 2/3/4/% , 2/3/% and 2/%

Drawbacks Path-style Identifiers:

The path enumeration has some drawbacks, The database can’t enforce that the path is formed correctly or that the values in the path correspond to existing comments.

Another thing to note is that VARCHAR column has a limit , and the depth of your path will be limited to this limit, In most cases it wont be problem.

Alternative Data Model — Closure Tables :

The Closure Table solution is a simple and elegant way of storing hierarchies. It involves storing all paths through. the tree , not just those with a direct parent-child relationship.

Instead of using the “Comment” table to store information about the parent-child relationship , lets create a lookup table “parent_child_comment”.

parent_comment_id             | child_comment_id 
-----------------------------------------------------
1 | 1
1 | 7
2 | 2
2 | 3
2 | 4
2 | 5
2 | 6
3 | 3
3 | 4
3 | 5
3 | 6
4 | 4
4 | 5
4 | 6
5 | 5
5 | 6
6 | 6

Queries

The query to retrieve all the child comments for comment #3 will be as follows

SELECT *
FROM
Comment AS c
JOIN parent_child_comment p ON c.comment_id = p.child_comment_id
WHERE
p.parent_comment_id = 3;

The query to retrieve all the parent comments for comment #6 will be as follows

SELECT c.*
FROM
Comment AS c
JOIN parent_child_comment p ON c.comment_id = p.child_comment_id
WHERE
p.child_comment_id = 6;

Closure Table is the most versatile design for modeling multilevel deep hierarchies , it requires additional table with a lot of rows , and the extra space is a tradeoff for reducing computing efficiencies

Note : The cost for other operations INSERT , UPDATE and DELETE should also be considered while creating a Data Model. Choose the one that suits your application requirement on the basis of the efficiencies of these operations.

Using a NO-SQL data model is another option for such problems and the design constraints should be considered equally.

Happy Reading !

--

--