MongoDB Oplog to SQL Parser exercise

Problem Statement Story 1 (parsing insert oplog)Story 2 (parsing update oplog)Story 3 (parsing delete oplog)Story 4 (create table with one oplog entry)Story 5 (create table with multiple oplog entries)Story 6 (alter table with multiple oplog entries)Story 7 (handle nested Mongo documents)Story 8 (reading oplogs from a file)Story 9 (reading oplogs from MongoDB)Story 10 (Bookmarking Support - nice to have)Story 11 (Distributed Execution - nice to have)Overall Instructions

Problem Statement

Write a program to parse MongoDB operations log (oplog) and generate equivalent SQL statements.

We have a scenario where an organization used MongoDB initially but now needs to move to an RDBMS database. This data transition can be made easy if we can find a way to convert the JSON documents in MongoDB collections to equivalent rows in relational DB tables. That's the purpose of this program.

The MongoDB server generates the Oplog, an ordered collection of all the write operations (insert, update, delete) to the MongoDB. Your job is to parse these oplogs and generate equivalent SQL statements.

There’s already an open-source tool, stampede, that converts MongoDB oplogs to SQL; we are simply attempting to develop an implementation in Go.

A sample MongoDB oplog looks like this:


{
  "op" : "i",
  "ns" : "test.student",
  "o" : {
    "_id" : "635b79e231d82a8ab1de863b",
    "name" : "Selena Miller",
    "roll_no" : 51,
    "is_graduated" : false,
    "date_of_birth" : "2000-01-30"
  }
}

The main fields in the oplog are:

op: This indicates the type of operation. It can be i (insert), u (update), d (delete), c (command), n (no operation). For this implementation, we’ll only care about insert, update and delete operations.

ns: This indicates the namespace. Namespace consists of database and collection name separated by a . In above case, database name is test and collection name is student.

o: This indicates the new data for insert or update operation. In above case, a student document is inserted in the collection.

The oplog contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those.

We have divided the problem statement into multiple stories. You’re supposed to implement the stories.

Story 1 (parsing insert oplog)

Parse the insert oplog JSON and convert that into equivalent SQL insert statement.

Here’s the mapping of MongoDB concepts to their equivalent relational database concepts

Database in MongoDB maps to schema in relational database

Collection in MongoDB maps to table in relational database

A single JSON document in MongoDB maps typically to a row in relational database.

Sample Input:


{
  "op": "i",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b",
    "name": "Selena Miller",
    "roll_no": 51,
    "is_graduated": false,
    "date_of_birth": "2000-01-30"
  }
}

Expected Output:


INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

You can assume that the above oplog is generated by following MongoDB command that inserts some data in student table.


use test;

db.student.insertOne(
	{
  	name: "Selena Miller", 
	  roll_no: 51, 
  	is_graduated: false, 
	  date_of_birth: "2000-01-30"
	}
);

Assumptions:

The i in op key stands for insert operation.

Write a function that accepts oplog JSON as input and returns the SQL statement as output. Write test case for verifying your program works as expected.

The actual MongoDB oplog also contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those. Hence, the above oplog contains only the fields which are relevant for our SQL conversion use case.

The ns key indicates a combination of db name and collection name. In above example, db name is test and collection name is student. Db name and collection name will be separated by a .

For simplicity, for now, assume that there is no nested JSON (arrays or objects) in MongoDB collection.

Expectations:

Your code should be generic enough to extract the db and collection name from ns field. It should also extract following types from JSON - string, boolean and number. In above example, name is a string variable, roll_no is a number variable and is_graduated is boolean. For now, you can treat date_of_birth as string (and not a date field type)

Since the JSON fields are unordered, the order of columns in your insert into statement may differ from that of the expected output. This is okay, as long as you are able to run the SQL statement in PostgreSQL without any errors. For that, you’ll need to create the test schema and student table first. You can do that manually as follows:


CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     name          VARCHAR(255),
     roll_no       FLOAT,
     is_graduated  BOOLEAN,
     date_of_birth VARCHAR(255)
  );

Later, in Story 4, we’ll ask you to modify the code to also generate the create table statement. For now, your program should only generate insert into statement.

Story 2 (parsing update oplog)

Parse the update oplog JSON and convert that into equivalent SQL update statement.

Sample Input (for setting new value to a field):


{
   "op": "u",
   "ns": "test.student",
   "o": {
      "$v": 2,
      "diff": {
         "u": {
            "is_graduated": true
         }
      }
   },
    "o2": {
      "_id": "635b79e231d82a8ab1de863b"
   }
}

Here, we are setting is_graduated to true for the same student collection.

Expected Output:


UPDATE test.student SET is_graduated = true WHERE _id = '635b79e231d82a8ab1de863b';

Sample Input (for un-setting value to a field):


{
   "op": "u",
   "ns": "test.student",
   "o": {
      "$v": 2,
      "diff": {
         "d": {
            "roll_no": false
         }
      }
   },
   "o2": {
      "_id": "635b79e231d82a8ab1de863b"
   }
}

Expected Output:


UPDATE test.student SET roll_no = NULL WHERE _id = '635b79e231d82a8ab1de863b';

Assumptions:

The u in op key stands for update operation.

The o field contains the update operation details. In this case, it includes the following subfields:

The $v field specifies the protocol version used for the update operation. In this case, the value is 2.
The diff field represents the changes being made to the document. In this case, it contains the following subfield:

u: modifies a field of a document and sets the value of a field in the document. (first example above)
d: removes a field from a document (second example above)

The key o2 represents the row identifier or the WHERE clause field in SQL

For simplicity, assume that the _id would always be the updation criteria

Assume that no new column will be added via this update operation for now.

Assume that the table and the data exist from before (which can be done manually).

Expectations:

You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the test schema and student table and also insert a row in student table (from first story).

Story 3 (parsing delete oplog)

Parse the delete oplog JSON and convert that into equivalent SQL delete statement.

Sample Input:


{
  "op": "d",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b"
  }
}

Expected Output:


DELETE FROM test.student WHERE _id = '635b79e231d82a8ab1de863b';

Assumptions:

The d in op key stands for delete.

The o key contains the _id of the field to be deleted

For simplicity, assume that the _id would always be the deletion criteria

Assume that the table exists from before (which can be done manually).

Expectations:

You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the test schema and student table and also insert a row in student table (from first story). When you run the delete statement from the output, the student row should be deleted successfully from PostgreSQL table.

Story 4 (create table with one oplog entry)

This story is the modification of Story 1. In this story, you’ll parse the same insert oplog JSON from Story 1 and convert it to equivalent SQL statements. However, you’ll also generate the create schema and create table statements along with insert into statement.

Features to implement in this story:

Generate CREATE SCHEMA SQL statement

Generate CREATE TABLE SQL statement

Generate INSERT INTO SQL statement

Sample Input:


{
  "op": "i",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b",
    "name": "Selena Miller",
    "roll_no": 51,
    "is_graduated": false,
    "date_of_birth": "2000-01-30"
  }
}

Expected output:


CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       FLOAT
  );

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

Assumptions:

In the above output, the create table statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.

Feel free to modify the code and tests written as part of Story 1. The input to Story 1 and 4 is same, but in the output, we now expect create schema and create table statements as well.

Expectations:

You should be able to run all the SQL statement generated by your program into PostgreSQL without any errors.

Story 5 (create table with multiple oplog entries)

Until now, we were handling only one oplog at a time. However, now we need to handle multiple insert oplogs for the same collection. For simplicity, let’s assume that there are no field changes across these two oplogs. The only thing that changes is the value of the fields.

As per previous story, create schema and create table statements are generated for every insert oplog. Now, we need to fix the issue where the create schema and create table statements are generated only once for each collection.

Sample Input:


[
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "635b79e231d82a8ab1de863b",
      "name": "Selena Miller",
      "roll_no": 51,
      "is_graduated": false,
      "date_of_birth": "2000-01-30"
    }
  },
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "14798c213f273a7ca2cf5174",
      "name": "George Smith",
      "roll_no": 21,
      "is_graduated": true,
      "date_of_birth": "2001-03-23"
    }
  }
]

In the input, there are two insert oplogs for the same database and collection. The only difference in the two oplogs is the values of JSON fields. The type and the number of fields are same for both oplogs.

Expected output:


CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       FLOAT
  );

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('14798c213f273a7ca2cf5174', '2001-03-23', true, 'George Smith', 21);

Assumptions:

In the above output, the create table statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.

Expectations:

The input to the program has changed from a single oplog JSON to an array of oplogs. Make sure your code is able to handle both.

You will have to modify the code and tests written as part of Story 4.

Note that both oplogs belong to the same database and collection, hence the create schema statement in SQL is generated only once.

Story 6 (alter table with multiple oplog entries)

The input for this story is very similar to Story 5 above. Except, in the second oplog, there’s a new field -phone. Your job is to generate an alter table statement and then generate an insert into statement for the second oplog.

Thus, you’ll have to generate SQL statements in the following order:

Generate CREATE SCHEMA SQL statement

Generate CREATE TABLE SQL statement

Generate INSERT INTO SQL statement

Generate ALTER TABLE SQL statement

Generate INSERT INTO SQL statement

Sample Input:


[
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "635b79e231d82a8ab1de863b",
      "name": "Selena Miller",
      "roll_no": 51,
      "is_graduated": false,
      "date_of_birth": "2000-01-30"
    }
  },
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "14798c213f273a7ca2cf5174",
      "name": "George Smith",
      "roll_no": 21,
      "is_graduated": true,
      "date_of_birth": "2001-03-23",
      "phone": "+91-81254966457"
    }
  }
]

Expected Output:


CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       FLOAT
  );

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

ALTER TABLE test.student ADD phone VARCHAR(255);

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, phone, roll_no) VALUES ('14798c213f273a7ca2cf5174', '2001-03-23', true, 'George Smith', '+91-81254966457', 21);

Expectation:

You will have to modify the code and tests written as part of Story 5.

Assume that there will never be an alter table case where the existing column’s data type is changed across two oplogs. For example, you will never have a case, where first oplog has roll_no as int and second oplog modifies roll_no field as string. You don’t need to handle this case.

Your code should also handle case of more than two oplogs for the same collection.

Your program should assign null values to columns for which the JSON fields are missing.

Currently, in the sample input, we are only considering an addition of one field (phone). However, your program should handle addition of any number of new fields and generate those many number of alter table statements.

Story 7 (handle nested Mongo documents)

So far, we have handled simple JSON documents in MongoDB. In this story, you will have to handle parsing of oplogs of nested JSON objects.

Sample Input:


{
  "op": "i",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b",
    "name": "Selena Miller",
    "roll_no": 51,
    "is_graduated": false,
    "date_of_birth": "2000-01-30",
    "address": [
      {
        "line1": "481 Harborsburgh",
        "zip": "89799"
      },
      {
        "line1": "329 Flatside",
        "zip": "80872"
      }
    ],
    "phone": {
      "personal": "7678456640",
      "work": "8130097989"
    }
  }
}

In the input, we have phone, which is a single nested JSON object and we have address, which is an array of nested JSON objects. Let’s see how this is represented in the expected SQL below.

Expected Output:


CREATE SCHEMA test;

-- create the student table
CREATE TABLE test.student (_id VARCHAR(255) PRIMARY KEY, date_of_birth VARCHAR(255), is_graduated BOOLEAN, name VARCHAR(255), roll_no FLOAT);

-- insert a row in student table
INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

-- create the student_address table
CREATE TABLE test.student_address (_id VARCHAR(255) PRIMARY KEY, student__id VARCHAR(255), line1 VARCHAR(255), zip VARCHAR(255));

-- insert two rows in the student_address table (use reference of student__id)
INSERT INTO test.student_address (_id, line1, student__id, zip) VALUES ('64798c213f273a7ca2cf516e', '481 Harborsburgh', '635b79e231d82a8ab1de863b', '89799');
INSERT INTO test.student_address (_id, line1, student__id, zip) VALUES ('14798c213f273a7ca2cf5174', '329 Flatside', '635b79e231d82a8ab1de863b', '80872');

-- create student_phone table
CREATE TABLE test.student_phone (_id VARCHAR(255) PRIMARY KEY, student__id VARCHAR(255), personal VARCHAR(255), work VARCHAR(255));

-- insert a row in student_phone table (use reference of student__id)
INSERT INTO test.student_phone (_id, personal, student__id, work) VALUES ('14798c213f273a7ca2cf5199', '7678456640', '635b79e231d82a8ab1de863b', '8130097989');

Since we are dealing with relational data, we will have to create multiple tables with foreign key references. For simplicity, we will not create actual foreign key constraints in any of the tables.

We will create the separate tables for nested objects (i.e. address and phone), then insert records in those tables. Note the use of student__id column in the address and phone tables. This is the soft-foreign key to the student table’s id column. Once all the records are inserted in address and phone tables, we’ll insert records in the main parent student table.

Since we are not creating foreign key constraints and reference in the database, the order in which the tables are created and the order in which the records are inserted in those tables does not matter.

Assumptions:

For simplicity, assume nesting of JSON documents only at the top level.

Assume that_id is a varchar always.

We will not deal with the auto generation of _id. It will be a varchar, and will be a randomly generated UUID for foreign tables. _id for main table comes from the _id field of mongo oplog itself.

_id of foreign table needs to be created by the program automatically. This value then must be used for any reference to other related tables.

All the foreign tables/associated table will have the reference of primary key from main table. And the primary table will not have any referencing key from associated tables.

There’s no need to create actual foreign key references in SQL. The SQL JOINS will happen just by soft-references (i.e. values in referenced columns), not via actual foreign key constraints in database.

Expectations:

You’ll have to create associated tables with primary key as _id which will be a randomly generated UUID

You’ll have to create soft-foreign keys accordingly, so that all the details of employees can be fetched using JOINS.

The SQL should be generated such that it can be run on any relational DB without any modifications. For example, note the order of create table above. We generate the phone table first, and then the employees table. The phone table has reference to the employee table using employee__id column.

For simplicity, for a single nested object (e.g. phone), always generate a 1:M relationship. Note that the phone table has a foreign key reference to employee table using employee__id.

Story 8 (reading oplogs from a file)

In the above stories, we have parsed different types of MongoDB oplogs and generated equivalent SQL statements. Now, let’s modify our program to parse multiple oplogs by reading them from a file. Later, we’ll update the program to read directly from MongoDB’s oplog collection. For this story, we have to read oplogs, one at a time, from a JSON file containing MongoDB oplogs and convert them into equivalent SQL statements. You can use the example-input.json file as the input file.

The program should write the generated SQL statements in an output file, say output.sql. You can compare your output file with example-output.sql to verify your program.

Assumptions:

Your program should accept the filename containing the oplogs as an argument.

The program should also accept the output filename. The program should create this file and write all SQL output to that file.

Expectations:

Make sure while reading the file we process one oplog at a time. This will be helpful to you in the next story where you would be getting data infinitely in a stream.

Write the generated SQL statements in a file. You can compare your output file with our example-output.sql.

Story 9 (reading oplogs from MongoDB)

In story 8, we are reading MongoDB oplogs from a JSON file. Now, we expect you to read the oplogs directly from MongoDB and generate equivalent SQL statements. The program should connect to MongoDB, read and parse the oplogs one at a time, convert them into SQL statements and finally, execute these SQL statements directyl on a relational database (PostgreSQL, in our case).

Assumptions:

The program should read oplogs from MongoDB, convert those into equivalent SQL statements (DDL and DML statements), and execute these statements on PostgreSQL.

Expectations:

The program should process one oplog at a time. It should work in a streaming fashion, where it reads a single oplog entry from MongoDB, converts that oplog entry into multiple SQL statements and executes these statements on PostgreSQL.

The program should terminate only after receiving a SIGTERM signal (Ctrl + c). If there are no oplogs to read, the program should be waiting for more oplogs (it should not terminate). In other words, this program should run as a service.

Write the program such that it can handle input from either an oplog file or directly from MongoDB. Also, the program should be able to send output to an SQL file or execute the SQL statements directly on PostgreSQL.

JSON File (input) → SQL File (output)

JSON File (input) → Relational database (output)

MongoDB (input) → SQL File (output)

MongoDB (input) → Relational database (output)

Story 10 (Bookmarking Support - nice to have)

In Story 9, we successfully implemented the functionality to read MongoDB oplogs directly from a live MongoDB instance and generate equivalent SQL statements. However, one critical aspect of the parser's functionality is missing: bookmarking support. Bookmarking ensures that the parser can keep track of the last processed oplog, allowing it to resume processing from the correct point after a restart. This enhancement will prevent duplicate oplogs from being processed, maintaining data consistency, and improving the overall reliability of the parser.

Assumptions:

Bookmarking support will be added to the parser to keep track of the last processed oplog.

This mechanism will enable the parser to store the position of the last processed oplog, ensuring it can pick up from the correct point after a restart.

By using bookmarking, the parser will avoid reprocessing oplogs that have already been successfully executed, thus preventing duplicate data insertion and updates.

Expectations:

To achieve the objective of adding bookmarking support, we expect the following:

The program should implement a bookmarking mechanism to store the position of the last processed oplog. This information will be crucial for resuming processing from the correct point in the MongoDB oplog after a restart.

With bookmarking support, the parser should maintain data consistency by avoiding duplicate data processing, even in the case of unexpected interruptions or system restarts.

In the event of a program restart, the parser should utilize the bookmarked information to continue processing oplogs from where it left off, ensuring a seamless data migration process.

By adding bookmarking support, the parser will be better equipped to handle real-world scenarios, ensuring data integrity and reliability during data migration from MongoDB to PostgreSQL.

Story 11 (Distributed Execution - nice to have)

It would be beneficial to implement support for running the parser in a distributed manner across multiple machines. As data volume increase, there is a need to distribute the parser's execution across multiple machines to achieve higher performance and scalability. The addition of distributed execution will enable the parser to efficiently handle large datasets and effectively utilize available resources.

Overall Instructions

Create a new Git repo in the language of your choice (Java, Go, etc).

Commit and push code + tests for each of the stories.

Implement graceful shutdown in the program. Handle interrupt signals (e.g. Ctrl+C) to initiate a clean shutdown of the parser and close all external connections to MongoDB and PostgreSQL.

Feel free to make suitable assumptions if something is unclear. Document your assumptions and reasons behind them in the readme.