At NetBase, we have been thinking about building a common deduplication (“dedup”) service for a long time. When we recently needed to add a large amount of data to our real-time index pipeline without duplicates, we began to design a service that could fulfill deduplication requirements at various points in our system. To successfully do this, we needed:

  • A place to store fingerprints, which we use to identify data uniqueness
  • A fast way to check for the existence of a specific fingerprint

After some research and testing, we found that Cassandra 2.0 would be our best bet. The new Insert-If-Not-Exists feature in Cassandra 2.0 makes it easier and more efficient to store fingerprints and check for the existence of a specific fingerprint. However, we ran into a problem.
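To make the semantics concrete, here is a minimal sketch of what Insert-If-Not-Exists gives us, with an in-memory dict standing in for the Cassandra table. In real CQL this is `INSERT ... IF NOT EXISTS`, which reports back whether the row was actually applied; the names below are hypothetical.

```python
# Sketch of Cassandra 2.0's conditional insert, with an in-memory dict
# standing in for the fingerprint table. The real statement would be CQL:
#   INSERT INTO fingerprints (fp) VALUES (?) IF NOT EXISTS
# which returns an [applied] column telling us whether the row was new.

fingerprints = {}  # fp -> owner; stand-in for the Cassandra table


def insert_if_not_exists(fp, owner):
    """Return True if fp was newly inserted (the document is unique)."""
    if fp in fingerprints:
        return False          # duplicate: the insert is not applied
    fingerprints[fp] = owner  # first writer wins
    return True


print(insert_if_not_exists("doc-a1", "client-1"))  # unique -> True
print(insert_if_not_exists("doc-a1", "client-2"))  # duplicate -> False
```

The single call both records the fingerprint and answers the existence question, which is exactly why it looked like a one-call dedup primitive to us.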


Leveraging the Insert-If-Not-Exists feature for deduping sounded like a piece of cake. We expected to make one call to Cassandra’s API and get the deduplicated result. We discovered, however, that calling the API to check for the existence of a specific fingerprint was not the only task. For example, to process raw data, the dedup tool must do the following:

  1. Get raw data from the raw data storage
  2. Call the dedup service to perform deduplication
  3. Save the deduped data into another new file

To handle the large volume of documents found in raw data storage, we anticipated needing a way to save documents temporarily. Storing them on disk was one option. So between steps 1 and 2, we might have an extra step for saving all found documents. And between steps 2 and 3, we might need another step to store documents on disk before our pipeline scripts can pick them up. This introduces more failure points in the larger process.

Remember the feature we selected to perform deduplication from Cassandra? The Insert-If-Not-Exists feature.

One thing to be aware of with an operation like this is that the fingerprint is inserted into storage before anything else happens. That means subsequent procedures can fail after the data has already been inserted, leaving a fingerprint in Cassandra for a document that never made it into the pipeline. The ideal solution would be to wrap our own procedures inside a Cassandra-owned transaction. However, that is not currently an option.

Another way to do this is to keep track of the fingerprints we are working on during the deduplication process. That way, if a failure occurs and the API client dies, we can identify which fingerprints were being processed and get a chance to make things right.


The entire process is shown in the diagram below.


The idea behind this process is to use a file as the transaction unit. After the dedup tool finds matching documents, it saves them to disk as a bunch of files. The tool then starts navigating these files. Every file has its own transaction lifetime. After the tool picks a file, it starts to read every document one-by-one. The tool performs the following steps, which are illustrated in the diagram above.

  1. Read one document from the source raw file.
  2. Register the fingerprint of the document into the Processing Record, with one Processing Record for one transaction. The Processing Record could be stored in a file, a database, or anywhere as long as it persists when the tool is killed.
  3. Call the Cassandra API to check the existence of the fingerprint by using Insert-If-Not-Exists.
  4. If the insert succeeds, the tool puts the document into a temporary file. If it fails, the fingerprint already exists, so the document is a duplicate and is simply skipped. Either way, the tool goes back to step 1 and picks up the next document. As you can see from the above diagram, the dotted-line rectangle illustrates where the transaction begins and where it ends, which is also the scope of the loop over each document.
  5. After the tool is done with all documents from the source raw file, it renames the temporary file generated at step 4. Then the file is ready for our pipeline scripts to pick it up. The entire transaction for this file is considered done.
  6. Since the transaction is done, the tool must erase all fingerprints in the Processing Record so these documents will not be processed again in the recovery procedure.
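The six steps above can be sketched end-to-end. Everything here is a hypothetical stand-in: `conditional_insert` plays the role of Cassandra’s Insert-If-Not-Exists, an in-memory dict plays the Processing Record (which in reality must persist across crashes), and fingerprints are plain MD5 hashes.

```python
# Sketch of the per-file transaction (steps 1-6). All names are
# illustrative; the real Processing Record must survive a crash.
import hashlib
import os

store = set()           # stand-in for the Cassandra fingerprint table
processing_record = {}  # txn_id -> list of fingerprints (must persist)


def conditional_insert(fp):
    """Stand-in for Cassandra's Insert-If-Not-Exists."""
    if fp in store:
        return False
    store.add(fp)
    return True


def dedup_file(src_path, txn_id):
    record = processing_record.setdefault(txn_id, [])
    tmp_path = src_path + ".tmp"
    with open(src_path) as src, open(tmp_path, "w") as tmp:
        for doc in src:                               # step 1: read a document
            fp = hashlib.md5(doc.encode()).hexdigest()
            record.append(fp)                         # step 2: register fingerprint
            if conditional_insert(fp):                # step 3: Insert-If-Not-Exists
                tmp.write(doc)                        # step 4: keep unique docs only
    os.rename(tmp_path, src_path + ".done")           # step 5: publish finished file
    del processing_record[txn_id]                     # step 6: clear the record
```

The rename in step 5 is the commit point: the pipeline only ever sees fully written files, and everything before the rename can be rolled back from the Processing Record.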

Data Recovery

With the Processing Record in hand, we needed to determine what recovery procedures might be necessary when the tool comes back online after a disaster. Whenever the tool starts, it checks for any remaining Processing Record. Under these circumstances, there are only four possibilities:

  1. There is NO remaining Processing Record and NO finished raw file. Perfect, the transaction does not even exist.
  2. There is NO remaining Processing Record but the finished raw file is ready for pick-up. Good, the transaction was done perfectly.
  3. There are both the remaining Processing Record and the finished raw file. Well, step 6 could have failed, but the transaction itself completed. We can simply erase the Processing Record.
  4. There is a remaining Processing Record but NO finished raw file. This would be the most troublesome situation; the transaction did not complete and we would have to recover everything we did during the transaction, including:
    1. Erasing all fingerprints stored in Cassandra according to what we have in the Processing Record
    2. Erasing the Processing Record
    3. Removing any temporary NOD file
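This startup check can be sketched as a small decision function covering the four cases. As before, the names are hypothetical: a set stands in for the Cassandra fingerprint table, a dict stands in for the persisted Processing Record, and a `.done` file is the finished output.

```python
# Sketch of the recovery check run at startup, covering the four cases.
import os


def recover(txn_id, done_path, processing_record, store):
    has_record = txn_id in processing_record
    has_done = os.path.exists(done_path)
    if not has_record and not has_done:
        return "no transaction"             # case 1: nothing ever happened
    if not has_record and has_done:
        return "transaction finished"       # case 2: completed cleanly
    if has_record and has_done:
        del processing_record[txn_id]       # case 3: only step 6 failed
        return "record erased"
    # case 4: roll back everything the transaction did
    for fp in processing_record[txn_id]:
        store.discard(fp)                   # erase fingerprints in "Cassandra"
    del processing_record[txn_id]           # erase the Processing Record
    tmp_path = done_path.replace(".done", ".tmp")
    if os.path.exists(tmp_path):
        os.remove(tmp_path)                 # remove any temporary file
    return "rolled back"
```

Only case 4 does real work; the other three either confirm a clean state or clean up a leftover record.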

After the recovery procedure, we can let the dedup tool continue its deduplication mission.

Solving the Recovery Conflict

Happy ending? No, the story is not over yet.

What if multiple dedup tools are dealing with the same document and they all die suddenly? Since we are providing a general dedup service, it is very likely that there will be multiple service users. Say one document is being handled by two users: one succeeds in inserting its fingerprint and the other does not. Now suppose the user that failed to insert the fingerprint is accidentally killed. When it comes back online, it checks the remaining Processing Record and finds an unfinished transaction. It then goes through the whole recovery process, erases the fingerprint from Cassandra, and performs the deduplication again. We end up with two copies of the same document in the index.

What’s the problem here? We only use one key, the fingerprint, to check for duplication. It doesn’t tell us who actually inserted the key, so a record inserted by the successful user can accidentally be removed during recovery by the one that failed.

The solution is to introduce another key that identifies the client that inserted the record: the ID of the transaction. So basically, whenever we insert a fingerprint into either the Processing Record or Cassandra, we need to provide the transaction ID as well. A record in Cassandra can only be removed when both the fingerprint and the transaction ID match, so only the user that owns the original transaction gets to recover it. Problem solved!
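A sketch of the fix, again with a dict standing in for the Cassandra table, now mapping each fingerprint to the transaction ID that inserted it. In real CQL this removal would be a conditional delete (Cassandra 2.0’s lightweight transactions also support `DELETE ... IF` conditions); all names here are hypothetical.

```python
# Fingerprints are now keyed by the transaction that inserted them, so
# recovery can only remove records that belong to the failed transaction.
# In real CQL the removal would look like:
#   DELETE FROM fingerprints WHERE fp = ? IF txn_id = ?

fingerprints = {}  # fp -> txn_id of the client that inserted it


def insert_if_not_exists(fp, txn_id):
    if fp in fingerprints:
        return False
    fingerprints[fp] = txn_id
    return True


def delete_if_owner(fp, txn_id):
    """Remove fp only if this transaction inserted it."""
    if fingerprints.get(fp) == txn_id:
        del fingerprints[fp]
        return True
    return False


# Two clients race on the same document; txn-2 loses, crashes, and recovers.
insert_if_not_exists("doc-x", "txn-1")      # txn-1 wins
insert_if_not_exists("doc-x", "txn-2")      # txn-2 loses
print(delete_if_owner("doc-x", "txn-2"))    # recovery is a no-op -> False
print("doc-x" in fingerprints)              # txn-1's record survives -> True
```

The failed client’s rollback now leaves the winning client’s fingerprint untouched, so the document stays deduplicated.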

Things to Keep in Mind

Here we only provide a basic outline for how to solve the data recovery problem when you use Cassandra’s Insert-If-Not-Exists feature to deduplicate data. One of the key things to remember is how to determine the success of a transaction. The definition of a successful transaction might vary across use cases; in ours, success means that the target raw file exists. It is very important to find a reliable way to determine this, because all recovery procedures depend on it.

Do you have data recovery thoughts or questions to share? Reach out and let us know.
