Document Duplication Detection and Reporting

You may need to identify duplicate documents across your enterprise, and Simflofy lets you do this in a variety of ways.

One way to identify duplicate documents is by using the Duplication Check Job Task, which allows you to log, skip, or fail documents that are duplicates. This works well for large-scale integrations that combine a number of legacy source systems into one new enterprise content management system.

Another way to identify duplicate documents is to leverage Simflofy's Reporting Output Connector. With the reporting output connector you can read content from any source system Simflofy supports and report on what is found, including a hash of each document seen. Using this hash together with MongoDB's aggregation framework, you can generate a CSV or JSON report of all duplicate records. To produce the hash for each document, include the Hash Generator Job Task in your job's tasks.
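For reference, the aggregation below relies on three fields written to the tsRecordProcessed collection: docHash, doc_id, and doc_name. A record might look something like this (the values are illustrative, and your records may contain additional fields depending on your job configuration):

db.tsRecordProcessed.findOne()
{
    "_id" : ObjectId("64a1f0c2e4b0a1b2c3d4e5f6"),
    "doc_id" : "doc-0001",
    "doc_name" : "invoice-march.pdf",
    "docHash" : "9e107d9d372bb6826bd81d3542a4419d"
}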

After crawling your source system and outputting to the Simflofy Reporting Connector, you can run the following commands against MongoDB. Start by typing mongo in your terminal.

If the crawl found a large number of documents (more than 100,000 or so), you may want to add an index on docHash before running the aggregation:

db.tsRecordProcessed.createIndex( { docHash: 1 } )
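If you want to confirm the index was created, list the collection's indexes:

db.tsRecordProcessed.getIndexes()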

Next, group the records by docHash and output the results to a new collection named duplicates (you can name the output collection anything you like):

db.tsRecordProcessed.aggregate([
    { $group: { _id: "$docHash", docs: { $push: "$doc_id" }, doc_names: { $push: "$doc_name" } } },
    { $project: { docs: 1, doc_names: 1, numDocs: { $size: "$docs" } } },
    { $match: { numDocs: { $gt: 1 } } },
    { $out: "duplicates" }
])
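Before exporting, you can spot-check the results from the mongo shell. For example:

// Number of hashes shared by more than one document
db.duplicates.count()

// Inspect a few duplicate groups
db.duplicates.find().limit(5).pretty()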

You can now export the duplicates collection to CSV or JSON using mongoexport:

mongoexport --db simflofy --collection duplicates --fields _id,docs,doc_names \
    --username user --password "pass" --type=csv --out duplicates.csv
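If you prefer JSON, omit --type=csv (JSON is mongoexport's default format), in which case the --fields list is optional:

mongoexport --db simflofy --collection duplicates \
    --username user --password "pass" --out duplicates.json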

Related Articles:
Simflofy Integration Jobs
Adding Tasks To Integration Jobs