We worked on a Medical Records Analytics project, processing a monthly batch of data that arrived as CCDA files, a very complex XML format. You can't analyze these with a traditional tool like Excel; even parsing the file format calls for a specialized parser. We used a range of complex parsing techniques in Java and stored the cleansed, processed data in a good old Oracle database.
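To give a flavor of what that parsing looks like, here is a minimal sketch in Java using the standard DOM and XPath APIs. The sample document is a hypothetical, heavily simplified CCDA-like fragment of my own making; real CCDA files are vastly larger and use the HL7 CDA namespace, templates, and coded entries, which is exactly why a plain spreadsheet cannot touch them.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class CcdaSketch {
    // Hypothetical, drastically simplified CCDA-like fragment.
    // Real documents carry namespaces, template IDs, and coded sections.
    static final String SAMPLE =
        "<ClinicalDocument>" +
        "<recordTarget><patientRole>" +
        "<patient><name><given>Jane</given><family>Doe</family></name></patient>" +
        "</patientRole></recordTarget>" +
        "</ClinicalDocument>";

    // Pull one field out of the document tree with an XPath expression.
    public static String familyName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(
            "/ClinicalDocument/recordTarget/patientRole/patient/name/family", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(familyName(SAMPLE)); // prints "Doe"
    }
}
```

Multiply that one lookup by hundreds of fields per document and you get a sense of the extraction pipeline.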
A typical monthly batch held records for 40 million patients. That meant 40 million such files to process, and the batch would take 8 to 10 days to complete. Same story every month!
One fine day we had had enough of supporting this beast, and we decided to make way for Spark processing and HDFS storage. The whole plan rested on developers being able to convert the Java and PL/SQL code to Spark/Scala code efficiently and accurately. "And don't forget to check that you get the same output, and obviously the batch has to finish within a day!" said the Project Manager.
Nevertheless, we started the code conversion drill and soon realized that the only way to do it was iteratively. We developers needed something to tell us that every nudge and tweak we made was having the right impact on the data. That's why we built DataQ DataCompare.
After every code change, we would use the data difference report to see how far we had progressed toward converting all the pipelines. The scrum master and project manager had actual numbers to report without needing to come to us for help.
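The core idea behind such a difference report can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions, not DataQ DataCompare's actual design: it keys each record by an ID, compares the legacy output against the migrated output, and tallies matches, mismatches, and records present on only one side.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DataCompareSketch {
    // Compare two pipeline outputs keyed by record ID.
    // Names and categories here are illustrative assumptions.
    public static Map<String, Long> compare(Map<String, String> legacy,
                                            Map<String, String> migrated) {
        long matched = 0, mismatched = 0, missingInNew = 0, extraInNew = 0;
        for (Map.Entry<String, String> e : legacy.entrySet()) {
            String other = migrated.get(e.getKey());
            if (other == null) missingInNew++;          // dropped by the new pipeline
            else if (other.equals(e.getValue())) matched++;
            else mismatched++;                          // same key, different values
        }
        for (String key : migrated.keySet())
            if (!legacy.containsKey(key)) extraInNew++; // invented by the new pipeline
        Map<String, Long> report = new LinkedHashMap<>();
        report.put("matched", matched);
        report.put("mismatched", mismatched);
        report.put("missingInNew", missingInNew);
        report.put("extraInNew", extraInNew);
        return report;
    }
}
```

With every tweak to the converted code, the goal is simply to drive "matched" toward 100% of the records; those counts are the numbers management could report without asking us.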
By the end of it all, we had converted the whole codebase and released it to production within four months, and guess what: the same monthly batch now takes just over 7 hours to complete. That made for a great Christmas party after a feat like this.
And before I forget to mention it: that is how DataQ Inc was born!