How does a successful migration to an open-source stack happen?

Picture this: a proprietary data-processing suite, the “Tools”, is continuously providing business value, though it’s a nightmare to maintain.

To make matters worse, it comes with expensive license terms, costing you seven hundred grand a year.

You ask your team how many jobs there are to migrate. They say 439 jobs, to be precise. You bring in a really expert coder, and she is able to write a converter that handles 80-90% of the code conversion from the “Tools” to Apache Spark, which is a bit of a saviour. You still need the team to validate that the converted code really does the right thing and to add the missing code, so that you can prove you have something with which you can truly turn off the current job runs.

DataQ came in handy at exactly that point. Developers were able to change the Spark code by looking at the results they wanted to achieve, making incremental changes until the code arrived at the same result the “Tools” were producing. You then run both in parallel for a month, during which DataQ keeps comparing the outputs and notifies you of any differences. Only then do you turn off the existing jobs.
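DataQ’s internals aren’t the point here, but the parallel-run check is easy to picture. Below is a minimal sketch in Spark/Scala of the kind of comparison involved, assuming the legacy job and the new Spark job both land their output in tables we can read; the table names are made up for illustration, and this is not DataQ’s actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object ParallelRunCompare {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-run-compare").getOrCreate()

    // Outputs of the legacy "Tools" job and the rewritten Spark job,
    // produced in parallel for the same batch. Table names are hypothetical.
    val legacy  = spark.read.table("legacy_output")
    val rewrite = spark.read.table("spark_output")

    // Rows present in one output but not the other (exact, column-by-column).
    val onlyInLegacy  = legacy.exceptAll(rewrite)
    val onlyInRewrite = rewrite.exceptAll(legacy)

    println(s"rows only in legacy  : ${onlyInLegacy.count()}")
    println(s"rows only in rewrite : ${onlyInRewrite.count()}")

    // Zero on both sides, run after run for a whole month, is the signal
    // that the old job can finally be switched off.
  }
}
```

In practice you would persist these counts for every run so that the month-long comparison leaves an audit trail rather than a one-off spot check.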

This is a real success story, and that’s how you migrate from money guzzlers to an open-source, money-saving stack.

How we saved a ton of effort on data pipeline code migration!

We worked on a Medical Records Analytics project, processing a monthly batch of data that arrived as CCDA files: very complex XMLs. You don’t want to analyze these in traditional Excel, because even parsing the file format takes a proprietary parser. We used many complex parsing techniques in Java and stored the cleansed, processed data in a good old Oracle database.

A typical monthly batch would have records for 40 million patients. That means 40 million such files to process, and our batch would take between 8 and 10 days to complete. Same story every month!

One fine day we were done supporting this beast and decided to move to Spark processing and HDFS storage. The whole idea rested on developers being able to efficiently and accurately convert the Java and PL/SQL code to Spark/Scala. “And don’t forget to check that you get the same output, and obviously the batch should finish within a day!” said the Project Manager.
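The real conversion was far more involved, but for a flavour of the target stack, here is a minimal sketch of reading CCDA-style XML into a Spark DataFrame. It assumes the third-party spark-xml package is on the classpath, uses ClinicalDocument as the row tag, and points at a made-up HDFS path; none of this is our actual production code.

```scala
import org.apache.spark.sql.SparkSession

object CcdaIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ccda-ingest").getOrCreate()

    // Requires the spark-xml package (com.databricks:spark-xml) on the classpath.
    // Each CCDA file has a ClinicalDocument root element and becomes one row.
    val ccda = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "ClinicalDocument")
      .load("hdfs:///landing/ccda/*.xml")   // hypothetical landing path

    // The nested XML structure is inferred as nested columns, which can then
    // be flattened with select/explode into the cleansed tables downstream.
    ccda.printSchema()
  }
}
```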

We started the code conversion drill and soon realized that the only way to do it was iteratively. We, the developers, needed something to tell us that we were making the right impact on the data with every nudge and tweak. That’s why we built DataQ DataCompare.

After every code change, we would use the data difference report to see how far we had marched towards converting all the pipelines. The scrum master and project manager had actual numbers to report without reaching out for our help.
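Conceptually, the number we reported boils down to something simple. Here is a rough sketch of a per-pipeline match rate, assuming both outputs can be loaded as DataFrames with identical schemas; it illustrates the idea rather than DataQ DataCompare’s actual API.

```scala
import org.apache.spark.sql.DataFrame

object ConversionProgress {
  /** Fraction of legacy rows reproduced exactly by the converted pipeline. */
  def matchRate(legacy: DataFrame, converted: DataFrame): Double = {
    val total      = legacy.count()
    val mismatched = legacy.exceptAll(converted).count()  // legacy rows not reproduced
    if (total == 0) 1.0 else (total - mismatched).toDouble / total
  }
}
```

A match rate trending towards 100% for each pipeline is exactly the kind of number a scrum master can track without pulling a developer aside.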

At the end of it all, we converted the whole codebase and released it to production within four months, and guess what: the same monthly batch now takes just over 7 hours to complete. It was a great Christmas party to have after achieving that kind of feat.

Before I forget to tell you, DataQ Inc was born!

Data Fit for Purpose – Comparative Analysis of Data

Data profiling means examining the data in an existing data source and collecting statistics and summaries about it.

This activity helps you:

  • a. Find out the possible purposes of the data
  • b. Accurately tag data for discoverability
  • c. Know the expectations on quality
  • d. Know the expectations and challenges of joining the data with other data sources
  • e. Understand the data challenges for Master Data Management and Data Governance, and improve data quality

When to Profile?

If you are convinced of the value of profiling data, you need to time it well within your development cycles to get optimal results. A few recommendations:

– Do it several times during the requirements and development phases of a DWH project.

– Definitely profile just after identifying any new source of data.

– More intense profiling is needed to support dimensional data modeling.

– Intense profiling is also needed for the staging areas and for the target tables being populated.

How are you doing it?

Current approaches are mainly manual; depending on what you are comfortable with, you may:

– use Python/R data frames to get the data profile metric values (see the sketch below).

– use custom ETL to get the values for your data.

– use your skills in Excel to get such summaries from data that is small enough to fit in a spreadsheet.

– use specialized tools to get the profile values, store them for future reference, compare them with different data sources, and track how the profile changes over time.

The last approach has the benefit of getting more than just the basic profile information from the data.
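As a concrete illustration of the data-frame approach, here is a minimal profiling sketch using Spark DataFrames (the same idea works with pandas or R data frames); the table name is made up, and a real profiler would compute and store far more than this.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, sum, when}

object BasicProfile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("basic-profile").getOrCreate()
    val df = spark.read.table("claims")   // hypothetical source table

    // Built-in summary: count, mean, stddev, min, max per numeric/string column.
    df.describe().show()

    // Null count and distinct count per column, gathered in a single pass --
    // the kind of summary you would store and compare across sources or over time.
    val aggs = df.columns.flatMap { c =>
      Seq(
        sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
        countDistinct(col(c)).as(s"${c}_distinct")
      )
    }
    df.agg(aggs.head, aggs.tail: _*).show(truncate = false)
  }
}
```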

DataQ’s profiler gives you both basic and advanced data profile information, plus the ability to compare data profiles across data sources and across time, streamline data requirements, support decision making during development, and align with your organization’s data governance goals.