However, you may no longer need to call on the services of Severus Snape or Mystic Meg to get a workable estimate for data quality profiling. My colleague from QFire Software, Neil Currie, recently put me onto a post by David Loshin on SearchDataManagement.com, which proposes a more structured and rational approach to estimating data quality work effort.
At first glance, the overall methodology that David proposes is reasonable for estimating effort on a pure profiling exercise, at least in principle. (It's analogous to the "bottom-up" calculations I've used in the past to estimate ETL development on a job-by-job basis, or the creation of standard Business Intelligence reports on a report-by-report basis.)
I would observe that David's approach is predicated on the (big and probably optimistic) assumption that we're only doing the profiling step. The follow-on stages of analysis, remediation and prevention are excluded – and in my experience, that's where the real work most often lies! There is also the assumption that a checklist of assessment criteria already exists – and developing the library of quality check criteria can be a significant exercise in its own right.
- 10 mins: for each "Simple" item (standard format, no applied business rules, fewer than 100 member records)
- 30 mins: for each "Medium" complexity item (unusual formats, some embedded business logic, data sets of up to 1,000 member records)
- 60 mins: for each "Hard" high-complexity item (significant or complex business logic, data sets over 1,000 member records)
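The per-item estimates above reduce to simple arithmetic, so a quick sketch can turn an item inventory into a total effort figure. This is only an illustration of the rule of thumb: the complexity labels and the `profiling_effort` helper are my own naming, not anything from David's post.

```python
# Per-item effort figures (minutes) from the rules of thumb above.
EFFORT_MINUTES = {"simple": 10, "medium": 30, "hard": 60}

def profiling_effort(counts):
    """Total profiling effort in hours for a dict of {complexity: item_count}."""
    total_minutes = sum(EFFORT_MINUTES[level] * n for level, n in counts.items())
    return total_minutes / 60

# e.g. an inventory of 40 simple, 15 medium and 5 hard items:
hours = profiling_effort({"simple": 40, "medium": 15, "hard": 5})
print(f"{hours:.1f} hours")  # 400 + 450 + 300 = 1150 mins, i.e. about 19.2 hours
```

Even on a toy inventory like this, the point stands: the profiling pass itself is measured in days, not the weeks that the follow-on analysis and remediation can consume.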
How much socialisation? That depends on the number of stakeholders, and their nature. As a rule-of-thumb, I'd suggest the following:
- Two hours of preparation per workshop (if the stakeholder group is "tame"; double it if there are participants who are negatively inclined)
- One hour of face time per workshop (again, double it for "negatives")
- One hour post-workshop write-up time per workshop
- One workshop per 10 stakeholders.
- Two days to prepare any final papers and recommendations, and present to the Steering Group/Project Board.
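The socialisation rules of thumb above can likewise be folded into a small estimator. A minimal sketch, with two assumptions of mine: a working day is 8 hours (so "two days" of final papers becomes 16 hours), and a single `hostile` flag stands in for "participants who are negatively inclined".

```python
import math

def socialisation_effort_hours(stakeholders, hostile=False):
    """Estimate socialisation effort in hours, per the rules of thumb above.

    One workshop per 10 stakeholders (rounded up); 2h prep and 1h face time
    per workshop, both doubled for a hostile group; 1h write-up per workshop;
    plus 2 days (assumed 16h) for final papers and the Steering Group session.
    """
    workshops = math.ceil(stakeholders / 10)
    prep = 2 * (2 if hostile else 1)
    face = 1 * (2 if hostile else 1)
    write_up = 1
    return workshops * (prep + face + write_up) + 16

print(socialisation_effort_hours(25))                # 3 workshops x 4h + 16h = 28
print(socialisation_effort_hours(25, hostile=True))  # 3 workshops x 7h + 16h = 37
```

Note how quickly the "negatives" surcharge compounds: a hostile audience adds an hour of prep and an hour of face time to every workshop, which is exactly why it pays to size up the stakeholder group before quoting an estimate.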
Detailed root-cause analysis (Validate), remediation (Protect) and ongoing evaluation (Monitor) stages are a whole other ball-game.