"An extraordinary thinker and strategist" "Great knowledge and a wealth of experience" "Informative and entertaining as always" "Captivating!" "Very relevant information" "10 out of 7 actually!" "In my over 20 years in the Analytics and Information Management space I believe Alan is the best and most complete practitioner I have worked with" "Surprisingly entertaining..." "Extremely eloquent, knowledgeable and great at joining the topics and themes between presentations" "Informative, dynamic and engaging" "I'd work with Alan even if I didn't enjoy it so much." "The quintessential information and data management practitioner – passionate, evangelistic, experienced, intelligent, and knowledgeable" "The best knowledgeable, enthusiastic and committed problem solver I have ever worked with" "His passion and depth of knowledge in Information Management Strategy and Governance is infectious" "Feed him your most critical strategic challenges. They are his breakfast." "A rare gem - a pleasure to work with."

Thursday 19 December 2013

The Twelve (Data) Days of Christmas

In the spirit of the season, here’s a light-hearted take on some of the challenges you might be facing if you're dealing with increased volumes of data in the run-up to Christmas…

On the first day of Christmas, my true love sent to me a ragged hierarchy.

On the second day of Christmas, my true love sent to me two missing files and a ragged hierarchy.

On the third day of Christmas, my true love sent to me three null values, two missing files and a ragged hierarchy.

On the fourth day of Christmas, my true love sent to me four audit logs, three null values, two missing files and a ragged hierarchy.

On the fifth day of Christmas, my true love sent to me five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the sixth day of Christmas, my true love sent to me six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the seventh day of Christmas, my true love sent to me seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the eighth day of Christmas, my true love sent to me eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the ninth day of Christmas, my true love sent to me nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the tenth day of Christmas, my true love sent to me ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the eleventh day of Christmas, my true love sent to me eleven double meanings, ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.

On the twelfth day of Christmas, my true love sent to me twelve domain outliers, eleven double meanings, ten coding errors, nine duplications, eight mis-spelled surnames, seven free-text columns, six late arrivals, five golden records! Four audit logs, three null values, two missing files and a ragged hierarchy.
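
(And if you fancy going looking for a few of those gifts in your own data, a minimal SQL sketch along these lines would do it. The table and column names here are entirely invented for illustration.)

-- Illustrative checks only; customer, sales_order and ref_order_status are made-up names.

-- Three null values: rows missing a value in a column that should always be populated
SELECT COUNT(*) AS null_count
FROM customer
WHERE email_address IS NULL;

-- Nine duplications: customers recorded more than once
SELECT surname, date_of_birth, COUNT(*) AS occurrences
FROM customer
GROUP BY surname, date_of_birth
HAVING COUNT(*) > 1;

-- Twelve domain outliers: status codes that aren't in the reference domain
SELECT order_id, order_status
FROM sales_order
WHERE order_status NOT IN (SELECT status_code FROM ref_order_status);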

Season's gratings!

ADD

(This post also to be published on the MIKE2.0 Open Methodology site)

Sunday 15 December 2013

Opening Pandora’s Box


Is it too late for Data Governance?
I responded recently to a question in a post on LinkedIn, on whether Data Governance should drive how Data Management evolves.

In summary, the consensus on the thread was “Yes, it should.” (Life isn’t quite that simple of course, especially when the Data Governance function probably doesn't even exist in the first place!) The other consideration was that any activity in the data management space really should link back to, and be driven by, a clear business need. (See my earlier post “Information as a Service” for more on that subject.)

“What’s that got to do with Pandora’s Box?” I hear you ask. Bear with me…

In responding to the thread, I started thinking more generally about the need for Data Governance as an organisational capability. In particular, I began comparing the situation as it used to be when I entered the workforce with the scenarios we find ourselves dealing with today. (Hark at Old Man Duncan…)

And after a bit of pondering, I’ve come to the conclusion that in many respects, we may well have actually gone backwards in terms of data quality and business responsibility for data!

Some perspective. I started working in 1992, when server-based database applications running on DEC/VAX and early Unix minicomputers were the order of the day. Users accessed the systems using green-screen terminals with forms-based, text-only interfaces, exclusively using a keyboard. PCs had evolved to the point where DOS-based personal productivity applications were reasonably prevalent (WordPerfect was the word processor of choice, and SuperCalc was the market-leading spreadsheet). Programming was all done with text editors, and there was not a “Window” ™ or mouse in sight.

In hindsight, I guess I was fortunate, in that I joined the “new wave” of business systems developers working with 3GLs and RDBMS – Oracle V6, Forms 3.0, Pro*C and ReportWriter, to be precise.

And SQL. Lots, and lots and LOTS of SQL.

Basically, I learned an awful lot about how to decompose, model, structure and interact with data – the foundations for a career in what we now call Information Management. I guess I can be satisfied that (to date at least) I’ve not been found out. Indeed, some people actually seem to like what I’ve done and what I’m still doing. Yay me...!

Anyway, what all this meant was that a lot of effort went into developing these business systems, and more importantly, into ensuring that the data definitions were well understood. Not just for referential integrity purposes, but at the member record level too. Many businesses still had dedicated data processing teams (clerks) who were responsible for – and took pride in – the accuracy and completeness of the data.

And because this was more or less their only role, they were damned good at it; diligent, conscientious and fast. Result? High-quality data that could then be queried, reported, and acted upon by the business. Everything was pretty focussed on executing whatever process mattered, and the computer systems were simply there to speed up the recording process and ensure its rigour.
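
To make that concrete, here's a flavour of the declarative rigour that was baked into those systems – a purely illustrative sketch with invented tables, rather than anything from a real schema:

-- Illustrative only: definitions and referential integrity declared in the database,
-- not left to the goodwill of the application
CREATE TABLE customer (
    customer_id    NUMBER(10)   NOT NULL,
    surname        VARCHAR2(50) NOT NULL,
    date_of_birth  DATE,
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

CREATE TABLE sales_order (
    order_id     NUMBER(10)   NOT NULL,
    customer_id  NUMBER(10)   NOT NULL,
    order_date   DATE         NOT NULL,
    order_value  NUMBER(12,2) CHECK (order_value >= 0),
    CONSTRAINT pk_sales_order PRIMARY KEY (order_id),
    CONSTRAINT fk_order_customer FOREIGN KEY (customer_id)
        REFERENCES customer (customer_id)
);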

Fast-forward twenty years and the world looks like a pretty different place.

We’re living in a mobile, connected, graphical, multi-tasking, object-oriented, cloud-serviced world, and the rate at which we’re collecting data is showing no sign of abatement. If you’re in Information Management, then that’s got to be a good thing, right?!

Not so fast young Grasshopper...

While the tools, technologies and methods available to us are so much more advanced and powerful than those green-screen, one-size-fits-all centralised systems of the mid-eighties and early nineties, I think our progress has come at a significant (unacknowledged or even unrecognised) cost. Distributed computing, increased personal autonomy, self-norming organisations and the opportunity for self-service were meant to lead to better agility, responsiveness and empowerment. The trade-offs come in the form of diluted knowledge, hidden inefficiencies, reduced commitment to discipline and rigour, and unintended consequences.

And people these days have the attention span of the proverbial goldfish. (Ooh look, a bee…)

Which leads me to my conclusion: that the advent of "Data Governance" as an emerging discipline (and indeed other forms of governance – process, architectural, security, etc.) could be considered a reactionary attempt to introduce a degree of structure, moderation and resilience into this ever-evolving state of business entropy.

Can we succeed? Or are we trying to close the data version of Pandora's Box…?

We can but hope.

Friday 6 December 2013

The Religious Warfare of data modelling


Do you need to choose between 3NF and Dimensional techniques?
In reply to a recent post on LinkedIn, I commented on the question of using 3rd Normal Form (3NF) vs Dimensional modelling as the underpinning design technique for the core data warehouse (often characterised as “Inmon vs Kimball”).

If I had a dollar for every time I’ve had a discussion with a data modeller or architect on this, then – well, let’s just say I’d have a few dollars more than I’ve got right now. (Having kids didn’t help the bank balance either.) There’s still a lot of emotional investment tied up in this, and some pretty entrenched positions, to the point where it can almost seem like religious warfare rather than healthy debate. The architectural zealots are out in force, the modelling jihadists seek to impose one approach or another, and data dogma takes precedence over the enlightenment of rationality and pragmatism. (One time my colleague Matthew De George and I were discussing the collective nouns for different groups of things, and Matthew suggested that the term for a group of data architects should be an “argument”…)

And like most wars of doctrine, we’re really fighting about variations on the same theme, rather than fundamental differences of belief. (As the British satirical TV show Spitting Image once put it, “My God is Bigger than Your God.”)

Now, when database technologies were still in their infancy with respect to supporting data warehousing functions (as distinct from OLTP applications), there was some degree of merit in choosing one approach or the other. Certain DBMS platforms were better suited to one approach than the other if you wanted to achieve an acceptable level of query performance.

So we've continued to spend the past twenty-odd years looking at the engineering of the corporate data factory [(c) Bill Inmon…] and there is absolutely still a place for a rigorous, robust, integrated, consistent and compliant point-of-access for reporting and analysis. Unless of course you’re a Kimballite (Kimballist? Kimballian?), and reckon on dealing with any data compliance issues that may arise with a devil-may-care nonchalance.

Except I don’t think it’s as simple as sticking to one approach or the other, at least not any more. The physical structuring of data isn’t nearly as important as it once was, at least, not in the first instance (Hadoop, anyone?!). Database technologies have moved on a whole heap in the last 20 years and the advent of the data warehouse appliance means that sheer technological brute force can derive excellent query performance without further ado.

But more to the point, many analytical scenarios actually only need an indicative level of accuracy to steer the business decision-making process - and in those cases, a quick-and-dirty "sandpit" delivery may well be more appropriate. If there's one thing I've learned in this data management game of ours, it's that delivering an early outcome gains goodwill and buys you the time to then go on and engineer a repeatable operational solution – you can do “quick and dirty” and “fully engineered” in parallel. (What I sometimes call the "do it twice, do it right" approach.)

So, what we should be giving our careful thought to is not so much the design method per se, but rather the delivery trade-offs: delivering iterations of the analytical solution quickly and responsively to reduce time-to-value, versus the costs to rigour, resilience and auditability (sometimes referred to as "technical debt"). This means delivering incrementally, being pragmatic, and avoiding unnecessary complexity wherever possible.

In most circumstances, I'd say a hybrid approach is the way to go.

(As an example, I remember one situation a few years back when I was working for a UK telecoms retailer and service provider. Out of the blue, an entrepreneurial business came to market with a very cheap handset-and-calls bundle, marketed in a way that was clearly targeted at eating our lunch. Our usual EDW process would have taken about six weeks to engineer the scenario models that we needed to evaluate the impact of this threat. Instead, we dumped an ad hoc copy of our call records, re-processed them based on the competitor's new tariff, and responded with a lower-priced counter-offer to our customers that meant we retained the customer base and killed the competitor's move stone dead. We then took what we’d learned from dealing with the one-off scenario to engineer a tariff modelling solution which enabled us to deliver what-if analysis as a standard function of the warehouse.)
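
For illustration only – with invented table names and a grossly simplified tariff structure, not the actual systems involved – that kind of ad hoc re-processing amounts to little more than a query like this:

-- Hypothetical sketch: re-price a copy of the call records under the competitor's tariff
-- and flag the customers who would have been better off switching
SELECT c.customer_id,
       SUM(c.billed_amount)                    AS revenue_as_billed,
       SUM(c.call_minutes * t.rate_per_minute) AS revenue_on_competitor_tariff
FROM call_record_copy c
JOIN competitor_tariff t
  ON t.call_type = c.call_type
GROUP BY c.customer_id
HAVING SUM(c.call_minutes * t.rate_per_minute) < SUM(c.billed_amount);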

Also, don't forget that the techniques of 3NF and dimensional modelling are not only about delivering a physical design!

In this respect, I think sometimes that Ralph Kimball has actually done us all a bit of a disservice in conflating the logical modelling and physical modelling processes into a single step.

Logical modelling of data is about developing a rigorous and shared understanding of the nature of your data – its semantic meaning – and Bill Inmon still makes an excellent case for the need for a structured approach to deriving this understanding. Both normalisation and dimensional techniques are vital to properly inform, analyse and communicate a shared and rigorous intellectual understanding of your data. Therefore BOTH methods need to be applied to your business scenario, before you then make any decisions about the design approach that you will adopt within the physical data management environment.

It is a separate step to then make the physical design choices about how to engineer your data structures, and the decisions will be entirely dependent upon the type of products used within your solution.
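
As a purely illustrative sketch (the entities and columns are invented), here is the same simple sales scenario expressed both ways; choosing which of them to physically deploy is exactly that separate, product-dependent step:

-- 3NF: each fact stated once, relationships made explicit
CREATE TABLE product (
    product_id    NUMBER(10)    PRIMARY KEY,
    product_name  VARCHAR2(100) NOT NULL,
    category_id   NUMBER(10)    NOT NULL
);

CREATE TABLE order_line (
    order_id    NUMBER(10),
    line_no     NUMBER(4),
    product_id  NUMBER(10) REFERENCES product (product_id),
    quantity    NUMBER(6)  NOT NULL,
    PRIMARY KEY (order_id, line_no)
);

-- Dimensional: one fact table surrounded by denormalised dimensions
CREATE TABLE dim_product (
    product_key   NUMBER(10) PRIMARY KEY,
    product_name  VARCHAR2(100),
    category_name VARCHAR2(50)
);

CREATE TABLE fact_sales (
    date_key     NUMBER(8),
    product_key  NUMBER(10) REFERENCES dim_product (product_key),
    quantity     NUMBER(6),
    sales_value  NUMBER(12,2)
);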

Amen.