Where do you start a data project? Often we want to ‘get into the data’. After all, we are data people; we got into this space because we like data.
That isn’t what Shaun did while working for a customer last year on a project that needed him to make sense of 13 million files. He stopped. He stopped thinking about the data and started thinking about the business. He started asking: who in the business does what? This led him to an elegant solution that the customer loved.
This blog is an update to one written by Shaun McGirr.
I was given complete discretion on how to turn millions of files into useful information for decision-makers. These files happened to record the details of academic articles: publication name, date of publication, author and so on. So I opened a few of them in Notepad++ and browsed through. My natural instinct was to open R, ingest some files and start trying to code my way to a solution. A voice in the back of my head said: “that’s a long road to nowhere useful” and so I paused for just long enough for an alternative to occur to me.
What “who does what?” means for an Agile data warehouse
I asked myself: “who does what?” If you’ve been following our blogs on Business Event Analysis and Modelling (aka BEAM✲) you might recognise that “who does what?” is how we start conversations with stakeholders about their business. We start with this question because the easiest way to measure a business is to understand the processes that drive it. Data requirements gathered this way, from the business stakeholder’s perspective, are so much more useful than pages of “reporting requirements”.
In my case, the first answer that came to mind was “journal publishes article”. So I opened a BEAM✲ template, entered those headings, and filled out the columns using the file in front of me.
I realised that the authors weren’t really part of the “journal publishes article” event but belonged to another “who does what?”: “author writes article”. Of course, my data didn’t show me when the actual writing happened, but modelling authorship as a separate event let me handle articles with multiple authors elegantly: each author gets their own row, and the publication event stays at one row per article.
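To make the split concrete, here is a minimal Python sketch of the idea rather than the code used on the project. The field names (`journal`, `title`, `published_on`, `authors`) and the sample record are hypothetical, invented purely for illustration. One raw record becomes a single “journal publishes article” row plus one “author writes article” row per author.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JournalPublishesArticle:
    """One row per published article (hypothetical columns)."""
    journal: str
    article: str
    published_on: date


@dataclass(frozen=True)
class AuthorWritesArticle:
    """One row per (author, article) pair (hypothetical columns)."""
    author: str
    article: str


def split_events(record: dict) -> tuple[JournalPublishesArticle, list[AuthorWritesArticle]]:
    """Turn one raw article record into two kinds of business event."""
    publication = JournalPublishesArticle(
        journal=record["journal"],
        article=record["title"],
        published_on=record["published_on"],
    )
    # Each author gets their own row, so multi-author articles
    # never need a repeating "author_1, author_2, ..." column.
    authorship = [
        AuthorWritesArticle(author=name, article=record["title"])
        for name in record["authors"]
    ]
    return publication, authorship


# Made-up sample record, for illustration only.
record = {
    "journal": "Journal of Examples",
    "title": "A Made-Up Article",
    "published_on": date(2015, 3, 1),
    "authors": ["A. Author", "B. Writer"],
}

publication, authorship = split_events(record)
```

Because the publication event stays at one row per article, counting articles per journal is unaffected by how many authors each article has.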
The best option is always to speak to business stakeholders, but I had to do this first part on my own, using only the data I was provided.
But what did the customer think?
I used the same completed BEAM✲ templates to cross-check my understanding of the data with the requirements of my customer.
The customer saw three things immediately about the templates:
- They represented the various parts of the complex research process without jargon or pretense
- They contained real data
- They looked like a dimensional model to the data warehousing team
All this, without me having to code my way through a jungle of files. BEAM✲ is versatile!
Until next time, keep asking better questions (like “who does what?”)
Shaun – @shaunmcgirr
Shaun blogs about analytics, machine learning and how data solves problems in the real world.
Lawrence Corr (@LawrenceCorr) and Jim Stagnitto (@JimStag) introduce the BEAM✲ methodology in their book Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema (Amazon, eBook)