This week we welcome guest blogger Bill Grant from AllGo Analytics Limited. Bill is an experienced data professional with a new business focused on data analytics for small and medium-sized businesses.
Take it away, Bill …
FACT SHEET 1: What is Market Basket Analysis?
Market basket analysis is one of several ways to find which items tend to occur together. These techniques fall under the general umbrella of association.
The outcome of this type of technique, in simple terms, is a set of rules that can be understood as: “if this, then that”.
So what kind of items are we talking about? There are many applications of association:
- Product recommendation – like Amazon’s “customers who bought that, also bought this”
- Medical diagnosis – like with diabetes
- Content optimisation – like in magazine websites or blogs or even menu design for a restaurant
- DNA genome analysis – patterns in cellular data
- Fraud detection in finance
Show me an example of its use in real life
You have 30 days of transaction receipts from your grocery store, nearly 10,000 in total. How can you use this valuable information to uncover relationships between seemingly unrelated items, i.e. what goes with what? How might you lay out your store for customer convenience?
In this example you have:
- 9,835 transactions, i.e. that many separate purchases over the month. This represents around 27 transactions per hour over 12-hour trading days
- 169 possible individual products – this is the range of different items you stock
- 33 individual items in the largest transaction – this is how many items were in the biggest sale
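As a rough illustration of where figures like these come from, here is a minimal Python sketch. The four baskets are made-up stand-ins for the real receipts, and the `summarise` helper is a hypothetical name of my own, not part of any tool mentioned in this Fact Sheet. It also redoes the transactions-per-hour arithmetic from the figures above (9,835 transactions over 30 twelve-hour days).

```python
# Toy baskets standing in for the real receipts (hypothetical data).
transactions = [
    {"whole milk", "yogurt", "curd"},
    {"whole milk", "bread"},
    {"yogurt", "curd", "whole milk", "butter"},
    {"bread", "butter"},
]


def summarise(transactions):
    """Return (transaction count, distinct item count, largest basket size)."""
    n_transactions = len(transactions)
    distinct_items = set().union(*transactions)
    largest_basket = max(len(t) for t in transactions)
    return n_transactions, len(distinct_items), largest_basket


summary = summarise(transactions)
print("transactions, distinct items, largest basket:", summary)

# Transactions per hour for the figures quoted in the text:
# 9,835 transactions over 30 days of 12 trading hours each.
per_hour = 9835 / (30 * 12)
print("transactions per hour:", round(per_hour, 1))
```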
This is too much to calculate manually or to understand visually. We want to automate the analysis, and can do so using Association Rules to find the patterns of association between items in this large data set.
With the right software and having the data in a simple table, we can easily find these relationships with minimal expert assistance and repeat the same analysis on other months’ data (i.e. we can replicate the process).
Using RapidMiner Studio (available free on-line from RapidMiner GmbH) we can set up the linked set of analytic modules and get the results we need in literally seconds. For our grocery data we see:
These are the strongest rules out of the 170 that were found. Clearly, whole milk is a key item in combination with others.
The table of rules above uses the following technical terms:
- Support: The fraction (%) of transactions in our dataset that contain the item set.
- Confidence: The probability that a rule is correct for a new transaction containing the items in the “If” column, i.e. that the “Then” items are present as well.
- Lift: The ratio by which the confidence of a rule exceeds the expected confidence, where the expected confidence is the support of the “Then” items on their own.
Note: if the lift is 1 it indicates that the items on the “If” and “Then” sides are not linked to each other. A lift greater than one means a positive relationship; a lift less than one means the items occur together less often than expected.
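To make the three measures concrete, here is a minimal Python sketch that computes them for one rule on a tiny made-up dataset. The four baskets and the function names are assumptions for illustration only; they are not the grocery data and not RapidMiner's own implementation.

```python
# Toy baskets (hypothetical data, not the real grocery receipts).
transactions = [
    {"yogurt", "curd", "whole milk"},
    {"yogurt", "curd", "whole milk"},
    {"whole milk", "bread"},
    {"bread"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, if_items, then_items):
    """Of the transactions containing the If items, the share that also contain the Then items."""
    both = set(if_items) | set(then_items)
    return support(transactions, both) / support(transactions, if_items)


def lift(transactions, if_items, then_items):
    """Confidence divided by the expected confidence (the support of the Then items)."""
    return confidence(transactions, if_items, then_items) / support(transactions, then_items)


rule_if, rule_then = {"yogurt", "curd"}, {"whole milk"}
print("support:   ", support(transactions, rule_if | rule_then))
print("confidence:", confidence(transactions, rule_if, rule_then))
print("lift:      ", lift(transactions, rule_if, rule_then))
```

On this toy data the lift for “yogurt and curd, then whole milk” comes out above 1, which is the positive relationship described in the note.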
The graph below shows the relationship rules visually. For example, in the top right-hand corner we have yogurt and curd together (the “If”), linking to whole milk (the “Then”).
Technical Process Visualisation
- 1. “Read CSV” – import the data in csv (from Excel) format
- 2. “Numerical to Binomial” – convert the data, which was a series of 1s and 0s for whether each item from the range is in a transaction (1) or not (0), to the equivalent “true” and “false”. This change of format is required for the next step.
- 3. “FP-Growth” – pass the data through this algorithm module to generate the frequent item sets (FP = frequent pattern; the algorithm uses a tree data structure)
- 4. “Create Association Rules” – use the output from step 3 to generate the “If, Then” rules shown above in this Fact Sheet, along with the technical calculations (lift and the others) that describe the patterns found
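For readers who prefer code to a visual workflow, the four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the RapidMiner process itself: the CSV text is a made-up five-row sample, the function names are my own, and step 3 uses simple brute-force enumeration in place of FP-Growth. On small data it finds the same frequent item sets, just far less efficiently.

```python
import csv
import io
from itertools import combinations

# Hypothetical stand-in for the exported spreadsheet:
# one row per transaction, one 0/1 column per product.
CSV_TEXT = """whole milk,yogurt,curd,bread
1,1,1,0
1,1,1,0
1,0,0,1
0,0,0,1
1,1,0,0
"""


def read_baskets(text):
    """Steps 1 and 2: read the CSV and turn each 0/1 row into the set of items present."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        frozenset(item for item, flag in row.items() if flag.strip() == "1")
        for row in reader
    ]


def frequent_itemsets(baskets, min_support):
    """Step 3 stand-in: brute-force enumeration (NOT FP-Growth) of frequent item sets."""
    items = sorted(set().union(*baskets))
    found = {}
    for size in range(1, len(items) + 1):
        any_at_this_size = False
        for combo in combinations(items, size):
            supp = sum(set(combo) <= b for b in baskets) / len(baskets)
            if supp >= min_support:
                found[frozenset(combo)] = supp
                any_at_this_size = True
        if not any_at_this_size:
            break  # no frequent set of this size means none of any larger size
    return found


def create_rules(itemsets, min_confidence):
    """Step 4: split each frequent item set into If -> Then and score the rule."""
    rules = []
    for itemset, supp in itemsets.items():
        for k in range(1, len(itemset)):
            for if_tuple in combinations(sorted(itemset), k):
                if_part = frozenset(if_tuple)
                then_part = itemset - if_part
                conf = supp / itemsets[if_part]
                if conf >= min_confidence:
                    rule_lift = conf / itemsets[then_part]
                    rules.append((if_part, then_part, supp, conf, rule_lift))
    # Strongest rules (highest lift) first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


baskets = read_baskets(CSV_TEXT)
itemsets = frequent_itemsets(baskets, min_support=0.4)
rules = create_rules(itemsets, min_confidence=0.8)
for if_part, then_part, supp, conf, rule_lift in rules:
    print(sorted(if_part), "->", sorted(then_part),
          f"support={supp:.2f} confidence={conf:.2f} lift={rule_lift:.2f}")
```

The `min_support` and `min_confidence` thresholds here are arbitrary values chosen for the toy sample; on real data they would need tuning to keep the number of rules manageable.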
RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and industrial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the data mining process including results visualization, validation, and optimization.
RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Each workflow is called a “Process” in RapidMiner and consists of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one.
In 2014, Gartner Research placed RapidMiner in the leader quadrant of its Magic Quadrant for Advanced Analytics. The report described RapidMiner’s strengths as a “platform that supports an extensive breadth and depth of functionality, and with that, it comes quite close to the market Leaders.” RapidMiner has received over 3 million total downloads and has over 200,000 users, including eBay, Intel, PepsiCo and Kraft Foods as paying customers. [Sourced from Wikipedia, September 2015]
Graphical rule visualisation used: KK Layout format
Thanks Bill for being a guest blogger.