Called "Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London", it came out in 2020 and, as its name suggests, it studies the food consumption of Londoners based on Tesco purchases. Tesco is by far the largest grocery store chain in the UK. Here are some key numbers from the study.
- 420M Transactions
- 1.6M Card Owners
- 411 Stores
- 2015 Yearly Data
All the stores are within the boundaries of the Greater London Area. The purchases are recorded through the customers' Clubcard loyalty number and then anonymized. The data is aggregated at three different levels.
- Lower Layer Super Output Areas (LSOA): around 2000 residents each, by design
- Middle Layer Super Output Areas (MSOA): around 7200 residents each
- Wards: an even larger aggregation
The nutritional facts of the purchased food were extracted and then averaged for each area. This allowed the researchers to compute the typical food consumption of a resident in a specific area.
The researchers then compared this data to other sources, such as the prevalence of diabetes and obesity in each area, and computed the correlation between the average nutritional facts and the prevalence. This led to conclusions such as: areas with a high-carb diet tend to have a higher prevalence of diabetes.
After reading this study, we were left wanting more and felt the data at hand had more potential.
What if we could take it further?
We noticed that some of the data was completely unused. One example was alcohol consumption.
Well, let's dive into some data to figure this out.
The idea is to study whether the alcohol consumption rate is related to some social characteristics of an area. To do so, we decided to use several indicators provided by the UK Office for National Statistics:
- Unemployment Rate: % of unemployed people in a given area
- Mean Annual Household Income estimate: Given in £ for each area
- Median Annual Household Income estimate: Given in £ for each area
- No Qualification: % of people without official qualification in a given area
- High Qualification: % of people with a qualification level of 4 or above in a given area
To show and quantify possible relations, we computed Spearman correlations between the quantity of alcohol (in g) in a typical product of an area and the indicators described above. The results are plotted in Figure 1.
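For readers who want to reproduce this kind of computation, here is a minimal sketch of a Spearman correlation between an alcohol column and a few indicators. The table, its column names and its values are made up for illustration and are not taken from the dataset; `spearman_table` is a hypothetical helper, not the code from our notebook.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy per-area table; column names and values are invented for illustration.
areas = pd.DataFrame({
    "alcohol": [0.10, 0.15, 0.22, 0.30, 0.41, 0.55],
    "unemployment_rate": [9.1, 8.4, 6.0, 5.2, 4.9, 3.0],
    "median_income": [28000, 31000, 36000, 42000, 47000, 60000],
})

def spearman_table(df, target, indicators):
    """Return {indicator: (rho, p_value)} for the Spearman correlation with `target`."""
    out = {}
    for col in indicators:
        rho, p_value = spearmanr(df[target], df[col])
        out[col] = (float(rho), float(p_value))
    return out

result = spearman_table(areas, "alcohol", ["unemployment_rate", "median_income"])
for name, (rho, p_value) in result.items():
    print(f"{name}: rho={rho:.2f}, p={p_value:.3g}")
```

Spearman works on ranks, so it only measures whether two quantities move together monotonically. It does not assume the relation is linear.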
We found the results a bit surprising at first. We thought that people with a lower income and education would be more inclined to have a higher alcohol consumption. We have difficulty figuring out where this preconceived opinion comes from. The results are clear. High negative correlation for the unemployed and unqualified and high positive correlation for the employed, qualified and wealthy. So have we answered the question? Not so fast.
Our dataset could be biased. What if a higher percentage of people shop at Tesco in affluent neighborhoods? This would imply higher recorded alcohol purchases in those areas and that Tesco is not as popular in other neighborhoods. Luckily for us, this can also be checked, as the data includes a representativeness score. This score tells us how representative the data is for a given area. It equals the number of customers divided by the total number of residents. We ran a Spearman correlation between the representativeness of an area and its Median Annual Household Income.
We get a result of R = -0.03
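As a rough sketch of this check, the snippet below computes a representativeness score as customers divided by residents and correlates it with income. The counts and incomes are hypothetical toy values, so the coefficient it prints has nothing to do with the R reported above.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-area values (not taken from the dataset).
customers = np.array([300, 120, 450, 200, 90])
residents = np.array([1500, 1600, 1800, 2000, 1700])
median_income = np.array([30000, 52000, 41000, 38000, 45000])

# Representativeness: share of residents who appear as customers.
representativeness = customers / residents

rho, p_value = spearmanr(representativeness, median_income)
print(f"rho={rho:.2f}, p={p_value:.3g}")
```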
This is quite a low correlation compared to the values we got in Figure 1. It tells us that Tesco's reach barely depends on how affluent an area is, so this hypothesis does not hold. So how can we explain those results?
One way of interpreting such results is to consider that alcohol products are generally expensive, so well-off people can more easily afford them. Another possible explanation for those results is that Tesco is generally a more expensive grocery store than some other chains out there, and that people with a smaller income tend to be more cautious with their money and might go to a cheaper store to buy alcoholic beverages.
Is there anything else we can extract from our dataset?
We also notice that we have the average diet of each area. An obvious and interesting question we can ask is
To answer this question we need to define first what a healthy diet is.
There are many definitions out there, coming from sources ranging from health gurus to fitness magazines. We arbitrarily chose the following recommendations from the WHO, but note that the analysis could be applied to any other values by changing them in our Jupyter Notebook.
For a person of normal weight consuming about 2000 kcal per day:
To obtain the calorie intake at the nutrient level, we used the conversion factors set by EU directive 90/496/EEC: 4 kcal per gram for proteins and carbohydrates, and 2 kcal per gram for fibers. We then considered an average body weight of 69 kg (4) for a resident of the UK. This gives 0.83 g/kg · 69 kg · 4 kcal/g = 229 kcal, which corresponds to a protein intake of 11.5% of a mean intake of 2000 kcal/day.
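The protein arithmetic above can be checked with a few lines of Python. The numbers are the ones quoted in the text; the constant and function names are only illustrative.

```python
KCAL_PER_G_PROTEIN = 4    # conversion factor quoted above (kcal per gram)
PROTEIN_G_PER_KG = 0.83   # protein intake per kg of body weight used in the text
BODY_WEIGHT_KG = 69       # average body weight used in the text
DAILY_KCAL = 2000         # reference daily energy intake

def protein_energy_share(g_per_kg, weight_kg, kcal_per_g, daily_kcal):
    """Return (kcal from protein, fraction of the daily energy intake)."""
    protein_kcal = g_per_kg * weight_kg * kcal_per_g
    return protein_kcal, protein_kcal / daily_kcal

protein_kcal, share = protein_energy_share(
    PROTEIN_G_PER_KG, BODY_WEIGHT_KG, KCAL_PER_G_PROTEIN, DAILY_KCAL
)
print(f"{protein_kcal:.0f} kcal, {share:.1%} of daily energy")
```

Changing the body weight or the daily intake in this snippet shows how sensitive the target percentage is to those two assumptions.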
Now that we have our model diet let's visualize with histograms how the London population is doing.
Well not that brilliant, innit?
Before going further, we want to check that the study we are about to do makes sense by looking at the correlation between the mean nutritive intake and the social factors we used for alcohol. Is there a link? We plotted every pair of nutrient and social factor, which gave a lot of plots, and we could visually observe correlation in them. Here is a good example where we can observe a trendline with a negative slope.
This is a good primer for our study: it suggests that there might be some interesting things to discover and that it is worth investigating further.
We will be using a scoring system to determine how close a certain area is to the recommendations we chose. This score is a number between 0 and 1, 1 being a perfect score. Please note that this score is entirely relative to the dataset, meaning a low score indicates that an area has a bad mean diet compared to the other areas.
How do we compute this score?
We split up the scoring into one score for each of the 6 nutrients listed in our model recommendation. To compute the score for a nutrient, we first set aside the outliers, that is, the values falling outside a 2·std interval. Each outlier is scored directly: it gets a score of 1 if it lies in the zone of the recommendation and a score of 0 if it lies completely out of the zone. The remaining, cleaned data is then scored linearly: the datapoint furthest from the recommendation gets a score of 0 and the others are scaled accordingly. Let's visualize our results. Here is an example with the 'fat' scores.
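Here is one possible reading of this scoring rule as code. It is a sketch under several assumptions that the text does not spell out: the recommendation is modeled as a range `[low, high]`, the 2·std interval is centered on the mean, and every value inside the recommended range gets a score of 1. The function name, the example values and the range are hypothetical.

```python
import numpy as np

def nutrient_score(values, low, high, n_std=2.0):
    """Score each area between 0 and 1 for one nutrient (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    # Distance to the recommended range [low, high]; 0 when inside it.
    dist = np.maximum(low - values, 0) + np.maximum(values - high, 0)

    # Assumption: outliers are values further than n_std standard deviations from the mean.
    inlier = np.abs(values - values.mean()) <= n_std * values.std()

    scores = np.empty_like(values)
    # Outliers: 1 if they sit inside the recommendation, 0 otherwise.
    scores[~inlier] = (dist[~inlier] == 0).astype(float)
    # Inliers: linear scale, the inlier furthest from the recommendation gets 0.
    max_dist = dist[inlier].max() if inlier.any() else 0.0
    scores[inlier] = 1.0 if max_dist == 0 else 1.0 - dist[inlier] / max_dist
    return scores

# Hypothetical nutrient shares (in % of energy) and a hypothetical recommended range.
example = [25, 30, 33, 36, 39, 42, 90]
scores = nutrient_score(example, low=20, high=30)
print(scores)
```

Because the scale is anchored on the furthest inlier, the scores are relative to the dataset: adding or removing areas changes everyone's score.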
Once we have a score for every nutrient we simply average the nutrient scores for each area to get our final score. Let's visualize our results!
At first sight, this giant pink mosaic does not seem to carry a lot of information... But let's take a closer look! It appears that the 'healthiest' diets are found near the boundary of our map. We can also observe interesting clusters in some regions. This is good! Indeed, we already know that social characteristics have spatial patterns, because people with similar social indicators tend to live close to each other. So observing spatial patterns in food consumption is a good start when looking for a correlation between the two.
A visual idea of the data is nice to have, but now we can compare our scores with the same socioeconomic factors we used for the alcohol part! We computed the correlation between those factors and our total average score and found that, sadly, there was no significant correlation. This means that we can't say, for example, 'educated areas eat a healthier diet'.
Let's dig deeper.
Instead of sticking to a 'healthy or not' approach, we can look at smaller combinations of nutrient scores: for each socioeconomic factor, we compute the 5 combinations of scores that yield the highest correlation. Let's plot those correlations and see what we find!
Interesting, the first nutrient that pops out is fiber. The 4 plots give clear evidence that the fiber score is highly correlated with wealth and education! The same can be said when it is combined with sugar. In plain English, wealthier areas tend to consume a more suitable percentage of sugar and fibers than lower-income and lower-education areas. Interestingly, the opposite can be said for fat and saturates, where well-off neighborhoods consume too much of them. This contrasts with areas of lesser means, which don't have enough fat and saturates in their diet.
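The search over combinations could look like the sketch below. It assumes that a combination's score is the mean of its nutrient scores and that combinations are ranked by the absolute value of the Spearman coefficient. Both are assumptions made for illustration, and the helper name and toy values are hypothetical.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import spearmanr

def top_combinations(scores, indicator, k=5):
    """Rank nutrient subsets by |rho| between their mean score and an indicator."""
    results = []
    for size in range(1, len(scores.columns) + 1):
        for combo in combinations(scores.columns, size):
            combined = scores[list(combo)].mean(axis=1)  # assumed combination rule
            rho, p_value = spearmanr(combined, indicator)
            results.append((combo, float(rho), float(p_value)))
    # Strongest correlations first, regardless of sign.
    results.sort(key=lambda r: abs(r[1]), reverse=True)
    return results[:k]

# Toy per-area nutrient scores and one toy indicator (invented values).
scores = pd.DataFrame({
    "fiber": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "sugar": [0.5, 0.1, 0.4, 0.3, 0.6, 0.2],
    "fat": [0.9, 0.7, 0.8, 0.3, 0.2, 0.1],
})
income = pd.Series([25000, 30000, 35000, 40000, 45000, 50000])

top = top_combinations(scores, income)
for combo, rho, p_value in top:
    print("+".join(combo), f"rho={rho:.2f}", f"p={p_value:.3g}")
```

With 6 nutrients there are only 63 non-empty subsets, so trying all of them is cheap. Picking the best out of many candidates does inflate the chance of a spurious hit, which is one reason to look at the p-values as well.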
Let's focus on the sugar+fiber score.
We did the whole study after filtering out the areas with a representativeness lower than 0.1. Let's see what happens if we gradually remove the less representative areas and compute the Spearman correlations between the average sugar+fiber score and the social indicators. The following plots show how the Spearman correlation coefficient evolves for different threshold values on representativeness. They also show the p-value for each case, to make sure the results remain statistically significant (shown by the limit line at p=0.05).
The R coefficients increase with the threshold and the p-value stays lower than 0.05. The fact that our results improve when using more representative data is further evidence of the correlation. The p-value rises when reaching a 0.6 threshold on representativeness because the number of areas drastically decreases...
We were able to show that nutrient intake is indeed correlated with the social indicators. We discovered that the overall eating habits of Londoners are quite far from what the WHO recommends. Computing a score that takes into account the difference between nutrient consumption and the recommendations allowed us to link bad eating habits with social indicators. However, a higher income or a higher education level does not necessarily mean a better diet: it depends on which nutrients are taken into account.
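A threshold sweep of this kind can be sketched as follows. The data here is synthetic, generated with a fixed seed, so the printed numbers say nothing about London. The function name and the minimum of 3 areas per threshold are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_by_threshold(representativeness, score, indicator, thresholds):
    """For each threshold, keep areas at or above it and compute Spearman rho and p."""
    rows = []
    for t in thresholds:
        keep = representativeness >= t
        if keep.sum() < 3:  # too few areas for a meaningful correlation
            continue
        rho, p_value = spearmanr(score[keep], indicator[keep])
        rows.append((float(t), int(keep.sum()), float(rho), float(p_value)))
    return rows

# Synthetic areas: a score that is loosely tied to an indicator, plus noise.
rng = np.random.default_rng(0)
n_areas = 200
representativeness = rng.uniform(0.0, 1.0, n_areas)
indicator = rng.normal(size=n_areas)
score = 0.5 * indicator + rng.normal(size=n_areas)

rows = correlation_by_threshold(
    representativeness, score, indicator, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
)
for t, n_kept, rho, p_value in rows:
    print(f"threshold={t:.1f} areas={n_kept} rho={rho:.2f} p={p_value:.3g}")
```

Printing the number of areas kept next to each coefficient makes it easy to see when a rising p-value is simply due to a shrinking sample.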
Ken Pillonel, Master in Robotics
Nathan Kammoun, Master in Robotics
Nathan Masnada, Master in Energy