Just about a month ago I was in the middle of my final project at a Data Analytics Bootcamp. We were free to choose any subject and any tools we had learned or wanted to learn before finishing the Bootcamp.
At the beginning of the Bootcamp I struggled to choose an interesting topic to which I could apply my newly obtained skills and knowledge. Everything within the scope of my interests seemed too abstract and unrelated to what I was studying. But the more I learned about wrangling data, the more I realized that any topic is interesting from a data analysis point of view.
So choosing the subject of my final project was easy: I like good food and drinks, and I make cocktails at home from time to time, especially during Corona times, so why not try to develop a machine that can create cocktails from scratch?
Implementation of any project starts with obtaining an appropriate dataset. There are two main ways to do so:
- collect data from original sources;
- find ‘ready-to-use’ datasets.
To save time I decided to go with the second option. Luckily, there was a choice of datasets available on Kaggle.
During my short (so far) career in Data Analytics I haven’t seen a single real-life dataset that is immediately ready for use, and the dataset I picked for my project was no exception. I’m without a doubt grateful to its author for the huge amount of work this person did scraping the recipes of about 600 cocktails from all across the Web. I hope my work of cleaning and unifying the data added value to this dataset, which is why I decided to share it with anyone who finds it useful: Cocktail Data. Apart from aligning measurements across all recipes, I also unified the ingredients so that a specific brand of, let’s say, gin or soda became irrelevant. This was an important condition for the algorithm I used.
As the article’s title suggests, I decided not to use machine learning techniques for this project. One reason was the size of the dataset: after cleaning I had fewer than 500 cocktails, which is very little for a machine learning model to do a proper job. The other reason was that by that moment I already had a high-level idea of how my machine should choose ingredients for a brand new cocktail, and it didn’t involve machine learning.
Actually, it was a pretty simple and straightforward mechanism, a sort of limited randomization. The algorithm selects a random ingredient from the top 25% most common ingredients, then picks a partner for this ingredient based on all pairs known across the whole dataset, then picks the third ingredient based on ingredient #2 using the same logic, and so on until the limit on the total number of ingredients is reached. As a small update, I also added an option for the user to select the very first ingredient of the cocktail.
1. Preparation of dataset
To implement the algorithm described above, I first needed pairs of ingredients. The Python library nltk (Natural Language Toolkit) helped me a lot with this task. Once the recipes of all cocktails had been split into pairs, I could identify the most popular ones.
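The pair-counting idea can be sketched as follows. This is only an illustration, not the project's code: the article does not say which nltk function was used, so the sketch uses `itertools.combinations` from the standard library and assumes that a "pair" means any two ingredients appearing in the same recipe (nltk's `bigrams` would give only adjacent pairs). The `recipes` dictionary and the `count_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy recipes standing in for the cleaned dataset.
recipes = {
    "Gin Fizz": ["gin", "lemon juice", "sugar syrup", "soda water"],
    "Tom Collins": ["gin", "lemon juice", "sugar syrup", "soda water"],
    "Daiquiri": ["white rum", "lime juice", "sugar syrup"],
    "Gimlet": ["gin", "lime juice", "sugar syrup"],
}


def count_pairs(recipes):
    """Count how often each unordered ingredient pair shares a recipe."""
    counts = Counter()
    for ingredients in recipes.values():
        # Sorting makes ("gin", "lime juice") and ("lime juice", "gin")
        # map to the same key; set() ignores repeated ingredients.
        for pair in combinations(sorted(set(ingredients)), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(recipes)
# The most frequent pairs come first.
print(pair_counts.most_common(3))
```

With the pairs counted once, both the "top 25%" shortlist and the full list of known pairs can be derived from the same `Counter`.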
2. Choice of ingredients
Below are the steps that the algorithm performs to select ingredients for a cocktail:
- define the total number of ingredients as a random choice from the range 3–6;
- randomly pick one of the ingredients included in the top 25% of pairs;
- find a suitable pair for this ingredient among all pairs, not only the top 25%;
- do the same for the next ingredient, but check that it is not already included;
- repeat the previous step until the total number of ingredients is reached.
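The steps above could look roughly like this in Python. It is a sketch under assumptions, not the original implementation: the `pair_counts` data is made up, the helper names are hypothetical, partners are weighted by how often the pair occurs (the article does not say whether the choice is weighted), and the loop simply stops early if an ingredient has no unused partner.

```python
import random
from collections import Counter

# Hypothetical pair counts; in the project these come from the dataset.
pair_counts = Counter({
    ("gin", "sugar syrup"): 3,
    ("gin", "lemon juice"): 2,
    ("lemon juice", "sugar syrup"): 2,
    ("lime juice", "sugar syrup"): 2,
    ("gin", "soda water"): 2,
    ("gin", "lime juice"): 1,
    ("lime juice", "white rum"): 1,
    ("sugar syrup", "white rum"): 1,
})


def top_quartile_ingredients(pair_counts):
    """Ingredients that occur in the 25% most frequent pairs."""
    ranked = [pair for pair, _ in pair_counts.most_common()]
    cutoff = max(1, len(ranked) // 4)
    return sorted({ing for pair in ranked[:cutoff] for ing in pair})


def partners(ingredient, pair_counts):
    """Known partners of an ingredient, repeated once per occurrence."""
    found = []
    for (a, b), n in pair_counts.items():
        if a == ingredient:
            found.extend([b] * n)
        elif b == ingredient:
            found.extend([a] * n)
    return found


def pick_ingredients(pair_counts, first=None, rng=random):
    """Chain ingredients: each new one is a partner of the previous one."""
    total = rng.randint(3, 6)  # both ends inclusive
    if first is None:
        first = rng.choice(top_quartile_ingredients(pair_counts))
    chosen = [first]
    while len(chosen) < total:
        options = [p for p in partners(chosen[-1], pair_counts)
                   if p not in chosen]
        if not options:
            break  # assumption: accept a shorter recipe on a dead end
        chosen.append(rng.choice(options))
    return chosen


print(pick_ingredients(pair_counts))
print(pick_ingredients(pair_counts, first="white rum"))
```

Passing `first` corresponds to the option of letting the user choose the first ingredient; passing a seeded `random.Random` as `rng` makes a run reproducible.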
3. Identification of volume for each ingredient
For this step I first created lists of volumes per ingredient. I deliberately did not reduce these lists to unique values, so that the most common volumes had a higher chance of being picked by the algorithm. Then the algorithm randomly chose a volume for each ingredient and added this information to the generated recipe.
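To illustrate why keeping duplicates matters, here is a small sketch with made-up data (the `rows` list and both helper names are hypothetical). Because a volume that appears twice in the list is twice as likely to be drawn, a plain `random.choice` already behaves like a frequency-weighted pick.

```python
import random
from collections import defaultdict

# Hypothetical (ingredient, volume in ml) rows taken from cleaned recipes.
rows = [
    ("gin", 45), ("gin", 45), ("gin", 60),
    ("lemon juice", 30), ("lemon juice", 15),
    ("sugar syrup", 15),
]


def volumes_by_ingredient(rows):
    """Collect every observed volume per ingredient, duplicates included."""
    volumes = defaultdict(list)
    for ingredient, ml in rows:
        volumes[ingredient].append(ml)
    return volumes


def assign_volumes(ingredients, volumes, rng=random):
    """Draw one observed volume for each ingredient of a new recipe."""
    return {ing: rng.choice(volumes[ing]) for ing in ingredients}


volumes = volumes_by_ingredient(rows)
print(assign_volumes(["gin", "lemon juice", "sugar syrup"], volumes))
```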
4. Selection of garnish
Maybe some of you already know that garnish is a very important part of a good cocktail. That’s why I decided to keep this step in my algorithm, although most of the recipes in my dataset didn’t include a garnish. To make this step possible, I identified the combination of the main ingredient and the garnish for each cocktail in my dataset. Again, I did not reduce the combinations to unique values, so that the most common combinations kept their proportion. Then the algorithm randomly selected a garnish based on the main ingredient of the newly created recipe.
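A possible shape for the garnish step is sketched below. The data and the helper names are hypothetical, and returning `None` when a main ingredient has no known garnish is an assumption; the article does not describe how that case is handled or how the "main ingredient" is determined.

```python
import random
from collections import defaultdict

# Hypothetical (main ingredient, garnish) rows; None = no garnish listed.
rows = [
    ("gin", "lemon twist"), ("gin", "lemon twist"), ("gin", "olive"),
    ("white rum", "lime wheel"),
    ("vodka", None),
]


def garnishes_by_main(rows):
    """Collect observed garnishes per main ingredient, duplicates included."""
    garnishes = defaultdict(list)
    for main, garnish in rows:
        if garnish is not None:
            garnishes[main].append(garnish)
    return garnishes


def pick_garnish(main, garnishes, rng=random):
    """Draw a garnish seen with this main ingredient, or None if unknown."""
    options = garnishes.get(main)
    return rng.choice(options) if options else None


garnishes = garnishes_by_main(rows)
print(pick_garnish("gin", garnishes))
print(pick_garnish("vodka", garnishes))
```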
5. Specification of ingredients
If you remember, part of the data cleaning was the unification of ingredients across the whole dataset to eliminate the impact of specific brands mentioned in some recipes. As the last step in creating a new cocktail I decided to add this variety back. The process was similar to selecting volumes: first I generated lists of available brands for each type of ingredient, and then the algorithm randomly selected one and updated the recipe.
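As a rough sketch of this last step, assume a mapping from each generic ingredient to the brand names removed during cleaning. The placeholder brand names, the `brands` dictionary and the `add_brands` helper are all hypothetical; ingredients without known brands keep their generic name.

```python
import random

# Hypothetical mapping: generic ingredient -> brand-specific names.
brands = {
    "gin": ["Brand A gin", "Brand B gin"],
    "soda water": ["Brand C soda water"],
}


def add_brands(recipe, brands, rng=random):
    """Swap generic ingredient names for a randomly chosen specific one."""
    branded = {}
    for ingredient, volume in recipe.items():
        options = brands.get(ingredient)
        name = rng.choice(options) if options else ingredient
        branded[name] = volume
    return branded


print(add_brands({"gin": 45, "lemon juice": 30, "soda water": 60}, brands))
```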
I must say, the result of my project exceeded my expectations. First of all, the algorithm generated really decent cocktails. Of course, there were some funny recipes.
For example, my favorite one was:
- Light cream 15ml;
- Egg white 45ml;
- pinch of sugar.
Looks more like a liquid breakfast for people in a rush than a proper cocktail, doesn’t it?
But in most cases the generated cocktails were balanced and tasty.
Secondly, this project demonstrates that you don’t have to use a complex machine learning model every time the word ‘create’ appears in your task. Sometimes simple algorithms can handle these tasks as well.
And the last (but not least) outcome is that I learned a lot of new things that were outside the scope of the Bootcamp. Mostly this relates to the GUI (Graphical User Interface), which I decided to implement at the last minute because I really wanted to present my project as a simulation of an app.
If you are curious about the technical details or would like to try the generator yourself, feel free to check out the GitHub repository of this project: AI Mixologist.