I had one night to decide whether spending a large amount of time on graph databases would be fruitful.
I spent a large amount of time last year developing a recommendation system for a Telecom client’s users; it turned out to be a massively difficult problem to undertake and accomplish in a short, stipulated time. Last week I was faced with a similar-sized problem and had a short turnaround time for devising an initial strategy. I was well aware of the landmines that data-driven methods carry, so I wanted to test another approach.
Last year, someone had mentioned Neo4j DB for Recommendation System but I didn’t pay much heed to it. I heard about Neo4j for the first time in 2016 when I downloaded it along with the Panama Papers data to expose the shameless tax avoiders and owners of the overseas shelf companies from my country. I ran a query here and there for an hour and a half while sitting at Cafe Jax on 84th St. but eventually, it met the fate of most hobby projects — swept under the carpet in an oblivious dimension.
Fast forward to propitious 2020, I decided to delve a little into Graph databases with the following factors to consider:
1. Run a truncated business problem rather than some tutorial.
2. Can I get meaningful information in a short amount of time?
3. How scalable is this entire thing on my moderately powerful machine?
4. How flexible is it in comparison to the pythonic approach (data manipulation, feature generation etc.)?
I started at around 7:30 pm, shortly after dinner and finished at around 5:20 am. I relied heavily on the docs and examples on the documentation website.
1. Download and Install
https://neo4j.com/download/ is where one can find the installer; one has to fill in a small form before it lets you proceed to the download (another data collection ploy).
After installation, you will be welcomed with open arms into the Neo4j community (at least that’s what the prompts on the screens say), and if you don’t have time for faffing around, you will diligently close all such paraphernalia and get straight to business.
2. Set up a Graph DB
Create a new Project and, inside it, a database: click on Add Database.
Give it a username and password of your choice and click on Start. Once it is running, click on Open.
3. Data prep and Data location
First of all where to place your data? If you are using macOS then /Users/<your user folder>/Library/Application Support/com.Neo4j.Relate/Data/dbmss/<folder related to the DB you created above>/import/
Place your .csv files in the import folder.
<folder related to the DB you created above> — If it is your first project then you will have only one folder under /dbmss, so place your .csv in the /import there nonchalantly and audaciously.
(Only for mac users: The above folder is much easier to find on Windows or Linux as in macOS the /Users/<your user folder>/Library is hidden, so you can type /Users/<your user folder>/Library in spotlight search and get to the folders)
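If hunting for the hidden folder by hand is tedious, the lookup and copy can be scripted. This is only a minimal sketch: the helper names find_import_dirs and copy_csvs are my own, and the folder layout is assumed to be the one described above (one folder per DB under dbmss, each with an import subfolder).

```python
import shutil
from pathlib import Path


def find_import_dirs(dbmss_root):
    """Return the import/ folder of every DB folder found under dbmss_root."""
    root = Path(dbmss_root)
    return sorted(p for p in root.glob("*/import") if p.is_dir())


def copy_csvs(source_dir, import_dir):
    """Copy every .csv file from source_dir into import_dir; return their names."""
    copied = []
    for csv_file in sorted(Path(source_dir).glob("*.csv")):
        shutil.copy(csv_file, Path(import_dir) / csv_file.name)
        copied.append(csv_file.name)
    return copied


# Example usage on macOS (path taken from the text above, not verified elsewhere):
# dbmss_root = Path.home() / "Library/Application Support/com.Neo4j.Relate/Data/dbmss"
# for import_dir in find_import_dirs(dbmss_root):
#     print(import_dir)
```

With a single project there should be exactly one entry in the returned list, which is the folder the .csv files go into.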
I scrubbed my data heavily and took only 1% of it for the experiment.
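For anyone who wants to cut their own data down in the same spirit, here is a rough sketch of taking a random ~1% of the rows of a CSV while keeping the header. It is not the exact scrubbing I did; the function name sample_csv, the seed and the file names in the usage comment are made up for illustration.

```python
import csv
import random


def sample_csv(src_path, dst_path, fraction=0.01, seed=42):
    """Write the header plus a random ~fraction of the data rows to dst_path.

    Returns the number of data rows that were kept.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to sample
            return 0
        writer.writerow(header)
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)
                kept += 1
    return kept


# Example usage (hypothetical file names):
# sample_csv("uses_full.csv", "uses.csv", fraction=0.01)
```

Each row is kept independently with the given probability, so the result is roughly, not exactly, 1% of the file.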
You can get all the .csv files from the GitHub here.
Service_Providers.csv contains Telecom service providers, each specialising in one of the Telco products such as Fiber, DTH, 4G LTE etc.
Uses.csv maps the service providers in the above file to the Major Telecom players (known as Local Partners) in different geographies.
Similar.csv has data on which major Telecom players are similar to each other.
4. Formulate problem statement in terms of data above
With the help of Neo4j, Data sources described above, the tooth fairy, and black magic, can we recommend service providers and products to the Major Telecom players in this B2B setting?
5. Let’s play
In step 2, you had opened the Neo4j browser. It looks something like this.
We can type commands next to neo4j$ prompt.
Just as the relational world has SQL, Neo4j has its own query language, Cypher (abbreviated here as CQL, for Cypher Query Language). I wouldn’t call it much of a pain, but I touched only a small portion of it, so what do I know?
With the three CSV files in place, I ran the following to create the nodes and relationships.
// These are three separate statements: run them one at a time,
// or together if the Browser's multi-statement editor is enabled.

// Providers, their geographies and the services they offer
LOAD CSV WITH HEADERS FROM "file:///service_providers.csv" AS row
MERGE (pName:provider_name {name: row.Provider})
MERGE (pGeog:provider_loc {name: row.Geography})
MERGE (pServs:provider_serv {name: row.Services})
MERGE (pName)-[:Located_In]->(pGeog)
MERGE (pName)-[:provides]->(pServs);

// Which Local Partner uses which provider
LOAD CSV WITH HEADERS FROM "file:///uses.csv" AS row
MERGE (clientN:client_Name {name: row.Local_Partner})
MERGE (pName:provider_name {name: row.Provider})
MERGE (clientN)-[:Uses]->(pName);

// Which Local Partners are similar to each other
LOAD CSV WITH HEADERS FROM "file:///similar.csv" AS row
MERGE (clientN:client_Name {name: row.Local_Partner})
MERGE (userN:client_Name {name: row.User})
MERGE (clientN)-[:Is_Similar]->(userN);
The sidebar of the database lists all the nodes that were created and all the relationships that exist between them.
Queries in CQL match on these nodes and use the relationships as filters.
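What MERGE does in the load statements is "create it only if it is not already there", which is why a provider that appears on many rows still ends up as a single node. The toy sketch below mimics that behaviour with plain Python sets for the service_providers.csv columns (Provider, Geography, Services). It is only an illustration of the idea, not how Neo4j works internally, and the helper names are invented.

```python
def merge_node(nodes, label, name):
    """Mimic MERGE for a node: (label, name) is stored at most once."""
    key = (label, name)
    nodes.add(key)  # adding to a set is a no-op if the key already exists
    return key


def load_service_providers(rows, nodes, edges):
    """Apply the service_providers.csv load statement to in-memory sets."""
    for row in rows:
        provider = merge_node(nodes, "provider_name", row["Provider"])
        location = merge_node(nodes, "provider_loc", row["Geography"])
        service = merge_node(nodes, "provider_serv", row["Services"])
        # Relationships are merged too, so duplicates collapse into one edge.
        edges.add((provider, "Located_In", location))
        edges.add((provider, "provides", service))


# Made-up rows: one provider offering two services in the same geography.
rows = [
    {"Provider": "P1", "Geography": "Boston", "Services": "Fiber"},
    {"Provider": "P1", "Geography": "Boston", "Services": "DTH"},
]
nodes, edges = set(), set()
load_service_providers(rows, nodes, edges)
print(len(nodes), len(edges))
```

Running the load a second time on the same sets changes nothing, which is the same reason the load statements can be re-run without duplicating the graph.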
#FunTimesBegin
Run the following command at the neo4j prompt:
MATCH (n) RETURN n
This is neat!
The visual representation tells me who is connected to whom with what underlying relationship. Such visuals can be great for storytelling and the business audience.
I suspect that this graph will look really messy when the number of nodes is high.
6. Recommendations
This graph contains all the information in the data, and we can use CQL to unearth the relationships in it: which entities are similar, what they have in common, which products they use, etc.
Let’s take the case of ‘Boston Locals’, which is one of the Major Telecom Players (known as Local Partners).
#Other partners similar to ‘Boston Locals’
MATCH (boston:client_Name {name: "Boston Locals"})-[:Is_Similar]-(partner:client_Name)
RETURN partner.name
Two other Major Players are similar to Boston Locals.
#Find products and local providers that are used by similar major players.
MATCH (boston:client_Name {name: "Boston Locals"}),
      (boston)-[:Is_Similar]-(partner:client_Name),
      (partner)-[:Uses]->(provider:provider_name),
      (provider)-[:Located_In]->(loc:provider_loc),
      (provider)-[:provides]->(serv:provider_serv)
RETURN provider.name, loc.name, collect(partner.name), serv.name, count(*) AS count
ORDER BY count DESC
In the above query, the collect function gathers the names of the matching partners into a list for each returned row.
The above query in Neo4j is a form of what the Recommendation Systems space calls Collaborative Filtering: one finds the similarity between items, between users, or between users and items, and uses it to recommend items, products, or services.
This isn’t as sophisticated as embeddings, neural networks or matrix factorisation, but if the problem isn’t esoteric then why not go for a simpler solution!
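To make the logic of that query concrete outside Neo4j, here is a simplified plain-Python sketch of the same idea: find the partners similar to a target, then rank providers by how many of those partners use them. It ignores the location and service columns that the Cypher query also returns, and the partner and provider names in the example are invented.

```python
from collections import defaultdict


def recommend_providers(target, similar_pairs, uses_pairs):
    """Rank providers used by partners that are similar to `target`."""
    # The query matches Is_Similar without a direction, so treat pairs as undirected.
    similar = set()
    for a, b in similar_pairs:
        if a == target:
            similar.add(b)
        elif b == target:
            similar.add(a)

    # Group the similar partners by the provider they use.
    partners_by_provider = defaultdict(set)
    for partner, provider in uses_pairs:
        if partner in similar:
            partners_by_provider[provider].add(partner)

    # Most widely used providers first; ties are broken alphabetically.
    ranked = sorted(partners_by_provider.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(provider, sorted(partners)) for provider, partners in ranked]


# Made-up data in the shape of similar.csv and uses.csv.
similar_pairs = [("Boston Locals", "A"), ("B", "Boston Locals"), ("C", "D")]
uses_pairs = [("A", "P1"), ("B", "P1"), ("B", "P2"), ("C", "P3")]
print(recommend_providers("Boston Locals", similar_pairs, uses_pairs))
```

Like the query, this does not filter out providers the target already uses; that would be a natural next refinement.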
7. Pythonic ways
It turns out that neo4j can interact with python via a driver.
pip install neo4j
Once that’s done, you can easily open a session on the running Neo4j DB from a Python file (make sure the DB is running, otherwise you will get ServiceUnavailable errors).
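from neo4j import GraphDatabase

# Connection details for the local database started in step 2
uri = "neo4j://localhost:7687"
user = "neo4j"
password = "hello@123"  # the password chosen when creating the database

driver = GraphDatabase.driver(uri, auth=(user, password))
session = driver.session()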
Then you can define a function that uses the above session to run queries. The file is available on my GitHub here.
One can look at the recommendations through a simple print statement.
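The file on GitHub is the reference; what follows is only a rough sketch of what such a function might look like, not a copy of it. The function name similar_partners and the query constant are my own, and the partner name is passed as a query parameter ($name) rather than pasted into the query string.

```python
SIMILAR_PARTNERS_QUERY = """
MATCH (c:client_Name {name: $name})-[:Is_Similar]-(partner:client_Name)
RETURN DISTINCT partner.name AS partner
ORDER BY partner
"""


def similar_partners(session, name):
    """Return the names of the partners similar to `name`.

    `session` is expected to be an open neo4j driver session
    (or anything with a compatible run() method).
    """
    result = session.run(SIMILAR_PARTNERS_QUERY, name=name)
    return [record["partner"] for record in result]


# Example usage with the session created above:
# for partner in similar_partners(session, "Boston Locals"):
#     print(partner)
# session.close()
# driver.close()
```

Taking the session as an argument keeps the function easy to try out against a stand-in object when the DB is not running.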
After an initial litmus test and a tiring night, I was pleasantly surprised with the results and the capabilities of Neo4j.
For the questions that I intended to find answers to:
1. Can I get meaningful information in a short amount of time?
Definitely! The visual representation is an advantage in understanding the deeper relationships in the data. It also gives a vocabulary of nodes and relationships that is easy to explain and comprehend alongside the data.
2. How scalable is this entire thing on my moderately powerful machine?
I ran it on my machine with 16 GB RAM, a 512 GB HD and a 6-core i7. I tried loading a file with 200K rows and 5 columns (all numeric data) and got a Java heap space error; I kept decreasing the file size but kept getting the error until I was down to 70K rows. I can easily use a pandas DataFrame or turicreate’s SFrame on those files on my machine without batting an eyelid. So, at the moment, I am skeptical of its scalability.
3. How flexible is it in comparison to the pythonic approach (data manipulation, feature generation etc.)?
Here I used a classic use case that can be solved with basic manipulations, but in an industrial setting with increasing complexity, similarity alone doesn’t yield good results. One needs to concoct feature spaces such as embeddings, which is possible in Neo4j, but I haven’t explored that. Neo4j Graph Data Science shows promise.
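On the Java heap space error from question 2: one thing I did not try that night is loading the rows in smaller batches from Python instead of in one big LOAD CSV. The sketch below shows the general shape of that idea, sending each batch as a $rows parameter and unwinding it in Cypher. I have not verified that it would have avoided the error on my file; the helper names are invented, and the query reuses the uses.csv columns purely for illustration.

```python
from itertools import islice

MERGE_USES_QUERY = """
UNWIND $rows AS row
MERGE (c:client_Name {name: row.Local_Partner})
MERGE (p:provider_name {name: row.Provider})
MERGE (c)-[:Uses]->(p)
"""


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(session, rows, size=1000):
    """Send rows to Neo4j one batch per query; return the number of batches."""
    count = 0
    for batch in chunked(rows, size):
        # Each run() outside an explicit transaction commits on its own,
        # so only one batch is in flight at a time.
        session.run(MERGE_USES_QUERY, rows=batch).consume()
        count += 1
    return count


# Example usage (rows as dicts, e.g. from csv.DictReader on uses.csv):
# import csv
# with open("uses.csv", newline="") as f:
#     load_in_batches(session, csv.DictReader(f), size=1000)
```

The batch size of 1000 is an arbitrary starting point, not a tuned value.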
At this moment, I would like to include Neo4j in my Data Science life cycle during the exploratory data analysis phase to form the hypotheses that I can test using the usual pythonic ways.
With the help of CQL, I can find all the records that exhibit certain characteristics and I can test the consistency of the results obtained from the classical methods.
Epilogue: It was a productive night, time to sleep!