In Part One of this post, you read about what a recommender system is in Machine Learning, the different classes, and the steps involved in building a recommendation engine.
In this second and final post, we will discuss how to build a recommendation engine using Neo4j, a graph database management system.
What is Neo4j?
Essentially, Neo4j is a graph database management system that also provides tools to visualize and extract insights from the graph.
The most significant advantage of using a graph data model is that one does not have to connect entities in the data using special properties, such as foreign keys. In graph databases, it is easy to understand the relationships between entities because the stored data, although not rigidly structured, is still well-organized and easy to read.
Imagine having a system on your computer that can create a database mirroring the business model you drew on your management whiteboard. With Neo4j, we can connect all entities in a business using the graph database model.
This allows us to focus not only on the entities but also on how they interact with each other, giving a clearer view of the bigger picture.
The world we live in today is a connected ecosystem, whether it's the business you run or your Facebook account. All the entities around us are connected in some way or another through various relationships.
Unlike other database management systems, graph databases represent data as nodes and edges. The nodes represent different entities in the data, and the edges represent the relationships between them.
This helps us maintain and visualize the data in the most natural, standard form, unless you are someone like Neo from The Matrix.
Basically, these two charts represent a hypothetical example with the following pieces of information:
From the text alone, one can easily infer that relationships connect the entities. Now, let us consider the above two diagrams individually. To store the complete text information in relational tables, we had to create a star schema.
However, there are some problems with this approach. First, it becomes challenging to populate all the tables with the given information. If someone wishes to add new information, it must also be added to the corresponding dimension table.
Whereas if you carry out similar tasks in Neo4j, all you need to do is write a Cypher statement for the respective node or relationship. To update information on a node or relationship, you only need to write it to that particular node or relationship.
Moreover, as data complexity increases, the effort required to perform CRUD (Create, Read, Update, Delete) queries also increases. For example, if we wish to write a query to find out which people in the data are colleagues, then in SQL, the query would be:
SELECT e1.name, e2.name, employer.employer
FROM employee e1
JOIN employee e2 ON e1.employer_id = e2.employer_id AND e1.name <> e2.name
JOIN employer ON employer.employer_id = e1.employer_id;
And in Cypher, the same could be achieved by writing:
MATCH (p1:Person)-[:Works_at]->(c1:Company)<-[:Works_at]-(p2:Person)
RETURN p1, p2, c1
Neo4j vs MongoDB vs MySQL:
Since Neo4j is a NoSQL database, let's discuss how it differs from some of its contemporaries in its basic architecture and applications.
Now that we know how graph databases store data compared to other NoSQL databases, let us examine the specific case of Neo4j in the context of the CAP (Consistency, Availability, Partition Tolerance) trade-off, compared to other NoSQL databases.
Unlike most aggregate data stores, such as column-family, key-value, and document stores, which have BASE consistency, Neo4j has ACID consistency, ensuring that the database is entirely consistent and that only atomic and isolated transactions are executed.
Also, the results of these transactions stay unchanged even after the server fails or restarts.
To remain highly ACID-compliant, Neo4j uses a master-slave architecture for data replication, in which a single node in a cluster handles all write transactions. As a result, Neo4j falls short when performing multiple write operations simultaneously.
Here's a real-life example of the same: at my workplace, my team and I wanted to scan a relationship to determine whether each customer had at least one common event with the other 38000 customers.
Fortunately, there were only about 0.25 million relationships to write per 1000 customers; extending this to the whole database (38 batches of 1000 customers, so about 38 × 0.25 million ≈ 9.5 million), we had to write roughly 10 million new relationships.
But since Neo4j does not use data partitioning and, on top of that, uses a master-slave architecture, carrying out this task with Neo4j proved complicated.
Here is one more instance: thanks to its native graph storage, Neo4j can typically look up a specific node's relationships in roughly O(1) time. But sometimes we need to perform more than just search operations on our database.
The other day, my team and I had to churn out a handful of recommendations for customers based on their recent purchases. In my company's database, there are 38000 customers and 87000 products, and the total number of interactions between customers and products is approximately 1.7 million.
When we tried to generate recommendations for each customer using a collaborative filtering method followed by the Jaccard similarity algorithm, we had to wait approximately 6 hours for the results.
I have found that whenever Neo4j must perform computationally intensive tasks on a large dataset, it typically takes considerable time.
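To make the computation above concrete, here is a minimal pure-Python sketch of user-user collaborative filtering with Jaccard similarity. It is not the Cypher we ran; the function names, the scoring rule (summing similarities), and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(purchases: dict, customer: str, top_n: int = 3) -> list:
    """Score products the customer has not bought by the similarity of their buyers."""
    own = purchases[customer]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == customer:
            continue
        sim = jaccard(own, items)
        if sim == 0.0:
            continue  # customers with nothing in common contribute nothing
        for product in items - own:
            scores[product] += sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


# Hypothetical toy data: customer -> set of purchased product IDs.
purchases = {
    "alice": {"p1", "p2", "p3"},
    "bob": {"p2", "p3", "p4"},
    "carol": {"p1", "p5"},
    "dave": {"p6"},
}
print(jaccard(purchases["alice"], purchases["bob"]))  # 0.5
print(recommend(purchases, "alice"))  # ['p4', 'p5']
```

Note that a naive pass compares every pair of customers. With 38000 customers, that is 38000 × 37999 / 2 ≈ 722 million pairs, which may partly explain the long run time we saw.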
Setting Up Neo4j
Now that we've figured out what Neo4j is and its pros and cons, let's get down to building a recommender system for the (hypothetical) e-commerce firm from Part One. But before creating such a recommender system, you need to download the desktop version of Neo4j.
As a beginner, one should start with "Neo4j-Desktop Explore". It includes all the built-in libraries and development tools available in the Community Edition, such as the Neo4j ETL tool.
The installation process is simple and well-explained in the Installation section.
After installation, one can start using the Neo4j Browser for querying and get familiar with the user interface.
Neo4j Browser User Interface Guide
Once comfortable with Neo4j Desktop, one can start working with:
Neo4j Causal Clustering
It is well-suited for production environments because "Neo4j Causal Clustering" provides three main features:
Safety: Core servers provide a fault-tolerant platform for transaction processing that will remain available while a simple majority of Core Servers are functioning.
Scale: "Read Replicas" provide a massively scalable platform for graph queries, enabling enormous graph workloads to be executed across a widely distributed topology.
Causal Consistency: When invoked, a client application is guaranteed to read at least its own writes.
(Note: In a multiple-node cluster setup, only one node is assigned the "Leader" role while all other nodes are assigned the "Follower" role.)
On the Leader node, we can perform all operations (e.g., writing, reading), whereas a Follower node holds a copy of the Leader's data, and we can only perform read/search operations on it.
Setting Up Configurations In Neo4j
Heap Size: The Java heap is the amount of memory allocated to applications running in the JVM, so more heap generally means smoother operation. As a beginner, one can set the heap size to around 1-4 GB; in a production environment, one should provision more RAM and size the heap to make the best use of what is available.
While working on any of the above setup types, one may notice high query time in Neo4j, so the first step is to set the initial and maximum heap sizes.
(Note: Make sure the heap size is smaller than the RAM available on the system. If you assign all RAM to the Neo4j Java process through heap allocation, no RAM is left for any other process, which may cause a memory error.)
Page Cache Size: This is used to cache the Neo4j data as stored on a disk. Ensuring that most of the graph data from disk is cached in memory will help avoid costly disk access.
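As a rough illustration of the two settings above, the following Python sketch splits a machine's RAM between the heap and the page cache and prints the matching neo4j.conf lines. The 2 GB operating-system reserve and the 50/50 split are assumptions for the example, not Neo4j recommendations, and the helper names are hypothetical. The setting names are the `dbms.memory.*` ones used by Neo4j 4.x; Neo4j 5 renames them to `server.memory.*`.

```python
def memory_settings(total_ram_gb: int, os_reserve_gb: int = 2, heap_share: float = 0.5) -> dict:
    """Split the RAM left after an OS reserve between heap and page cache.

    The reserve and the split ratio are illustrative assumptions.
    """
    usable = total_ram_gb - os_reserve_gb
    if usable < 2:
        raise ValueError("not enough RAM left for Neo4j after the OS reserve")
    heap = max(1, int(usable * heap_share))
    page_cache = usable - heap
    return {
        # Same initial and max size, so the JVM does not resize the heap at runtime.
        "dbms.memory.heap.initial_size": f"{heap}g",
        "dbms.memory.heap.max_size": f"{heap}g",
        "dbms.memory.pagecache.size": f"{page_cache}g",
    }


def render_conf(settings: dict) -> str:
    """Format the settings as key=value lines for neo4j.conf."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(render_conf(memory_settings(16)))
```

For a 16 GB machine this leaves 2 GB for everything else and gives 7 GB each to the heap and the page cache. In practice the right split depends on the size of the graph on disk and on how memory-hungry the queries are.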
Setting Up Drivers For Neo4j
We can use a programming language of our choice to automate queries, run them in batches, or store multiple data tables returned by various queries into a single table, depending on the use case.
Neo4j supports a binary protocol called "Bolt". It is based on the PackStream serialization and supports the Cypher type system, protocol versioning, authentication, and TLS via certificates. For Neo4j Clusters, Bolt provides intelligent client routing with load balancing and failover.
The binary protocol is enabled in Neo4j by default, so you can use any language driver that supports it.
Neo4j officially provides drivers for .NET, Java, JavaScript, Go, and Python.
You can find detailed information about the official drivers in the Neo4j Driver Manual.
For more details on the protocol implementation, see the implementer's documentation.
(Note: Community drivers are available for other languages too, but these are not officially supported, so where possible, one should prefer the official drivers.)
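As a minimal sketch of what using an official driver looks like, here is a Python example that runs a colleagues-style query over Bolt. It assumes the `neo4j` package is installed and a database is reachable; the `name` properties, the `$company` parameter, and the helper names are hypothetical and only there for illustration.

```python
COLLEAGUES_QUERY = """
MATCH (p1:Person)-[:Works_at]->(c:Company)<-[:Works_at]-(p2:Person)
WHERE c.name = $company
RETURN p1.name AS person, p2.name AS colleague
"""


def rows_from_records(records) -> list:
    """Turn driver records into plain dicts via Record.data()."""
    return [record.data() for record in records]


def fetch_colleagues(uri: str, user: str, password: str, company: str) -> list:
    """Open a Bolt connection, run the query, and return the rows as dicts."""
    # Imported lazily so the helpers above can be used without the driver installed.
    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            # Keyword arguments to run() are passed as query parameters.
            result = session.run(COLLEAGUES_QUERY, company=company)
            return rows_from_records(result)
    finally:
        driver.close()


# Example call (needs a running database; the values are placeholders):
# rows = fetch_colleagues("bolt://localhost:7687", "neo4j", "<password>", "Acme")
```

Passing values as query parameters rather than formatting them into the query string lets the database reuse the query plan and avoids injection problems.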
Road Plan/How We Did It
- Import the data from the databases and create a Graph Database in Neo4j
- Develop and tune recommendation queries
- Automate and execute those queries in Python, generating time-consuming recommendations, such as collaborative filtering, within a day by using multi-threading to use all the nodes
- Develop a Django app and add the queries that take less time to the backend itself.
- Add a pixel-tracking algorithm to the web framework to capture events and update them in real-time.
In our case of building a recommender system with Neo4j, we loaded the data from the database into CSV format using the ETL tool, then created the nodes, relationships, and properties.
Our team then began developing recommendation algorithms and writing Cypher queries for them.
We learned that queries with more filters or mathematical operations tend to take longer, so we tried to use all available nodes while optimizing each node's RAM through the heap size setting, then used Python to run the queries and return a list of CSVs for the various recommendations.
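The batching and multi-threading step can be sketched roughly as follows. This is an illustration under assumptions, not our production code: `recommend_batch` is a hypothetical stand-in for the Cypher query that would be sent for one batch of customers, and the batch size and worker count are arbitrary. How queries are spread across cluster nodes is left out.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def chunks(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def recommend_batch(customer_ids: list) -> list:
    """Hypothetical stand-in for one query covering a batch of customers."""
    return [(cid, f"product-for-{cid}") for cid in customer_ids]


def run_all(customer_ids: list, batch_size: int = 1000, workers: int = 4) -> list:
    """Run the batches on a thread pool and collect the rows in batch order."""
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order even though batches run concurrently.
        for batch_rows in pool.map(recommend_batch, chunks(customer_ids, batch_size)):
            rows.extend(batch_rows)
    return rows


def to_csv(rows: list) -> str:
    """Serialize (customer_id, product_id) rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "product_id"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = run_all(list(range(1, 2501)), batch_size=1000, workers=3)
print(len(rows))  # 2500
print(to_csv(rows[:2]))
```

Threads suit this kind of job because each worker spends most of its time waiting on the database rather than computing in Python.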
Please do recall from the first part of this blog post that Neo4j's basic structure is built on "Nodes", "Relationships", and "Properties".
A quick recap: A Node is a data record in a graph database, and a Relationship is an element that connects two nodes of a graph. A relationship creates a pattern that associates two chunks of information and defines a flow by assigning a direction to it.
We can add pieces of information to our nodes and relationships by assigning properties to them.
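To picture these three building blocks, here is a small Python sketch that models a node, a directed relationship, and their properties as plain data classes. This is only a mental model, not how Neo4j stores data; the `Customer`, `Product`, and `BOUGHT` names and the property values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node  # the relationship points from start ...
    type: str
    end: Node    # ... to end, which gives it a direction
    properties: dict = field(default_factory=dict)


def pattern(rel: Relationship) -> str:
    """Render the relationship as a Cypher-style pattern string."""
    return f"(:{rel.start.label})-[:{rel.type}]->(:{rel.end.label})"


customer = Node("Customer", {"name": "Asha"})
product = Node("Product", {"title": "Running shoes"})
bought = Relationship(customer, "BOUGHT", product, {"quantity": 2})

print(pattern(bought))  # (:Customer)-[:BOUGHT]->(:Product)
print(bought.properties["quantity"])  # 2
```

Both nodes and relationships carry their own property maps, which is why extra details such as a purchase quantity can live on the relationship itself.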
Now that we are all set, you may recall we had spoken of at least three ways we can use this Neo4j-based recommendation engine for customers of our e-commerce firm in Part One of this post:
- Recommendation based on a customer's recent history
- Recommendation based on a product similar to the product a customer was surfing for in real-time
- User-user collaborative filtering-based recommendation
Each of the above has been extensively explained in part one of this blog post in theory, so all you now have to do is use your newly built recommender system on the Neo4j platform to generate recommendations for shoppers at the fictitious e-commerce firm.
Use Cases of Recommendation Engine
To get a better idea of Neo4j's versatility in terms of graph database applications, let us look at its use cases:
Movie Recommendation Engine:
Express Analytics used Neo4j in making a movie recommendation engine. We used Neo4j to store and load the movie details instead of a CSV file.
Because of Neo4j's fast retrieval, the EA team used it to extract data, then trained a machine learning model on that data to speed up the extraction process.
After generating the recommendations, our team stored them in the respective movie nodes. When displaying the recommendation on the website, it used Neo4j to accelerate extraction and displayed the corresponding recommendation for the given movie.
Thus, Neo4j makes it easy to create an item-to-item collaborative model. Thanks to the graph schema and Cypher queries on the user nodes, it is easy to find the users who watched a given movie and recommend the other movies they also watched.
This model was integrated into the movie recommendation website (client-side). It covered movies from all over the world, with recommendations tailored to the user's taste.
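The "users who watched this also watched" idea behind such an item-to-item model can be sketched in a few lines of Python. This is a simplified co-occurrence count on toy data, not the model we deployed; the function name and the viewing histories are made up for the example.

```python
from collections import Counter


def also_watched(history: dict, movie: str, top_n: int = 2) -> list:
    """Movies most often watched by the users who watched `movie`."""
    counts = Counter()
    for watched in history.values():
        if movie in watched:
            counts.update(watched - {movie})
    # Most co-watched first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


# Hypothetical viewing histories: user -> set of watched movies.
history = {
    "u1": {"Alien", "Blade Runner", "Dune"},
    "u2": {"Alien", "Blade Runner"},
    "u3": {"Alien", "Dune", "Heat"},
    "u4": {"Heat"},
}
print(also_watched(history, "Alien"))  # ['Blade Runner', 'Dune']
```

In a graph database the same question is a two-hop traversal: from the movie to the users who watched it, then from those users to their other movies.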
Telecommunication:
In the telecommunication field, Neo4j, by its own account, is used by leading telecommunication companies, including Telenor.
Telenor manages subscriptions and user access to its business mobile subscription via its online self-service management portal.
However, due to the large number of users and their expectations for real-time responses from Telenor's online system, they found that their SQL queries were not performing sufficiently. To tackle this, the company used Neo4j to serve the extensive database.
In the end, Telenor set up its database for both corporate and residential customers because of Neo4j's high speed and ease of data access and maintenance.
Governance:
Given the sheer volume of data, any government body needs a tool to identify data connections across databases and departments, helping solve problems such as preventing crime, improving fiscal responsibility, and providing transparency in its operations.
For example, the US Army has used Neo4j to maintain its equipment database due to the sheer volume of equipment it has. Earlier, they used a mainframe-based system to manage the data, but as data volumes increased and data types changed, maintaining it became difficult.
The data was stored in Neo4j as nodes, properties, and relationships. By querying the database, they could identify which equipment needed maintenance on a priority basis.
