Week 12 - Graph Data Science
Last updated
Last updated
In the lecture, we saw the basic pseudocode for BFS and DFS, which might have some limitations, such as many duplicated searches and the inability to handle infinite loops. There are many variations of search algorithms.
Your Turn:
What is the difference between the above DFS and the DFS in lecture notes?
Test the above DFS algorithm on the graph provided in Slide 45 (Example: DFS):
Secnoria: Siri is currently in Arad and she wants to drive to Bucharest.
Your Turn:
What is the solution for BFS?
What is the solution for A*? The heuristic function is the straight-line distance to Bucharest.
Now, let's play with the search functions in Neo4j. In the lecture, the demo graph version is 5.3.0 with Graph Data Science Library is 2.4.6.
The dataset we are using is a database of countries. Each country has a latitude
and longitude
, and are connected by a "road" with a distance property
(just like the transport graph from the lectures, though this one is ROAD
rather than EROAD
and is much larger).
countries.csv
: Each row represents a country, with a country_code
, latitude
, longitude
, and name
.
roads.csv
: Each row connects country_1
with country_2
and stores the distance
between them (in km).
The dataset is sourced from both Google and Wikipedia:
Please note the data may not be completely accurate and is just for demonstration purposes.
First, create a new database for this week in Neo4j. Once you've created it, you'll need to activate the Graph Data Science (GDS) plugin. To do this, click on the database in the list of databases, and in the panel that appears on the right click on "Plugins". The Graph Data Science library should be in the plugins list - click on it, then click "Install".
Before we run any graph data science algorithms, we need to load our data into Neo4j. Remember, you'll need to place the csv files into the import
directory of your Neo4j database.
Next, you'll need to write LOAD CSV
statements to import both countries and roads into your database. Here is one for countries to get you started:
Note how we have converted row.latitude
and row.longitude
to floats - this is important, otherwise Neo4j will treat them as strings.
Your turn: write a LOAD CSV
statement that imports the roads into your database. Remember, the Countries have already been created, so after LOAD CSV
you'll need to run a MATCH
statement to find the countries listed in each row of roads.csv
, and only CREATE
a road between those two countries. Don't forget the distance
property, which should also be a float.
Before we can run any GDS algorithms, we must first project our graph for the Graph Data Science library. Here is the syntax for the projection:
The first argument, "Countries", is the name of our GDS graph.
The second argument, "Country", is the name of the node label types we want to project.
The third argument specifies options for the relationships. In this case, we want the ROAD
relationship to be undirected (as roads are two-way).
The fourth argument specifies the properties that we want to include in our GDS graph. We need the distance
property from the road relationships, and the latitude
and longitude
from the countries, for the pathfinding algorithms.
Note: If you ever need to delete your projection and start over, you can run:
Let's try running the Breadth First Search and Depth First Search algorithms on our data.
Here is an example BFS query:
Try running this query in Neo4j. If the resulting graph visualisation looks a bit messy because of the roads between countries, uncheck the connect result nodes
option in the Options list and run the query again. You should see multiple Country nodes connected by an orange relationship called NEXT
.
Your turn: change the query to be a depth-first search (DFS) and run it. Are the results the same, or different - and why?
Your turn: try running the A* source-target shortest path algorithm to find the shortest path between Spain and India. You can see the syntax in the lecture slides or read the following documentation page:
Now try deleting the relationshipWeightProperty: "distance"
line in the query and run the algorithm again. Has the shortest path changed now that the algorithm is fully relying on a heuristic?
The final algorithm we'll be running today is Minimum Spanning Tree (MST), which starts from a given node and finds all its reachable nodes and the set of relationships that connect the nodes together with the minimum possible weight.
Here's the syntax for MST taken from Neo4j documentation:
Your turn: Create relationships of the type MINST
between Canada and every country reachable from Canada. Write a Cypher query to display the minimum spanning tree. Hint: the relationship type is MINST
, not ROAD
!.
Country data was sourced from . Some minor preprocessing was done on some of the country names to convert them to the same names in Wikipedia.
Roads data was generated via a Python script. To do this we obtained a list of bordering countries from , did some cleaning/preprocessing etc, and used the geopy
Python library to calculate the geodesic distance between each bordering country.