Week 12 - Graph Data Science

We recommend attempting to install Neo4j Desktop to do this lab if possible, as it makes things a bit easier. If you are not able to use Neo4j Desktop you can try

https://sandbox.neo4j.com/

and create a Graph Data Science sandbox. Please be aware that this sandbox will expire in 3 days after creation.

A. Search Algorithms

Scenario: Siri is currently in Arad, and she wants to drive to Bucharest.

B. Neo4j

Now, let's play with the search functions in Neo4j. In the lecture, the demo graph version is 5.3.0 with Graph Data Science Library is 2.4.6.

Dataset

The dataset we are using is a database of countries. Each country has a latitude and longitude, and are connected by a "road" with a distance property (just like the transport graph from the lectures, though this one is ROAD rather than EROAD and is much larger).

  • countries.csv: Each row represents a country, with a country_code, latitude, longitude, and name.

  • roads.csv: Each row connects country_1 with country_2 and stores the distance between them (in km).

The dataset is sourced from both Google and Wikipedia:

  • Country data was sourced from Google. Some minor preprocessing was done on some of the country names to convert them to the same names in Wikipedia.

  • Roads data was generated via a Python script. To do this we obtained a list of bordering countries from Wikipedia, did some cleaning/preprocessing etc, and used the geopy Python library to calculate the geodesic distance between each bordering country.

Please note the data may not be completely accurate and is just for demonstration purposes.

Task 1. Creating a graph DB and enabling Graph Data Science

First, create a new database for this week in Neo4j. Once you've created it, you'll need to activate the Graph Data Science (GDS) plugin. To do this, click on the database in the list of databases, and in the panel that appears on the right click on "Plugins". The Graph Data Science library should be in the plugins list - click on it, then click "Install".

Task 2. Exploring the data and loading it into Neo4j

Before we run any graph data science algorithms, we need to load our data into Neo4j. Remember, you'll need to place the csv files into the import directory of your Neo4j database.

Next, you'll need to write LOAD CSV statements to import both countries and roads into your database. Here is one for countries to get you started:

// LOAD DATA FROM CSV
LOAD CSV WITH HEADERS FROM "file:///countries.csv" AS row
CREATE (c:Country)
SET c.country = row.country,
    c.name = row.name,
    c.latitude = toFloat(row.latitude),
    c.longitude = toFloat(row.longitude)

Note how we have converted row.latitude and row.longitude to floats - this is important, otherwise Neo4j will treat them as strings.

Task 3. Projecting our graph for GDS

Before we can run any GDS algorithms, we must first project our graph for the Graph Data Science library. Here is the syntax for the projection:

// PROJECT GRAPH FOR GDS
CALL gds.graph.project("Countries", "Country",
    { ROAD: {orientation: "UNDIRECTED"} },
    { relationshipProperties: "distance",
    nodeProperties: ["latitude", "longitude"] }
    )
  • The first argument, "Countries", is the name of our GDS graph.

  • The second argument, "Country", is the name of the node label types we want to project.

  • The third argument specifies options for the relationships. In this case, we want the ROAD relationship to be undirected (as roads are two-way).

  • The fourth argument specifies the properties that we want to include in our GDS graph. We need the distance property from the road relationships, and the latitude and longitude from the countries, for the pathfinding algorithms.

Note: If you ever need to delete your projection and start over, you can run:

// DELETE GRAPH
CALL gds.graph.drop("Countries")

Task 4: Running Graph Data Science Algorithms

Let's try running the Breadth First Search and Depth First Search algorithms on our data.

Here is an example BFS query:

// BFS
MATCH (source: Country {name: "Spain"})
CALL gds.bfs.stream("Countries", {
    sourceNode: source
    })
YIELD path
RETURN path

Try running this query in Neo4j. If the resulting graph visualisation looks a bit messy because of the roads between countries, uncheck the connect result nodes option in the Options list and run the query again. You should see multiple Country nodes connected by an orange relationship called NEXT.

4.2 A* Source-target Shortest Path

Now try deleting the relationshipWeightProperty: "distance" line in the query and run the algorithm again. Has the shortest path changed now that the algorithm is fully relying on a heuristic?

4.3 Minimum Spanning Tree

The final algorithm we'll be running today is Minimum Spanning Tree (MST), which starts from a given node and finds all its reachable nodes and the set of relationships that connect the nodes together with the minimum possible weight.

Here's the syntax for MST taken from Neo4j documentation:

// Mimum Spanning Tree
MATCH (n:Place{id: 'D'})
CALL gds.beta.spanningTree.write('graph', {
  startNode: id(n),
  relationshipWeightProperty: 'cost',
  writeRelationshipType: 'MINST',
  writeProperty: 'writeCost'
})
YIELD preProcessingMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN preProcessingMillis,computeMillis, writeMillis, effectiveNodeCount;

Last updated