Week 9 - Import data from a relational database into Neo4j

In this lab, we'll learn how to import data from a relational database into Neo4j. By the end of this session, you should be able to design and create your graph for Project 2.

1. Dataset

Week 9 Lab Dataset (3 CSV files)

It is a star schema of a sales warehouse with one fact table and two dimension tables:

Star Schema

2. Design and ETL

2.1 Graph Database Design

Step 1: Design a graph database schema using the Arrows Tool that reflects the warehouse's star schema. Think about the general principles that

  • Each table contains a type of node.

  • Each row in a table is a node.

  • The foreign key and the primary key relations are relationships.

Your schema might look like this:

Graph Schema

Relationship labels and property names are self-designed based on the data. Feel free to use different names.

2.2. Prepare CSV Files Nodes

Let's implement it in a Python dataframe.

import pandas as pd
# load CSV files into pandas DataFrames. 
# Remember to change the data path
sales_table = pd.read_csv('./data/sales.csv')
products_table = pd.read_csv('./data/products.csv')
shops_table = pd.read_csv('./data/shops.csv')

# use ".head()" to review your tables
# sales_table.head()
# products_table.head()
# shops_table.head()

After comparing the graph schema (Figure 1) with the data, we can find that "shop" nodes come from the "shops_table" and "product" nodes come from the "products_table". However, the "sale" nodes require some ETL processing:

# Keeping only specified columns using loc method
sales_node_table = sales_table.loc[:, ['Sale ID', 'Date', 'Quantity', 'Unit price']]
# export to csv
sales_node_table.to_csv('./data/sales_node.csv', index=False)

Now, we have all CSV files for nodes. Let's import them into Neo4j.

2.3. Import CSV Files

The screenshots below are from Neo4j Desktop. If you're using the cloud version, the steps are similar.

2.3.1. Create a project

2.3.2. Local DBMS Configuration

Once the project is created, rename it to "week9" or other names of your choice. Add a local DBMS and configure the name and password:

2.3.3. Import CSV Files

2.3.4. Start the Database

After clicking "import", a folder should pop up. Copy your "sales_node.csv", "products.csv", and "shops.csv" into this folder. Close the pop-up folder and then click the "start" button:

2.3.5. Open Database

Once the database status changes to "ACTIVE", click "Open".

2.3.6. Create Nodes

Let's start by creating "Shop" nodes. Write the code below and click the "run" button on the right. The variable name "row" represents each row in the imported CSV file, in this case, "shops.csv". You can rename "row" to another variable name if you prefer. Use "row.HEADER_NAME" to refer to each header. For instance, row: "Shop ID" refers to the "Shop ID" column in "shops.csv". The HEADER_NAME must exactly match the CSV header, including case sensitivity. The properties before the ":" are the ones you've named in your schema design (Figure 1).

LOAD CSV WITH HEADERS FROM 'file:///shops.csv' AS row
CREATE (d:Shop {
    shopID: row.`Shop ID`,
    city: row.City,
    stateOrRegion: row.`State or region`,
    country: row.Country,
    shopSize: row.`Shop size`
})

Click on the icon in red, and you should be able to see the "Shop" under "Node labels". Click on "Shop" under "Node labels" to explore further.

Follow a similar process for creating "Product" and "Sale" nodes:

LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
CREATE (d:Product {
    productID: row.`Product`,
    category: row.Category,
    subcategory: row.`Sub category`,
    productSize: row.Size,
    purchasePrice: row.`Purchase price`,
    color: row.Color,
    brand:row.Brand
})
LOAD CSV WITH HEADERS FROM 'file:///sales_node.csv' AS row
CREATE (d:Sale {
    saleID: row.`Sale ID`,
    date: row.Date,
    quantity: row.Quantity,
    unitPrice:row.`Unit price`
})

Your turn: Assign different colours to each label. You can refer to the Week 8 lab sheet for guidance. For example:

2.4. Prepare CSV files for Relationships

2.4.1. Create Relationships

Based on the graph schema (Figure 1), we would like to create the relationship "CONTAINS" from "Sale" to "Product", "SOLD_IN" from "Sale" to "Shop":

# as you can see, the relationship between "Sale" and "Product" can obtain from "sales_table"
contain_table = sales_table.loc[:, ['Sale ID', 'Product']]
contain_table.to_csv('./data/rel_contains.csv', index=False)
# as you can see, the relationship between "Sale" and "Shop" can also obtain from "sales_table"
sold_in_table = sales_table.loc[:, ['Sale ID', 'Shop']]
sold_in_table.to_csv('./data/rel_sold_in.csv', index=False)

2.4.2. Import CSV Files and Create Relationships

Copy "rel_sold_in.csv" and "rel_contains.csv" into the "import" folder (as in 2.3.3) and run the following code. "->" indicates the direction of the relationship. If you want to create an undirected relationship, use CREATE (s)-[:CONTAINS]-(p).

LOAD CSV WITH HEADERS FROM 'file:///rel_contains.csv' AS row
MATCH (s:Sale {saleID: row.`Sale ID`}),
(p:Product { productID: row.Product})
CREATE (s)-[:CONTAINS]->(p)
LOAD CSV WITH HEADERS FROM 'file:///rel_sold_in.csv' AS row
MATCH (s:Sale {saleID: row.`Sale ID`}),
(p:Shop { shopID: row.Shop})
CREATE (s)-[:SOLD_IN]->(p)

Now that you've created your graph database from a relation database, you might wonder if you can have additional node labels. For instance, separating "country" from "shop" or "category" and "subcategory" from "product". The answer is "Yes". There's no fixed answer in design; it depends on how you intend to use your graph.

Next week, we'll explore querying graph databases using Cypher.

Last updated