Graph Data Science with Neo4j

Graph Data Science with Neo4j: Learn how to use Neo4j 5 with Graph Data Science library 2.0 and its Python driver for your project

By Estelle Scifo

Paperback Jan 2023 288 pages 1st Edition

Graph Data Science with Neo4j

Introducing and Installing Neo4j

Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years. They provide a natural way of modeling entities and relationships, and they take into account the context of each observation, which is often crucial to extract the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. Many tools have been developed, by the company itself or by the community, to make the whole ecosystem consistent and easy to use: from storage to querying to visualization to graph data science. As you will see throughout this book, there is a well-integrated application or plugin for each of these topics.

In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases. We will also introduce the aforementioned plugins that are used for graph data science.

Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your first Cypher queries to populate the database with some data and retrieve it.

In this chapter, we’re going to cover the following main topics:

  • What is a graph database?
  • Finding or creating a graph database
  • Neo4j in the graph databases landscape
  • Setting up Neo4j
  • Inserting data into Neo4j with Cypher, the Neo4j query language
  • Extracting data from Neo4j with Cypher pattern matching

Technical requirements

To follow this chapter well, you will need access to the following resources:

What is a graph database?

Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how it differs from the data storage engines you are used to. In this section, we are going to quickly discuss the different types of databases you can find today, and why graph databases are so interesting and popular among both developers and data professionals.

Databases

Databases make up an important part of computer science. Discussing the evolution and state of the art of the different implementations in detail would require several books like this one – fortunately, that is not a requirement for using such systems effectively. However, it is important to be aware of the existing data storage tools and how they differ from each other, so that you can choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science projects doesn’t mean you will have to use them every single time you start a new project, whatever the context. Sometimes, they won’t be suitable; this introduction will explain why.

A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.

As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

  • Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows represent each entity. They have a predefined schema, defining how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when attributes are required or when we can have many relationships between the same objects, an intermediate junction (join) table is required.
  • NoSQL databases contain many different types of databases:
    • Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string and the value is a more complex object that can’t be used to filter the query – it can only be retrieved. KV stores are known to be very efficient for caching in a web context, where the key is the page URL and the value is the dynamically generated HTML content of the page. KV stores can also be used to model graphs when building a native graph engine is not an option.
    • Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents (usually JSON (or a derivative) objects). They are much more flexible compared to relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely a lot on nested JSON and information duplication instead of joining multiple tables.

The preceding list is non-exhaustive; other types of data stores have been created over time – some have been abandoned, while others were born only in the past few years and have yet to prove how useful they can be. We can mention, for instance, vector databases such as Weaviate, which store data together with a vector representation to ease searching in the vector space – with many applications in machine learning, once a vector representation (embedding) of an observation has been computed.

Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data model phase.

Graph database

In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.

A graph is a mathematical object defined by the following:

  • A set of vertices or nodes (the dots)
  • A set of edges (the connections between these dots)

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs


As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:

  • Time series: Each observation is connected to the next one
  • Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
  • Texts: Here, each word is connected to its surrounding words or a more complex mapping, depending on its meaning (see the following figure):
Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques


A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.

Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, especially the shortest path in a graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
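The shortest-path idea mentioned above can be illustrated with a few lines of plain Python (a minimal sketch, not part of the book's code): a breadth-first search over an adjacency dictionary returns a shortest path between two nodes, which is the basic traversal that graph engines optimize.

```python
from collections import deque

def shortest_path(adjacency, start, end):
    """Breadth-first search over an adjacency dict; returns a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # start and end are not connected

# A small undirected network (each edge listed in both directions)
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

Dedicated graph databases and libraries implement far more efficient versions of this kind of traversal, but the principle – hopping from node to node along edges – is the same.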

So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.

At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common, which might be confusing to some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.

Finding or creating a graph database

Data scientists know how to find or generate datasets that fit their needs. Randomly generating a variable distribution while following some probabilistic law is one of the first things you’ll learn in a statistics course. Similarly, graph datasets can be randomly generated, following some rules. However, this book is not a graph theory book, so we are not going to dig into these details here. Just be aware that this can be done. Please refer to the references in the Further reading section to learn more.

Regarding existing datasets, some of them are very popular and data scientists know about them because they have used them while learning data science and/or because they are the topic of well-known Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.

The same holds for graph data science, where some datasets are very common for teaching or benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note: there is even a ZKC trophy, which is awarded to the first person at a graph conference who mentions this dataset). The ZKC dataset is very simple (34 nodes; see Chapter 3, Characterizing a Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, on how to characterize a graph dataset), but bigger and more complex datasets are also available.

There are websites referencing graph datasets, which can be used for benchmarking in a research context or educational purpose, such as this book. Two of the most popular ones are the following:

  • The Stanford Network Analysis Project (SNAP) (https://snap.stanford.edu/data/index.html) lists different types of networks in different categories (social networks, citation networks, and so on)
  • The Network Repository Project, via its website at https://networkrepository.com/index.php, provides hundreds of graph datasets from real-world examples, classified into categories (for example, biology, economics, recommendations, road, and so on)

If you browse these websites and start downloading some of the files, you’ll notice the data comes in unfamiliar formats. We’re going to list some of them next.

A note about the graph dataset’s format

The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with nodes on one side and edges on the other, several dedicated formats have been devised.

The main data formats that are used to save graph data as text files are the following:

  • Edge list: This is a text file where each row contains an edge definition. For instance, a graph with three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:
    A B
    C A
  • Matrix Market (with the .mtx extension): This format is an extension of the previous one. It is quite frequent on the network repository website.
  • Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in the graph) whose (i, j) element is 1 if nodes i and j are connected by an edge and 0 otherwise. The adjacency matrix of the simple directed graph with three nodes and two edges is a 3x3 matrix, as shown in the following code block. The row and column names are displayed only for convenience, to help you identify what i and j are:
      A B C
    A 0 1 0
    B 0 0 0
    C 1 0 0

Note

The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning.

  • GraphML: Derived from XML, the GraphML format is much more verbose but lets us define more complex graphs, especially those where nodes and/or edges carry properties. The following example uses the preceding graph but adds a name property to nodes and a length property to edges:
    <?xml version='1.0' encoding='utf-8'?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
    >
        <!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
        <key attr.name="name" attr.type="string" for="node" id="d1"/>
        <key attr.name="length" attr.type="double" for="edge" id="d2"/>
        <graph edgedefault="directed">
           <!-- DEFINING NODES -->
           <node id="A">
                 <!-- SETTING NODE PROPERTY -->
                <data key="d1">"Point A"</data>
            </node>
            <node id="B">
                <data key="d1">"Point B"</data>
            </node>
            <node id="C">
                <data key="d1">"Point C"</data>
            </node>
            <!-- DEFINING EDGES
            with source and target nodes and properties
        -->
            <edge id="AB" source="A" target="B">
                <data key="d2">123.45</data>
            </edge>
            <edge id="CA" source="C" target="A">
                <data key="d2">56.78</data>
            </edge>
        </graph>
    </graphml>
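To make the first two formats concrete, here is a minimal, dependency-free Python sketch (not part of the book's code) that parses the two-edge edgelist shown earlier and derives the corresponding directed adjacency matrix:

```python
# The two-edge graph from the edgelist example above
edgelist_text = "A B\nC A"

edges = [line.split() for line in edgelist_text.splitlines()]

# Collect node identifiers in order of first appearance
nodes = []
for source, target in edges:
    for n in (source, target):
        if n not in nodes:
            nodes.append(n)

# Build the directed adjacency matrix: entry (i, j) is 1 when edge i -> j exists
index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    matrix[index[source]][index[target]] = 1

for name, row in zip(nodes, matrix):
    print(name, row)
# A [0, 1, 0]
# B [0, 0, 0]
# C [1, 0, 0]
```

In practice, libraries such as networkx can read all of these formats directly, but seeing the conversion spelled out makes the relationship between the edge list and the adjacency matrix explicit.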

If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats. However, most of the time, you will want to use your own data, which is not yet in graph format – it might be stored in the previously described databases or CSV or JSON files. If that is the case, then the next section is for you! There, you will learn how to transform your data into a graph.

Modeling your data as a graph

The second answer to the main question in this section is: your data is probably a graph, without you being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph), but let me give you a quick overview.

Let’s take the example of an e-commerce website, which has customers (users) and products. As in every e-commerce website, users can place orders to buy some products. In the relational world, the data schema that’s traditionally used to represent such a scenario is represented on the left-hand side of the following screenshot:

Figure 1.3 – Modeling e-commerce data as a graph


The relational data model works as follows:

  • A table is created to store users, with a unique identifier (id) and a username (apart from security and personal information required for such a website, you can easily imagine how to add columns to this table).
  • Another table contains the data about the available products.
  • Each time a customer places an order, a new row is added to an order table, referencing the user by its ID (a foreign key with a one-to-many relationship, where a user can place many orders).
  • To remember which products were part of which orders, a many-to-many relationship is created (an order contains many products and a product is part of many orders). We usually create a relationship table, linking orders to products (the order product table, in our example).

Note

Please refer to the colored version of the preceding figure, which can be found in the graphics bundle link provided in the Preface, for a better understanding of the correspondence between the two sides of the figure.

In a graph database, all the _id columns are replaced by actual relationships, which are first-class entities in graph databases, not just conceptual ones as in the relational model. You can also get rid of the order product table, since information specific to a product in a given order, such as the ordered quantity, can be stored directly in the relationship between the order and the product nodes. The data model is much more natural and easier to document and present to other people on your team.
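As a sketch of what this can look like in Cypher (the labels and property names here are illustrative, not the exact model from Figure 1.3), the order product join table collapses into a relationship that carries the quantity:

```cypher
// Relational: order_product(order_id, product_id, quantity)
// Graph: the quantity lives on the relationship itself
CREATE (u:User {name: "Alice"})
CREATE (p:Product {name: "Laptop"})
CREATE (o:Order {placedAt: "2023-01-15"})
CREATE (u)-[:ORDERED]->(o)
CREATE (o)-[:CONTAINS {quantity: 2}]->(p)
```

Cypher syntax itself is introduced later in this chapter; the point here is only that the foreign keys and the junction table disappear from the model.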

Now that we have a better understanding of what a graph database is, let’s explore the different implementations out there. Like the other types of databases, there is no single implementation for graph databases, and several projects provide graph database functionalities.

In the next section, we are going to discuss some of the differences between them, and where Neo4j is positioned in this technology landscape.

Neo4j in the graph databases landscape

Even when restricting the scope to graph databases, there are still different ways to envision such data stores:

  • Resource Description Framework (RDF): Each record is a triplet of the Subject Predicate Object type – a vocabulary that expresses a relationship of a certain type (the predicate) between a subject and an object; for instance:
    Alice(Subject) KNOWS(Predicate) Bob(Object)

Very famous knowledge bases such as DBpedia and Wikidata use the RDF format. We will talk about this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).

  • Labeled-property graph (LPG): A labeled-property graph contains nodes and relationships. Both of these entities can be labeled (for instance, Alice and Bob are nodes with the Person label, and the relationship between them has the KNOWS label) and have properties (people have names; an acquaintance relationship can contain the date when both people first met as a property).

Neo4j is a labeled-property graph database. And just as MySQL, PostgreSQL, and Microsoft SQL Server are all relational databases, you will find different vendors offering LPG graph databases. They differ in many aspects:

  • Whether they use a native graph engine or not: As we discussed earlier, it is possible to use a KV store or even a SQL database to store graph data. In that case, we talk about non-native storage engines, since the storage does not reflect the graph nature of the data.
  • The query language: Unlike SQL, the query language to deal with graph data has not yet been standardized, even if there is an ongoing effort being led by the GQL group (see, for instance, https://gql.today/). Neo4j uses Cypher, a declarative query language developed by the company in 2011 and then open-sourced in the openCypher project, allowing other databases to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors have created their own languages (AQL for ArangoDB or CQL for TigerGraph, for instance). To me, this is a key point to take into account since the learning curve can be very different from one language to another. Cypher has the advantage of being very intuitive and a few minutes are enough to write your own queries without much effort.
  • Their (integrated or not) support for graph analytics and data science.

A note about performance

Almost every vendor claims to be the best one, at least in some respects. This book won’t add to that debate. The best option, if performance is crucial for your application, is to test the candidates with a scenario close to your final goal in terms of data volume and the type of queries/analysis.

Neo4j ecosystem

The Neo4j database is already very helpful by itself, but the number of extensions, libraries, and applications related to it makes it one of the most complete solutions. In addition, it has a very active community whose members are always keen to help each other, which is one more reason to choose it.

The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the database and Cypher capabilities. We will use it later in this book to load JSON data.

The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the Graph Algorithm Library, was first released in 2018 by the Neo4j lab team. It was quickly replaced by the Graph Data Science Library, a fully production-ready plugin, with improved performance. Algorithms are improved and added regularly. Version 2.0, released in 2021, takes graph data science even further, allowing us to train models and build analysis pipelines directly from the library. It also comes with a handy Python client, which is very convenient for including graph algorithms into your usual machine learning processes, whether you use scikit-learn or other machine learning libraries such as TensorFlow or PyTorch.

Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your installed plugins and applications.

Applications installed into Neo4j Desktop are granted access to your active database. While reading this book, you will use the following:

  • Neo4j Browser: A simple but powerful application that lets you write Cypher queries and visualize the result as a graph, table, or JSON:
Figure 1.4 – Neo4j Browser


  • Neo4j Bloom: A graph visualization application in which you can customize node styles (size, color, and so on) based on their labels and/or properties:
Figure 1.5 – Neo4j Bloom


  • Neodash: This is a dashboard application that allows us to draw plots from the data stored in Neo4j, without having to extract this data into a DataFrame first. Plots can be organized into nice dashboards that can be shared with other users:
Figure 1.6 – Neodash


This list of applications is non-exhaustive. You can find out more here: https://install.graphapp.io/.

Good to know

You can create your own graph application to be run within Neo4j Desktop. This is why there are so many diverse applications, some of which are being developed by community members or Neo4j partners.

This section described Neo4j as a database and the various extensions that can be added to it to make it more powerful. Now, it is time to start using it. In the following section, you are going to install Neo4j locally on your computer so that you can run the code examples provided in this book (which you are highly encouraged to do!).

Setting up Neo4j

There are several ways to use Neo4j:

  • Through short-lived sandboxes in the cloud, which are perfect for experimenting
  • Locally, with Neo4j Desktop
  • Locally, with Neo4j binaries
  • Locally, with Docker
  • In the cloud, with Neo4j Aura (free plan available) or Neo4j AuraDS

For the scope of this book, we will use the Neo4j Desktop option, since this application takes care of many things for us and we do not want to go into server management at this stage.

Downloading and starting Neo4j Desktop

The easiest way to use Neo4j on your local computer in the experimentation phase is to use the Neo4j Desktop application, which is available on Windows, Mac, and Linux. This user interface lets you create Neo4j databases (organized into projects), manage the installed plugins and applications, and update the DB configuration – among other things.

Installing it is super easy: go to the Neo4j download center and follow the instructions. We recap the steps here, with screenshots to guide you through the process:

  1. Visit the Neo4j download center at https://neo4j.com/download-center/. At the time of writing, the website looks like this:
Figure 1.7 – Neo4j Download Center


  2. Click the Download Neo4j Desktop button at the top of the page.
  3. Fill in the form asking for some information about yourself (name, email, company, and so on).
  4. Click Download Desktop.
  5. Save the activation key that is displayed on the next page. It will look something like this (this one won’t work, so don’t copy it!):
    eyJhbGciOiJQUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6InN0ZWxsYTBvdWhAZ21haWwuY29tIiwibWl4cGFuZWxJZ CI6Imdvb2dsZS1vYXV0a
    ...
    ...

The following steps depend on your operating system:

  • On Windows, locate the installer, double-click on it, and follow the steps provided.
  • On Mac, just click on the downloaded file.
  • On Linux, you’ll have to make the downloaded file executable before running it. More instructions will be provided next.

For Linux users, here is how to proceed:

  1. When the download is over (this can take some time since the file is a few hundred MBs), open a Terminal and go to your download directory:
    # update path depending on your system
    $ cd Downloads/
  2. Then, run the following command, which will extract the version and architecture name from the AppImage file you’ve just downloaded:
    $ DESKTOP_VERSION=`ls -tr neo4j-desktop*.AppImage | tail -1 | grep -Po "(?<=neo4j-desktop-)[^AppImage]+"`
    $ echo ${DESKTOP_VERSION}
  3. If the preceding echo command shows something like 1.4.11-x86_64., you’re good to go. Alternatively, you can identify the pattern yourself and create the variable, like so:
    $ DESKTOP_VERSION=1.4.11-x86_64.  # include the final dot
  4. Then, you need to make the file executable with chmod and run the application:
    # make file executable:
    $ chmod +x neo4j-desktop-${DESKTOP_VERSION}AppImage
    # run the application:
    $ ./neo4j-desktop-${DESKTOP_VERSION}AppImage

The last command in the preceding code snippet starts the Neo4j Desktop application. The first time you run the application, it will ask you for the activation key you saved when downloading the executable. And that’s it – the application will be running, which means we can start creating Neo4j databases and interact with them.

Creating our first Neo4j database

Creating a new database with Neo4j Desktop is quite straightforward:

  1. Start the Neo4j Desktop application.
  2. Click on the Add button in the top-right corner of the screen.
  3. Select Local DBMS.

This process is illustrated in the following screenshot:

Figure 1.8 – Adding a new database with Neo4j Desktop


  4. The next step is to choose a name, a password, and the version of your database.

Note

Save the password in a safe place; you’ll need to provide it to drivers and applications when connecting to this database.

  5. It is good practice to always choose the latest available version; Neo4j Desktop takes care of checking which version that is. The following screenshot shows this step:
Figure 1.9 – Choosing a name, password, and version for your new database


  6. Next, just click Create, and wait for the database to be created. If the latest Neo4j version needs to be downloaded, this can take some time, depending on your connection.
  7. Finally, you can start your database by clicking on the Start button that appears when you hover over your new database name, as shown in the following screenshot:
Figure 1.10 – Starting your newly created database


Note

You can’t have two databases running at the same time. If you start a new database while another is still running, the previous one must be stopped before the new one can be started.

You now have Neo4j Desktop installed and a running instance of Neo4j on your local computer. At this point, you are ready to start playing with graph data. Before moving on, let me introduce Neo4j Aura, which is an alternative way to quickly get started with Neo4j.

Creating a database in the cloud – Neo4j Aura

Neo4j also has a DB-as-a-service component called Aura. It lets you create a Neo4j database hosted in the cloud (either on Google Cloud Platform or Amazon Web Services, your choice) and is fully managed – there’s no need to worry about updates anymore. This service is entirely free up to a certain database size (50k nodes and 150k relationships), which makes it sufficient for experimenting with it. To create a database in Neo4j Aura, visit https://neo4j.com/cloud/platform/aura-graph-database/.

The following screenshot shows an example of a Neo4j database running in the cloud thanks to the Aura service:

Figure 1.11 – Neo4j Aura dashboard with a free-tier instance


Clicking Explore opens Neo4j Bloom, which we will cover in Chapter 3, Characterizing a Graph Dataset, while clicking Query starts Neo4j Browser in a new tab. You’ll be requested to enter the connection information for your database. The URL can be found in the previous screenshot – the username and password are the ones you set when creating the instance.

In the rest of this book, examples will be provided using a local database managed with the Neo4j Desktop application, but you are free to use whatever technique you prefer. However, note that some minor changes are to be expected if you choose something different, such as directory location or plugin installation method. In the latter case, always refer to the plugin or application documentation to find out the proper instructions.

Now that our first database is ready, it is time to insert some data into it. For this, we will use our first Cypher queries.

Inserting data into Neo4j with Cypher, the Neo4j query language

Cypher, as we discussed at the beginning of this chapter, is the query language developed by Neo4j. It is also used by other graph database vendors, such as RedisGraph.

First, let’s create some nodes in our newly created database.

To do so, open Neo4j Browser by clicking on the Open button next to your database and selecting Neo4j Browser:

Figure 1.12 – Start the Neo4j Browser application from Neo4j Desktop


From there, you can start writing Cypher queries in the upper text area.

Let’s start by creating some nodes with the following Cypher query:

CREATE (:User {name: "Alice", birthPlace: "Paris"})
CREATE (:User {name: "Bob", birthPlace: "London"})
CREATE (:User {name: "Carol", birthPlace: "London"})
CREATE (:User {name: "Dave", birthPlace: "London"})
CREATE (:User {name: "Eve", birthPlace: "Rome"})

Before running the query, let me detail its syntax:

Figure 1.13 – Anatomy of a node creation Cypher statement

Note that all of these components except the parentheses are optional. You can create a node with no label and no properties with CREATE (), even though such an empty record wouldn’t be very useful for data storage purposes.
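For instance, all of the following statements are valid, going from an empty node to a fully specified one. The Person label and the Frank node are purely illustrative and not part of the chapter’s dataset:

// An empty node: valid, but rarely useful
CREATE ()
// A node with a label but no properties
CREATE (:User)
// A node with two labels and two properties
CREATE (:User:Person {name: "Frank", birthPlace: "Berlin"})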

Tips

You can copy and paste the preceding query and execute it as-is; multi-line queries are allowed by default in Neo4j Browser.

If the upper text area is not large enough, press the Esc key to maximize it.

Now that we’ve created some nodes and since we are dealing with a graph database, it is time to learn how to connect these nodes by creating edges, or, in Neo4j language, relationships.

The following code snippet starts by fetching the start and end nodes (Alice and Bob), then creates a relationship between them. The created relationship is of the KNOWS type and carries one property (the date Alice and Bob met):

MATCH (alice:User {name: "Alice"})
MATCH (bob:User {name: "Bob"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)

We could have also put all our CREATE statements into one big query, for instance, by adding aliases to the created nodes:

CREATE (alice:User {name: "Alice", birthPlace: "Paris"})
CREATE (bob:User {name: "Bob", birthPlace: "London"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)
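Bear in mind that CREATE always creates new data: running the preceding statements a second time would produce duplicate Alice and Bob nodes. As a sketch of how to make such scripts idempotent, the MERGE keyword matches an existing pattern and only creates it when it is absent:

MERGE (alice:User {name: "Alice"})
MERGE (bob:User {name: "Bob"})
MERGE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)

Note that MERGE matches the full pattern it is given, so the properties listed inside it (such as since) are part of the match.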

Note

In Neo4j, relationships are directed, meaning you have to specify a direction when creating them, which we can do thanks to the > symbol. However, Cypher lets you select data regardless of the relationship’s direction. We’ll discuss this when appropriate in the subsequent chapters.
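For instance, assuming the Alice-to-Bob relationship created earlier, the following query omits the > symbol and therefore matches KNOWS relationships in either direction; starting from Bob, it returns Alice even though the relationship was created from Alice to Bob:

MATCH (bob:User {name: "Bob"})-[:KNOWS]-(other:User)
RETURN other.name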

Inserting data into the database is one thing, but without the ability to query and retrieve this data, databases would be useless. In the next section, we are going to use Cypher’s powerful pattern matching to read data from Neo4j.

Extracting data from Neo4j with Cypher pattern matching

So far, we have put some data in Neo4j and explored it with Neo4j Browser. But unsurprisingly, Cypher also lets you select and return data programmatically. This is what is called pattern matching in the context of graphs.

Let’s analyze such a pattern:

MATCH (usr:User {birthPlace: "London"})
RETURN usr.name, usr.birthPlace

Here, we are selecting nodes with the User label while filtering for nodes whose birthPlace equals London. The RETURN statement asks Neo4j to return only the name and birthPlace properties of the matched nodes. The result of the preceding query, based on the data created earlier, is as follows:

╒══════════╤════════════════╕
│"usr.name"│"usr.birthPlace"│
╞══════════╪════════════════╡
│"Bob"     │"London"        │
├──────────┼────────────────┤
│"Carol"   │"London"        │
├──────────┼────────────────┤
│"Dave"    │"London"        │
└──────────┴────────────────┘
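The inline property syntax used here is shorthand for a WHERE clause. The following query is equivalent, and the WHERE form becomes necessary for any condition other than an equality test:

MATCH (usr:User)
WHERE usr.birthPlace = "London"
RETURN usr.name, usr.birthPlace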

This is a simple MATCH statement, but most of the time, you’ll need to traverse the graph somehow to explore relationships. This is where Cypher is very convenient. You can write queries with an easy-to-remember syntax, close to the one you would use when drafting your query on a piece of paper. As an example, let’s find the users known by Alice, and return their names:

MATCH (alice:User {name: "Alice"})-[:KNOWS]->(u:User)
RETURN u.name

The (alice)-[:KNOWS]->(u:User) part of the preceding query is a graph traversal. From the node(s) matching the User label with the name Alice, we are traversing the graph toward another node through a relationship of the KNOWS type. In our toy dataset, there is only one matching node, Bob, since Alice is connected to a single relationship of this type.
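Such patterns extend naturally to longer paths. As a sketch, the following variable-length pattern matches users reachable from Alice through one or two KNOWS hops; with our five-node dataset, it still only returns Bob, unless you have added more relationships:

MATCH (alice:User {name: "Alice"})-[:KNOWS*1..2]->(u:User)
RETURN DISTINCT u.name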

Note

In our example, we are using a single-node label and relationship type. You are encouraged to experiment by adding more data types. For instance, create some nodes with the Product label and relationships of the SELLS/BUYS type between users and products to build more complex queries.
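As a starting point for this experiment, here is one possible sketch; the Product label, the BUYS relationship type, and all property names are illustrative, not prescribed by the chapter:

MATCH (bob:User {name: "Bob"})
CREATE (laptop:Product {name: "Laptop", price: 999})
CREATE (bob)-[:BUYS {date: "2023-01-15"}]->(laptop)

You can then check the result with a query such as MATCH (u:User)-[:BUYS]->(p:Product) RETURN u.name, p.name.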

Summary

In this chapter, you learned about the specificities of graph databases and started to learn about Neo4j and the tools around it. Now, you know a lot more about the Neo4j ecosystem, including plugins such as APOC and the graph data science (GDS) library and graph applications such as Neo4j Browser and Neodash. You installed Neo4j on your computer and created your first graph database. Finally, you created your first nodes and relationships and built your first Cypher MATCH statement to extract data from Neo4j.

At this point, you are ready for the next chapter, which will teach you how to import data from various data sources into Neo4j, using built-in tools and the common APOC library.

Further reading

If you want to explore the concepts described in this chapter in more detail, please refer to the following references:

  • Graph Databases, The Definitive Book of Graph Databases, by I. Robinson, J. Webber, and E. Eifrem (O’Reilly). The authors, among whom is Emil Eifrem, the CEO of Neo4j, explain graph databases and graph data modeling, also covering the internal implementation. Very instructive!
  • Learning Neo4j 3.x - Second Edition, by J. Baton and R. Van Bruggen. Even if written for an older version of Neo4j, most of the concepts it describes are still valid – the newer Neo4j versions have mostly added new features such as clustering for scalability, without breaking changes.
  • The openCypher project (https://opencypher.org/) and the GQL specification (https://www.gqlstandards.org/) to learn about graph query languages beyond Cypher.

Exercises

To make sure you fully understand the content described in this chapter, you are encouraged to think about the following exercises before moving on:

  1. Which information do you need to define a graph?
  2. Do you need a graph dataset to start using a graph database?
  3. True or false:
    1. Neo4j can only be started with Neo4j Desktop.
    2. The application to use to create dashboards from Neo4j data is Neo4j Browser.
    3. Graph data science is supported by default by Neo4j.
  4. Are the following Cypher syntaxes valid, and why/why not? What are they doing?
    1. MATCH (x:User) RETURN x.name
    2. MATCH (x:User) RETURN x
    3. MATCH (:User) RETURN x.name
    4. MATCH (x:User)-[k:KNOWS]->(y:User) RETURN x, k, y
    5. MATCH (x:User)-[:KNOWS]-(y) RETURN x, y
  5. Create more data (other node labels/relationship types) and queries.

Key benefits

  • Extract meaningful information from graph data with Neo4j's latest version 5
  • Integrate graph algorithms into a regular machine learning pipeline in Python
  • Learn the core principles of the Graph Data Science library to make predictions and create data science pipelines

Description

Neo4j, along with its Graph Data Science (GDS) library, is a complete solution to store, query, and analyze graph data. As graph databases become more popular among developers, data scientists are increasingly likely to encounter them in their careers, making the ability to work with graph algorithms, extract context information, and improve overall model prediction performance an indispensable skill.

Data scientists working with Python will be able to put their knowledge to work with this practical guide to Neo4j and the GDS library, which offers step-by-step explanations of essential concepts and practical instructions for implementing data science techniques on graph data using the latest Neo4j version 5 and its associated libraries.

You’ll start by querying Neo4j with Cypher and learning how to characterize graph datasets. As you get the hang of running graph algorithms on graph data stored in Neo4j, you’ll understand the new and advanced capabilities of the GDS library that enable you to make predictions and write data science pipelines. Using the newly released GDS Python driver, you’ll be able to integrate graph algorithms into your ML pipeline. By the end of this book, you’ll be able to take advantage of the relationships in your dataset to improve your current model and make other types of elaborate predictions.

Who is this book for?

If you’re a data scientist or data professional with a foundation in the basics of Neo4j and are now ready to understand how to build advanced analytics solutions, you’ll find this graph data science book useful. Familiarity with the major components of a data science project in Python and Neo4j is necessary to follow the concepts covered in this book.

What you will learn

  • Use the Cypher query language to query graph databases such as Neo4j
  • Build graph datasets from your own data and public knowledge graphs
  • Make graph-specific predictions such as link prediction
  • Explore the latest version of Neo4j to build a graph data science pipeline
  • Run a scikit-learn prediction algorithm with graph data
  • Train a predictive embedding algorithm in GDS and manage the model store

Product Details

Publication date: Jan 31, 2023
Length: 288 pages
Edition: 1st
Language: English
ISBN-13: 9781804612743



Table of Contents

Part 1 – Creating Graph Data in Neo4j
Chapter 1: Introducing and Installing Neo4j
Chapter 2: Importing Data into Neo4j to Build a Knowledge Graph
Part 2 – Exploring and Characterizing Graph Data with Neo4j
Chapter 3: Characterizing a Graph Dataset
Chapter 4: Using Graph Algorithms to Characterize a Graph Dataset
Chapter 5: Visualizing Graph Data
Part 3 – Making Predictions on a Graph
Chapter 6: Building a Machine Learning Model with Graph Features
Chapter 7: Automatically Extracting Features with Graph Embeddings for Machine Learning
Chapter 8: Building a GDS Pipeline for Node Classification Model Training
Chapter 9: Predicting Future Edges
Chapter 10: Writing Your Custom Graph Algorithms with the Pregel API in Java
Index
Other Books You May Enjoy

