Graph Traversal Using Apache TinkerPop: Gremlin

Anirban Dey
Introduction

Welcome to my blog on graph traversal with Apache TinkerPop and Gremlin! In today's data-driven world, understanding complex relationships is vital. Graph databases offer a flexible solution for capturing interconnected data, and Apache TinkerPop serves as a standardized framework for graph computing. At the heart of TinkerPop lies Gremlin, a powerful query language designed for graph traversal.

In this blog, we'll explore the fundamentals of graph databases, the role of Apache TinkerPop, and the expressive capabilities of Gremlin. We'll guide you through the installation of the Gremlin Console, provide practical examples, and equip you with the knowledge to unlock the hidden insights in your data.

Join us on this exciting journey as we dive into the world of graph traversal with Apache TinkerPop and Gremlin. Let's uncover the limitless possibilities of connected data together!

Graph Databases

Graph databases are a type of NoSQL database that uses graph structures to represent and store data.
They consist of nodes (vertices) that represent entities and edges that represent relationships between those
entities. Graph databases excel in handling highly connected data and are suitable for various use cases,
including social networks, recommendation engines, fraud detection, and knowledge graphs.
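
To make the vertex and edge vocabulary concrete, here is a minimal Python sketch of a small property graph held in plain dictionaries and lists. The airport codes, the distance values, and the names `vertices`, `edges`, and `neighbors` are assumptions made purely for illustration; they are not part of any graph database API.

```python
# A tiny property graph: vertices are entities, edges are relationships.
# All names and values here are made up for illustration.
vertices = {
    "ATL": {"kind": "airport"},
    "JFK": {"kind": "airport"},
    "LAX": {"kind": "airport"},
}

# Each edge is (source, target, properties); direction matters.
edges = [
    ("ATL", "JFK", {"distance": 760.0}),
    ("JFK", "LAX", {"distance": 2475.0}),
]


def neighbors(vertex_id):
    """Return the IDs of vertices reachable over one outgoing edge."""
    return [dst for src, dst, _ in edges if src == vertex_id]


print(neighbors("ATL"))  # ['JFK']
```

A graph database stores the same kind of structure, but adds indexing, persistence, and a query language so that you do not have to write such helpers by hand.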

Gremlin and its Use:

Gremlin is a graph traversal language that allows users to perform complex queries and traversals on graph databases. It provides a concise and expressive syntax to navigate, filter, and manipulate the data stored in the graph. Gremlin is defined by the Apache TinkerPop framework and is supported by several graph databases, including Amazon Neptune, JanusGraph, and Azure Cosmos DB. It is used in many domains and software applications. Some notable examples include:

  • Social Networks: Gremlin is used by social media platforms to analyze and recommend connections between users based on their interests, relationships, and activities.

  • Recommendation Engines: E-commerce platforms leverage Gremlin to generate personalized product recommendations by analyzing the purchase history, browsing patterns, and preferences of users.

  • Fraud Detection: Gremlin is employed in fraud detection systems to identify suspicious patterns and connections in financial transactions or online activities.

  • Knowledge Graphs: Gremlin is utilized to query and navigate large knowledge graphs that store structured information about various domains such as science, medicine, and literature.

This blog aims to provide a comprehensive guide to using Gremlin in your project and harnessing the power of graph databases. Here, we will use Apache TinkerPop for our purpose.

Installation:

This section provides step-by-step instructions on how to install Gremlin Console on an Ubuntu system. Gremlin Console is a command-line tool that allows you to interactively execute Gremlin queries and commands.

  • Prerequisites:

    Before installing Gremlin Console, ensure that you have the following prerequisites:

    1. Ubuntu system with administrative privileges.

    2. Java Development Kit (JDK) or Java Runtime Environment (JRE) installed. Gremlin Console requires Java to run. You can install OpenJDK using the package manager and then verify the installation:

    sudo apt update 

    sudo apt install default-jre 

    sudo apt install default-jdk 

    javac -version 

    java -version

Installing Gremlin Console:

To install Gremlin Console on your Ubuntu system, follow these steps:

Step 1 : Downloading Gremlin-Console

  1. Go to the official Apache TinkerPop Gremlin website: https://tinkerpop.apache.org/gremlin.html

  2. Navigate to the "Downloads" section of the website. You can typically find it in the main menu or as a prominent link on the homepage.

  3. Look for the "Current Release" section or a similar heading. The specific naming and location may vary depending on the website's layout. Within the "Current Release" section, locate the link related to "Gremlin Console" downloads. This link will enable you to download the Gremlin Console, which is a tool for interacting with Apache TinkerPop: Gremlin.

  4. Click on the download link for the Gremlin Console.

  5. Once the download is finished, you will have the Apache TinkerPop: Gremlin Console package saved on your computer.

Step 2 : Installing and Starting the Gremlin console

  • Open a terminal window on your Ubuntu system and navigate to the Downloads folder where you saved the zip file (e.g., apache-tinkerpop-gremlin-console-3.6.4-bin.zip)

      cd Downloads
    
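  • Unzip the file (sudo is not needed inside your own Downloads folder, and using it would leave the extracted files owned by root)

      unzip apache-tinkerpop-gremlin-console-3.6.4-bin.zip
    
  • Rename the extracted folder to something shorter (the folder name does not carry the -bin suffix of the zip file)

      mv apache-tinkerpop-gremlin-console-3.6.4 gremlin_console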
    
  • Navigate to the bin folder inside the gremlin_console folder:

      cd gremlin_console/bin
    
  • Run the shell script named gremlin.sh to start the Gremlin Console.

      # Make the shell file executable
      chmod +x gremlin.sh
      # Run the shell file
      ./gremlin.sh
    

If the Gremlin banner and a gremlin> prompt appear in your terminal, congratulations! You have successfully installed and run the Gremlin Console on your device.

Some Basic Gremlin Commands

Gremlin Console provides an interactive environment for querying graph databases using the Gremlin query language. Whether you're new to graph databases or an experienced developer, it offers a convenient way to navigate and manipulate your graph data. In this section, we will walk through creating a graph, loading data into it, and executing some basic queries.

Creating a TinkerGraph:

graph = TinkerGraph.open()

The output will be ==>tinkergraph[vertices:0 edges:0], which shows the initial state of an empty TinkerGraph: there are no vertices (nodes) or edges (relationships) in the graph yet. It confirms that you are starting from a clean slate with no existing data.

Loading a graphML file:

Suppose your data is in .csv format and you want to load it as a graph in the Gremlin Console. You first have to convert it to a .graphml file and then run the following commands to load it.

graph.io(graphml()).readGraph('your_file.graphml')
g = graph.traversal()

To convert your .csv file to a .graphml file, sample Python code is given below. Please modify it (in particular the file paths and column names) for your purpose.

import csv
import networkx as nx

# Read the CSV file
with open("D:\\CHROME DOWNLOAD\\Jan_2019_ontime.csv", 'r', newline='') as f:
    reader = csv.DictReader(f)
    data = [row for row in reader]

# Create a directed graph
G = nx.DiGraph()

# Add edges and vertices to the graph
for row in data:
    src = row['ORIGIN']
    dst = row['DEST']
    distance = float(row['DISTANCE'])
    src_name2 = row['ORIGIN_AIRPORT_ID']
    dst_name_2 = row['DEST_AIRPORT_ID']
    # Add vertices with airport code and airport ID properties
    G.add_node(src, ID=src, ORI_AIR_ID=src_name2)
    G.add_node(dst, ID=dst, DST_AIR_ID=dst_name_2)
    # Check if an edge already exists between src and dst
    if G.has_edge(src, dst):
        # Keep only the edge with the largest distance
        if G[src][dst]['distance'] < distance:
            G[src][dst]['distance'] = distance
    else:
        G.add_edge(src, dst, distance=distance)

# Write the graph to a GraphML file
nx.write_graphml(G, "D:\\CHROME DOWNLOAD\\Final_graph.graphml")
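
The deduplication rule in the script above (one edge per origin/destination pair, keeping the largest distance) can be checked in isolation. The following is a minimal sketch that uses only the Python standard library; the three sample rows and the `largest_distance_per_route` helper are assumptions made for illustration and are not data from Jan_2019_ontime.csv.

```python
import csv
import io

# Hypothetical sample rows using the same column names as the script.
SAMPLE_CSV = """ORIGIN,DEST,DISTANCE
ATL,JFK,760
ATL,JFK,762
JFK,LAX,2475
"""


def largest_distance_per_route(csv_text):
    """Map each (origin, destination) pair to its largest distance."""
    routes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["ORIGIN"], row["DEST"])
        distance = float(row["DISTANCE"])
        # Keep the larger value when the same route appears again.
        if key not in routes or routes[key] < distance:
            routes[key] = distance
    return routes


routes = largest_distance_per_route(SAMPLE_CSV)
print(routes)
```

Each key in `routes` corresponds to one directed edge in the resulting graph, so repeated flights on the same route collapse into a single edge.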

After you run the two Gremlin commands above on the GraphML file generated from Jan_2019_ontime.csv, the console prints ==>graphtraversalsource[tinkergraph[vertices:346 edges:5535], standard]. This indicates that the graph now contains 346 vertices (nodes) and 5535 edges (relationships). The term "standard" means that the traversal source uses the standard (OLTP) traversal engine, which is the default for traversing and querying the graph.
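
One way to cross-check the vertex and edge counts that the console reports is to count the node and edge elements in the GraphML file before loading it. The sketch below assumes the standard GraphML XML namespace and parses a tiny inline document; the `count_graphml_elements` helper and the sample content are illustrative assumptions, not output from the real dataset.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Tiny inline GraphML document, made up for illustration.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="ATL"/>
    <node id="JFK"/>
    <node id="LAX"/>
    <edge source="ATL" target="JFK"/>
    <edge source="JFK" target="LAX"/>
  </graph>
</graphml>
"""


def count_graphml_elements(root):
    """Return (vertex_count, edge_count) for a parsed GraphML root."""
    nodes = root.findall(".//g:node", NS)
    edges = root.findall(".//g:edge", NS)
    return len(nodes), len(edges)


root = ET.fromstring(SAMPLE_GRAPHML)
print(count_graphml_elements(root))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(...)`; the two numbers should then match the vertices and edges values shown by the console.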

Find The Vertices & Edges

//Finding all Vertices
g.V()
//Finding all Edges
g.E()

Retrieving the incoming and outgoing edges of all vertices

//Retrieving the incoming edges of every vertex
g.V().inE()
//Retrieving the outgoing edges of every vertex
g.V().outE()
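
The difference between incoming and outgoing edges can be illustrated outside the console as well. This is a minimal Python sketch over a hypothetical list of directed edges; the sample data and the `out_edges` and `in_edges` helpers are assumptions for illustration and only mimic what the edge steps do for a single vertex.

```python
# Directed edges as (source, target) pairs; sample data for illustration.
edges = [("ATL", "JFK"), ("JFK", "LAX"), ("LAX", "ATL"), ("ATL", "LAX")]


def out_edges(vertex):
    """Edges that start at the vertex (the edges outE() walks to)."""
    return [e for e in edges if e[0] == vertex]


def in_edges(vertex):
    """Edges that end at the vertex (the edges inE() walks to)."""
    return [e for e in edges if e[1] == vertex]


print(out_edges("ATL"))  # [('ATL', 'JFK'), ('ATL', 'LAX')]
print(in_edges("ATL"))   # [('LAX', 'ATL')]
```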

You can find many more Gremlin queries, from basic to advanced, at:

Some experiments

Suppose we are given a CSV file named Jan_2019_ontime.csv which contains data related to some airline flights in January 2019. The columns in the file include:

  1. ORIGIN_AIRPORT_ID: The ID of the origin airport.

  2. DEST_AIRPORT_ID: The ID of the destination airport.

  3. ORIGIN: The code or abbreviation for the origin airport.

  4. DEST: The code or abbreviation for the destination airport.

  5. DISTANCE: The distance in miles between the origin and destination airports.

Each row in the CSV file represents a specific flight, with information about the origin and destination airports and the distance between them.

Our goal is to find paths in which airport "A" has a flight to airport "B" and airport "B" has a flight to airport "C", but there is no direct flight from airport "A" to airport "C".

Note that Airport "A" and Airport "C" must have different IDs or Names.

Now we'll try to solve this problem with Gremlin.

Step-1: Creating an empty graph

First, open the Gremlin Console on your system, then create an empty TinkerGraph:

graph = TinkerGraph.open()

Step-2: Read the data from the created GraphML file

graph.io(graphml()).readGraph('Final_graph.graphml')
g = graph.traversal()

Step-3: Retrieving the desired Resultset

g.V().as('a').outE().as('ab').inV().as('b').outE().as('bc').inV().as('c').where('a', P.neq('b')).where('b', P.neq('c')).where('a', P.neq('c')).not(__.select('a').out().where(P.eq('c'))).select('a', 'b', 'c')

This command returns all triples of airports "A", "B", and "C" such that "A" is connected to "B" by the edge "AB" and "B" is connected to "C" by the edge "BC", but "A" is not connected to "C" by any edge, i.e., there is no direct connection from "A" to "C". The three where() steps ensure that the airports are mutually distinct, and the not() step discards every path for which an outgoing edge from "A" leads directly to "C".

[Provided that A, B, C are mutually distinct]

If you want to count such paths, append count() to the traversal in place of select():

g.V().as('a').outE().as('ab').inV().as('b').outE().as('bc').inV().as('c').where('a', P.neq('b')).where('b', P.neq('c')).where('a', P.neq('c')).not(__.select('a').out().where(P.eq('c'))).count()
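
The intended A → B → C logic (two hops, all three airports distinct, and no direct A → C flight) can be sanity-checked on a small example before running it on the full graph. The following Python sketch works on a hypothetical list of four flights; the data and the `two_hop_without_direct` helper are assumptions for illustration and are not a translation of how Gremlin executes the traversal.

```python
from collections import defaultdict

# Hypothetical directed flights (origin, destination).
flights = [
    ("ATL", "JFK"),
    ("JFK", "LAX"),
    ("JFK", "ORD"),
    ("ATL", "ORD"),
]

# Adjacency sets: airport -> airports reachable by one direct flight.
out_neighbors = defaultdict(set)
for origin, destination in flights:
    out_neighbors[origin].add(destination)


def two_hop_without_direct(adjacency):
    """Return sorted (a, b, c) triples with a->b, b->c and no a->c edge."""
    triples = []
    for a in list(adjacency):
        for b in adjacency[a]:
            for c in adjacency.get(b, ()):
                if len({a, b, c}) == 3 and c not in adjacency[a]:
                    triples.append((a, b, c))
    return sorted(triples)


triples = two_hop_without_direct(out_neighbors)
print(triples)       # [('ATL', 'JFK', 'LAX')]
print(len(triples))  # 1
```

Here ATL → JFK → ORD is excluded because a direct ATL → ORD flight exists, which is exactly the kind of path the experiment is meant to filter out.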

Conclusion

In conclusion, Apache TinkerPop and its query language, Gremlin, provide a powerful toolkit for graph traversal and analysis. With the ability to handle highly connected data and represent complex relationships, graph databases have become indispensable in various domains. Throughout this blog, we have explored the fundamentals of graph traversal, the role of Apache TinkerPop as a graph computing framework, and the expressive power of Gremlin. We have learned how to install and set up Gremlin Console, allowing us to interactively execute Gremlin queries and commands. By demonstrating practical examples of Gremlin traversal on real-life datasets, we have showcased the versatility and utility of graph traversal in uncovering insights and patterns within interconnected data. From retrieving vertices and edges to filtering, navigating, and performing advanced computations, Gremlin has proved to be a valuable asset in graph-based data analysis.

Apache TinkerPop's vendor-agnostic nature and the widespread adoption of Gremlin by various graph databases make it a reliable and versatile tool for developers and data scientists. By leveraging Apache TinkerPop and Gremlin, you can seamlessly work with different graph database implementations, empowering you to build robust and scalable graph-based applications. As you continue your journey with graph databases and traversal, we encourage you to explore further and experiment with the rich features and functionalities offered by Apache TinkerPop and Gremlin. The official documentation and resources provided by the Apache TinkerPop community are excellent starting points for deepening your knowledge and unlocking the full potential of graph traversal.

With Apache TinkerPop and Gremlin, you are equipped with the tools to navigate and analyze complex relationships, extract valuable insights, and make informed decisions based on interconnected data. Embrace the power of graph traversal and unlock new possibilities in your data-driven projects.

Happy graph traversing!
