Crossing the Chasm – Eight Prerequisites For A Graph Query Language
- Blog >
- Crossing the Chasm – Eight Prerequisites For A Graph Query Language
Prelude
In December, I wrote a Quora post on the pros and cons of graph databases. I shared two cons pervasive in the market today: the difficulty of finding proficient graph developers, and how non-standardization on a graph query language is slowing down enterprise adoption, especially where it is needed most.
A solution to these two problems is availability of a simple, yet powerful standard graph query language. There are many graph languages on the market, including Neo4j’s Cypher, Apache TinkerPop Gremlin and TigerGraph’s GSQL, to name a few. Before discussing which graph language is the best, or fusing the best aspects of each graph language into a new, unified option, let us take a step back to ask a more fundamental question: What are the prerequisites for a graph query language in the first place?
Prerequisites For A Graph Language
At first sight, it is a hard question to answer! Based on the input from many users, and on my more than 18-year career journey in data management, I will attempt to answer this using a series of connected questions.
First, Why Does The World Pay Attention To The Graph Model?
Intel introduced the 8008 processor on April 1, 1972. This was followed by the availability of the first personal computers, introduced in 1975. During this era, more and more enterprise information became digitized. A pressing requirement of the market was an easy to use data management application to help with bookkeeping, and to generate ad hoc reports for business communication. Due to the scarcity of memory, the disk-based relational database was born, riding on the normalization theory to reduce data redundancy and improve data integrity. The relational model (or tabular model) met the hardware constraint and market requirements, and thus flourished over the next three decades. At that time, QUEL and SQL were two competing relational query languages. Both languages are expressive and easy to use, and they satisfied the market requirements. As the market evolved, SQL eventually won and became widely adopted by the industry.
Time flew by. Hardware continued to evolve while data volume increased at a rapid pace. The market demanded scalable software to manage this ever-growing data. Since 2000, many Massively Parallel Processing (MPP) database vendors have gained ground by addressing this requirement — these include Teradata, Greenplum, Netezza, Vertica, AsterData, Redshift, etc. These MPP databases still adhere to the relational model but use MPP to solve the scalability problem. The success of MPP databases can be attributed to SQL.Thanks to SQL’s declarative nature, applications written in SQL are data-independent, and thus can be ported to single-node or multiple-node RDBMS systems with ease.
The modern era is witnessing the data deluge and the interconnected data. The uniqueness and power of interconnected data is proven by the emergence of giant internet companies, including Google (search engine), Facebook (social network), LinkedIn (professional network), and Twitter (online news and social graph). Nowadays, mobile messengers such as Instagram, WhatsApp and WeChat, and mobile payment providers such as Paypal, WeChat Pay and Alipay generate enormous volumes of graph data every second. As people, machines, cars, mobile devices, etc. generate colossal amounts of interconnected data, we are truly in the midst of the network age! A pressing demand for managing graph data has emerged.
For the first time in the history of data management, the relational model no longer works, because relational databases cannot efficiently join multiple tables in order to traverse multiple links in the interconnected data. Typically, the relational model blindly scans all of or large portions of the participating tables, and uses the join predicates to find data record connections. One can argue that an index (an auxiliary data structure to speed up ad hoc record lookup) can help. However, a savvy database administrator will be happy to tell you that an index is a redundant copy of a small portion of the relational table, and if the data are changing frequently, it is a big challenge, if not infeasible, to keep the updated tables and indices in sync without sacrificing query performance.
More importantly, it is not even possible to write SQL queries to discover and predict relationships hidden in many tables– we cannot “connect the dots” by joining the tables when we don’t know for sure in advance what types of relationships may be involved.
The graph model comes to the rescue by offering these beneficial features:
- Natural and economic storage model for interconnected data.
- Natural model for managing ever-growing object types.
- Natural analytical computational model for knowledge/inference/learning – chaining and combining observations.
Why does the graph model need a new query language?
As discussed so far, the graph model is a perfect fit for today’s interconnected data. However, because it is new, and graph-oriented thinking is different, it takes some effort for developers to learn and use it with ease. Most computer science textbooks today will dedicate a special section on graph algorithms, with education on the graph model and graph algorithms oriented around appreciating the theoretical complexity and elegance of graph problems. Coding graph algorithms is another matter, with textbooks using general purpose programming language such as Java or C++. Furthermore, most graph algorithms in textbooks typically assume all graph data can fit in memory and don’t consider distributed computing. Consequently, there is a shortage of practical training and tools for real-world graph problems especially for big graph datasets.
For any enterprise adopting a graph model, a top concern is where do they find qualified developers to efficiently write graph applications?
The solution? Lowering the barrier to learning and implementing graph applications with – A Declarative Graph Query Language! Just as SQL and QUEL significantly lowered the barrier for managing relational data, a well-designed, user-friendly, and highly expressive graph query language can significantly bridge the gap between asking high-level real-life questions and easily crafting a graph-based solution.
What Are The Prerequisites Of A Graph Query Language?
That being said, based on years of experience working on real-world graph management problems from our forward thinking customers, I have summarized the key requirements for solving their problems, all of which serve as an empirical answer to our question What Are The Prerequisites Of A Graph Query Language?
- Schema-Based with Capability of Dynamic Schema Change.
As explained before, the success of SQL for relational database programming is largely attributed to data independence, where the user works with a logic level data schema that is independent of the underlying storage mechanism, and changes to the logic level data schema will be transparent to existing applications (so they won’t be broken). With data independence, all the lower level changes are transparent to the upper level. Applications written on a pre-postulated schema (data model) can be ported to any data management software and hardware that understand the schema. By the same token, a graph model and graph query language should embrace data independence. To be more precise, a graph query language should support the definition of logical graph schemas. Because data models change over time in the real world, the language and system should also support dynamic schema changes.
- High-Level Control of Graph Traversal
The SQL language is purely declarative, meaning users ask questions without worrying how the database software processes the queries. This relieves the user from coding a high-performance algorithm to perform their task. The same fashion can be applied to designing an elegant graph query language. For simple queries, it can be declarative, such as a pattern match query. E.g., this sample query below finds all customers whose age is greater than 23 and who has bought a product produced by Apple and also purchased by Peter.
- Fine Control of Graph Traversal
In addition to high-level queries, we see large customer demand for coding sophisticated iterative graph mining algorithms and traversal algorithms, where they want to traverse the graph data, add tags and side effects (runtime attributes), with iterative updates until a final query result is computed. For example, consider algorithms for collaborative filtering recommendation, PageRank, community detection, label propagation, similarity-based ranking, etc. See https://doc.tigergraph.com/GSQL-Tutorial-and-Demo-Examples.html. Based on our experience serving some of the largest companies in the world, customers come to graph management to design and execute complex algorithms that SQL cannot solve. Therefore, fine control of both the the graph query and the graph model’s elements – vertex and edge instances – is a must for a standard graph query language. For example, consider a PageRank query.
- Built-in Parallel Semantics to Guarantee High Performance
Graph algorithms are expensive, as each hop could potentially increase the data complexity exponentially. Let’s consider how: as we start from one vertex, the first hop can yield n neighbors, the next hop on the graph from the n immediate neighbors can yield n² more neighbors, and so on. As a result, each traversal step in the graph should have built-in parallel processing semantics to ensure high performance.
- A Highly Expressive Loading Language
The world is a graph. Ingesting data silos into a central graph schema needs an expressive and flexible schema-mapping loading language. Otherwise, even integrating a couple of data sources into the logical graph schema will be a daunting task. A comprehensive graph language needs to include an easy-to-use loading functionality to help onboard heterogeneous data quickly. This is one of the highest priority requirements for a graph query language and associated system.
- Data Security and Privacy
Enterprise customers are keen on facilitating their own collaboration with graph data. On one hand, they want a graph model that can share selected parts of the data among multiple departments and even among partners. At the same time, they also want to keep certain sensitive data private and accessible only by specific roles, departments and partners, based on their business need. A MultiGraph with a role-based security model is required for a successful graph query language in the real world for large enterprises.
- Support For Queries Calling Queries (recursively)
This is a less obvious requirement in terms of implementing a graph-based solution that generates business value. Quite often, our customers want to ask the same query for all vertices in the graph, the so-called batch processing. For example, for a product recommendation problem, they want to find recommendations for each and every customer. To write a batch query like this in one graph query is much harder than writing a single customer recommendation query because the data structures and algorithmic flow are much more complex. To solve this problem, a query-calling-query feature comes to rescue. In the main query, for each vertex we call the single customer recommendation query. The same pattern can be applied for other complex graph batch processing, where divide-and-conquer makes the graph query logic simpler and resource-wise more scalable. This also enables users to write and organize queries in more modularized way.
- SQL User-Friendly
We found most of our customers have SQL developers. The graph query language, therefore should stay close to SQL syntax and concepts as much as possible so that relational SQL developers can learn and implement graph applications with minimal ramp-up time.
How Does TigerGraph GSQL Address These Eight Prerequisites?
Based on customer feedback over five years, TigerGraph’s world-class team has co-invented with our customers, over several iterations, a Turing-complete graph query language called GSQL. It includes DDL (Data Definition Language) commands for defining a graph schema, a loading language for high-performance loading of structured and semi-structured data, and a SQL-like graph traverse and compute language with implicit parallelism.
- Schema-Based with Capability of Dynamic Schema Change.
GSQL uses Vertex type and Edge type as schema building blocks. A Graph is defined as a collection of Vertex types and Edge types. One TigerGraph database may contain multiple (possibly overlapping) graphs, each with its own set of user privileges and queries. E.g.,
Also, GSQL’s DDL supports dynamic schema change with SQL-like syntax. This is a must-have requirement commonplace in managing enterprise data today. With the DDL, all GSQL queries are written against the logic model. Some of our customers started with a single machine, and as their data size grows, they migrate to multiple machines leveraging our MPP graph database to scale out performance and storage. Thanks to our data independence query language, their applications are transparent to the scale out architecture change, and migrated seamlessly from single node to a cluster of nodes.
- High-Level Control of Graph Traversal
GSQL’s building block is a single SQL-like block with a one-hop pattern. For example, “find all customers who are 23 or more years old and who bought an iPhone” is expressed as follows:
GSQL users can write a series of single GSQL blocks to traverse the graph any number of steps, each step being a one-hop path pattern. In other words, it is very high-level declarative traversal, simple and clear. It is easy to read and maintain. Our customers without computer science degrees can master it in one or two days and start coding graph algorithms on their domain data.
- Fine Control of Graph Traversal
GSQL introduces a set of built-in accumulators, which serve as the vertices’ runtime attributes (a.k.a. properties). These runtime attributes can attach to each and every vertex instance seen during traversal in parallel, thus users can add tags or side effects on vertices during runtime traversal to gather evidences for complex applications such as fraud detection and personalized recommendations. GSQL also has flow control components such as WHILE loops, FOREACH loops, and IF-ELSE/CASE-WHEN branching to allow complex iterative or branching algorithms. Together with local and global accumulators, GSQL really puts the user’s hands on the steering wheel. Indeed, it’s Turing-complete, meaning any algorithm on the graph data can be coded within GSQL, therefore no graph data are needed to be exchanged back and forth between a graph database system and a client-side application which can slows the query performance orders of magnitudes.
- Built-in Parallel Semantics to Guarantee High Performance
A pair of innovative ACCUM and POST-ACCUM clauses in combination with built-in accumulator types encode the parallel processing of a vertex set or edge set, within the same blocks used for high-level graph traversal. This means GSQL enables users to achieve parallel processing of graph data very easily. This is a great advantage of the graph query language to encourage users with parallel thinking without worrying about the implementation details. Of course, not every graph query needs this powerful feature. PageRank and community detection types of algorithms can take advantage of this feature.
- A Highly Expressive Loading Language
The loading language is high expressive and declarative. Our users love it. We see customers learning it in half an hour and use it to integrate 10s to 100s data sources into a single graph model in one day.
- Security for Data Sharing and Privacy At The Same Time
The industry’s first multiple-graph model was born in GSQL and helps our customers to achieve data sharing and privacy at the same time.
- Support For Queries Calling Queries
A GSQL can call a query in several places: in the ACCUM and POST-ACCUM clauses, at the statement level, and even in the called query body itself. This gives the most flexibility and simplicity for solving real-life graph traversal problems. A query can also call itself recursively. This feature enables divide-and-conquer, greatly simplify query logics and encourage writing composable queries.
- SQL User-Friendly
GSQL re-uses as much as possible of SQL syntax, semantics, and keywords. GSQL can in fact be seen and learned as a chain of single SQL blocks.
To learn more, check out our white paper offering some high level comparison of GSQL with two other graph query language — Gremlin and Cypher.
Got questions? Reach out to us at [email protected].
For those of you going to SIGMOD 2018 (June 10-15, Houston, Texas), I’d love to chat. Stop by TigerGraph’s booth #21 and ask any questions regarding our graph query language GSQL.