by Dan McCreary, Head of AI at TigerGraph
A little semantics goes a long way.
We have had the privilege of demonstrating our new CoPilot GenAI product to several customers. However, we quickly found that many of them had never used generative AI to work with their knowledge graphs. Binding a simple chat question to a query requires an LLM to “understand” your graph. To do this, similarity measures are used to match questions with data elements such as a vertex, edge, attribute, or enumerated value (code list). Although picking good vertex and relationship names is essential, it is just a starting point. A complete solution needs precise definitions for all data elements and enumerated values.
Here is the critical insight: the better your definitions are written for LLMs, the more reliably a question can be matched to the right data elements.
Let’s show you why this is key.
Writing Precise but Clear Definitions
When we send a request to an LLM to generate a query, we often need to send along the most similar data elements to what the user is looking for. This is called prompt enrichment. But we don’t want to flood the LLM with a massive description of each data element that includes complex business rules. Remember, we are often paying for each token in our input prompts.
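To make prompt enrichment concrete, here is a minimal sketch in Python. It is not how CoPilot works internally: it uses simple word overlap (Jaccard similarity) as a stand-in for the embedding-based similarity a real system would likely use, and the `DEFINITIONS` dictionary and all function names are hypothetical.

```python
import re

# Hypothetical data-element definitions; in practice these would come from your graph schema.
DEFINITIONS = {
    "PersonFamilyName": "The surname or last name associated with an individual.",
    "CityName": "The official name of a city or town.",
    "BankAccountNumber": "A unique sequence of numbers that identifies a bank account.",
}


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_matches(question: str, definitions: dict, k: int = 2) -> list:
    """Return the names of the k definitions most similar to the question."""
    q = tokenize(question)
    ranked = sorted(
        definitions,
        key=lambda name: jaccard(q, tokenize(definitions[name])),
        reverse=True,
    )
    return ranked[:k]


def enrich_prompt(question: str, definitions: dict, k: int = 2) -> str:
    """Build a prompt that carries only the k most relevant definitions."""
    lines = [f"- {name}: {definitions[name]}" for name in top_matches(question, definitions, k)]
    return "Relevant data elements:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
```

Calling `enrich_prompt("What is the last name of the individual?", DEFINITIONS)` sends only the two closest definitions along with the question, which keeps the token count down.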
Here are the five criteria I use when teaching people to write a good definition:
- Precise
- Concise
- Distinct
- Noncircular
- Unencumbered with rules
Let’s see if we can find some standards to guide us. Note that although we want to focus on the semantics or meaning of a data element, sometimes, in the real world, we also need to talk about how a data element is represented in our graph as a data type. This is especially true if we pass values to functions that expect specific formats.
Semantics and Representation
When we discuss metadata, we generally refer to two aspects of a data element:
- The semantics, or meaning, of the data element. The critical question is: do two items share the same meaning?
- The representation of the data element: how the data is stored and the data type of a specific attribute (string, integer, float). The critical question is how the item is stored in a computer.
For example, “CustomerAcquisitionDate” might reference the date a new customer signed a PO, but how the date is represented might be a string in an Excel spreadsheet or a DateTime element within TigerGraph.
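A small Python sketch shows the difference. The two variables below hold the same meaning (one customer acquisition date) in two representations. The sample date and the MM/DD/YYYY spreadsheet format are assumptions made for illustration.

```python
from datetime import datetime

# Two hypothetical representations of the same CustomerAcquisitionDate value.
spreadsheet_value = "03/15/2024"     # a string, as it might appear in a spreadsheet export
graph_value = datetime(2024, 3, 15)  # a date-time value, as a graph attribute might hold it


def parse_spreadsheet_date(text: str) -> datetime:
    """Convert an MM/DD/YYYY string into a datetime (the format is an assumption)."""
    return datetime.strptime(text, "%m/%d/%Y")


# The semantics match even though the representations differ.
same_meaning = parse_spreadsheet_date(spreadsheet_value) == graph_value
```

The definition of the data element describes the shared meaning. The conversion function only deals with representation.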
Registries and Repositories
Some standards just focus on the semantics of items and their goal is to remove duplicates – two data elements that mean the same thing. We call these standards “metadata registries”. Other standards just scrape metadata from many databases and store them for use later. These are called metadata repositories. They have a similar-sounding name, but their role is very different.
In this article, we will focus on semantics, because that is what our LLM agents need to get their work done.
Leveraging ISO/IEC 11179 Standards
The ISO/IEC 11179 Metadata Registry (MDR) guidelines define a set of standards for the representation of data elements in a clear, consistent, and shareable format. When defining a data element according to these guidelines, several attributes are emphasized to ensure high quality and utility across diverse applications and contexts. Here, we’ll explore these attributes for each of the five criteria listed above:
Precise
- Definition Clarity: The definition of a data element should be clear and specific, leaving no room for ambiguity or misinterpretation. It must accurately reflect the nature and characteristics of the data element, ensuring that users can understand its meaning and application without confusion. Try to use the most precise words you can.
- Detail Level: Preciseness might also entail providing enough detail to uniquely identify the data element’s properties, including its value domain, data type, and any relevant constraints or parameters that define its use. Sometimes a data element can be represented by multiple data types. For example, a date might be stored as a string, a numeric value, or a date-time structure.
Concise
- Brevity with Completeness: While being detailed, the definition should also be concise, avoiding unnecessary words or phrases that do not contribute to a deeper understanding of the data element. The goal is to be as brief as possible while still being completely informative.
- Efficiency of Language: Use clear and straightforward language to facilitate easy comprehension and avoid technical jargon that could alienate non-specialist users unless absolutely necessary and well-defined elsewhere in the documentation.
Distinct
- Uniqueness: Each data element definition must be distinct, ensuring there is no overlap with other data elements in the registry. This uniqueness helps in avoiding redundancy and confusion, making it easier to identify and utilize data elements correctly.
- Identifiability: The characteristics and name of the data element should make it easily distinguishable from other elements, supporting effective data management and interoperability.
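One way to look for definitions that are not distinct is to compare every pair and flag the ones that read almost the same. The sketch below uses Python's standard `difflib.SequenceMatcher`. The 0.8 threshold and the sample registry are assumptions, and a character-level comparison like this will not catch two definitions that say the same thing in different words.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase the text and drop punctuation so only the words are compared."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def near_duplicates(definitions: dict, threshold: float = 0.8) -> list:
    """Return pairs of element names whose definitions are nearly identical."""
    flagged = []
    for (name_a, def_a), (name_b, def_b) in combinations(definitions.items(), 2):
        score = SequenceMatcher(None, normalize(def_a), normalize(def_b)).ratio()
        if score >= threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical registry containing one accidental duplicate.
registry = {
    "CityName": "The official name of a city or town.",
    "TownName": "the official name of a city or town",
    "BankAccountNumber": "A unique sequence of numbers that identifies a bank account.",
}
duplicates = near_duplicates(registry)
```

Here `duplicates` contains the pair `("CityName", "TownName")`, a candidate for merging into a single data element.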
Noncircular
- Independent Definitions: The definition of a data element should stand on its own, not relying on references to other data elements for its understanding. This noncircular approach ensures that each element is comprehensible in isolation, facilitating clarity and simplicity in data management.
- Avoidance of Referential Definitions: Definitions should not be based on or include references to the data element itself or its synonyms, ensuring that the meaning is clear without needing to refer back to other elements or the element being defined.
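A rough automated check for circular definitions is to see whether a definition simply repeats every word of the element's own name. The sketch below is a heuristic under stated assumptions: names are CamelCase, the function names are hypothetical, and synonyms or all-caps abbreviations such as "ID" are not handled.

```python
import re


def split_name(name: str) -> list:
    """Split a CamelCase element name into lowercase words."""
    return [word.lower() for word in re.findall(r"[A-Z][a-z0-9]*", name)]


def looks_circular(name: str, definition: str) -> bool:
    """Flag a definition that repeats every word of the element's own name."""
    name_words = split_name(name)
    definition_words = set(re.findall(r"[a-z0-9]+", definition.lower()))
    # An empty name_words list means the name could not be split, so do not flag it.
    return bool(name_words) and all(word in definition_words for word in name_words)


circular = looks_circular("CustomerIdentifier", "An identifier for a customer.")
independent = looks_circular(
    "CustomerIdentifier",
    "A unique code assigned to a person or organization that buys goods or services.",
)
```

The first definition is flagged because it only rearranges the name. The second passes because it explains the concept in other words.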
Unencumbered with Rules
- Separation from Business Rules: Data element definitions should be free from business rules or operational contexts that could limit their applicability across different scenarios. The focus should be on the inherent properties of the data element rather than rules governing its use.
- General Applicability: Ensuring the definition is not tied to specific use cases or scenarios allows for broader application and reusability of the data element across different domains and contexts.
Following these attributes as outlined by the ISO/IEC 11179 guidelines ensures that data element definitions are robust, reusable, and easily understandable, contributing to effective data governance, interoperability, and management in diverse information systems.
Attribute Naming Conventions
Longer, more descriptive attribute names also help users understand how data is represented within those attributes. ISO standards suggest adding a suffix at the end of each attribute name to indicate how the item should be used.
The ISO/IEC 11179 Metadata Registry (MDR) standard includes a naming convention for data elements that concludes with a “representation term.” Representation terms are specific keywords used at the end of data element names to indicate the type of value or the nature of the data contained within the element. These terms help in understanding what kind of information the data element represents and how it should be interpreted or used. The use of representation terms adds clarity and consistency across different domains and systems by providing a standardized way to describe the kind of data each element holds.
Common Representation Terms and Their Descriptions:
- Code: Indicates the data element contains a set of predefined values or identifiers used for classification or categorization purposes. Example: “CountryCode” would use predefined values to represent different countries.
- Date: Specifies that the data element contains calendar dates, potentially including year, month, and day. Example: “BirthDate” represents the date on which a person was born.
- Identifier: Used when the data element uniquely identifies an entity within a system or domain. Identifiers are often unique keys. Example: “StudentIdentifier” uniquely identifies a student in an educational institution.
- Indicator: Represents a binary or boolean value, often used for flags or simple yes/no attributes. Example: “IsActiveIndicator” could denote whether a record is active or inactive.
- Name: Used for data elements that contain names of entities, such as individuals, organizations, or places. Example: “OrganizationName” denotes the name of an organization.
- Number: Indicates that the data element contains numeric values. These could be integers, decimals, or other numeric formats. Example: “EmployeeNumber” might represent the unique numeric identifier for an employee.
- Quantity: Used for data elements that represent a countable number or amount, often associated with units of measure. Example: “OrderItemQuantity” could represent the number of items in an order.
- Rate or Ratio: Indicates data elements that express a numerical relationship between two numbers. Example: “SuccessRate” might represent the ratio of successes to attempts.
- Text: Specifies that the data element is composed of alphanumeric characters, including letters, numbers, and symbols, usually forming words or sentences. Example: “DescriptionText” would be used for descriptive paragraphs or sentences.
- Time: Used for data elements that specifically represent times of the day or duration. Example: “MeetingStartTime” represents the start time of a meeting.
- Value: Often used when the data element specifies a particular value from a range or set. Example: “TemperatureValue” would indicate a specific temperature reading.
Using representation terms in data element names according to ISO 11179 standards enhances interoperability and understanding by providing a clear indication of the data type and intended use of each element. This practice supports more effective data governance, data quality management, and system integration efforts.
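Because the representation term is always the last word of the name, it is easy to extract in code. The following Python sketch pulls the term from a CamelCase name and looks up a suggested storage type. The terms come from the list above, but the type mapping is an illustrative assumption and not part of the standard.

```python
import re
from typing import Optional

# Suggested storage type for each representation term (illustrative assumption).
REPRESENTATION_TYPES = {
    "Code": "string",
    "Date": "datetime",
    "Identifier": "string",
    "Indicator": "boolean",
    "Name": "string",
    "Number": "string",  # identifiers such as account numbers are often kept as strings
    "Quantity": "integer",
    "Rate": "float",
    "Ratio": "float",
    "Text": "string",
    "Time": "datetime",
    "Value": "float",
}


def representation_term(name: str) -> Optional[str]:
    """Return the trailing representation term of a CamelCase name, if it has one."""
    words = re.findall(r"[A-Z][a-z0-9]*", name)
    if words and words[-1] in REPRESENTATION_TYPES:
        return words[-1]
    return None


def suggested_type(name: str) -> Optional[str]:
    """Return the suggested storage type for a name, or None if no term is found."""
    term = representation_term(name)
    return REPRESENTATION_TYPES[term] if term else None
```

A check like this can also act as a simple lint: any attribute name for which `representation_term` returns `None` is missing its suffix.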
Creating ISO data element definitions involves specifying attributes that make each data element precise, concise, distinct, noncircular, and unencumbered by specific operational rules, according to the ISO 11179 guidelines.
Examples of Vertex Attributes
One of the best ways to get started is to look at example attribute definitions in the context of their vertex type:
Person
- Name: PersonFamilyName
- Definition: The surname or last name associated with an individual, representing the family or ancestral name passed down through generations or adopted through marriage or by personal choice.
- DataType: String
- Example: Johnson
Organization
- Name: Organization Legal Name
- Definition: The official legal name of an organization as registered in official state, country, or international records, including corporations, companies, institutions, or associations.
- DataType: String
- Example: Acme Corporation
Bank Account
- Name: Bank Account Number
- Definition: A unique sequence of numbers assigned to a bank account within a banking institution that identifies the account for the purposes of deposit, withdrawal, and transfers.
- DataType: String
- Example: 123456789012
Credit Card
- Name: Credit Card Number
- Definition: A unique 16-digit number embossed or printed on a credit card that identifies the card issuer and the cardholder account.
- DataType: String
- Example: 1234 5678 9012 3456
Purchase
- Name: Purchase Transaction ID
- Definition: A unique identifier assigned to a specific purchase transaction that facilitates tracking and reconciliation of purchases.
- DataType: String
- Example: TXN123456789
Vendor
- Name: Vendor ID
- Definition: A unique alphanumeric code assigned to a vendor to identify and differentiate it from other vendors in procurement and transaction systems.
- DataType: String
- Example: VEND-12345
City
- Name: City Name
- Definition: The official name of a city or town as recognized by local or national government authorities.
- DataType: String
- Example: Springfield
State
- Name: State Code
- Definition: A standardized abbreviation or code representing a state, province, or territory within a country, used for administrative and postal purposes.
- DataType: String
- Example: NY (for New York)
Zipcode
- Name: Postal Code
- Definition: A series of letters, numbers, or both assigned to different geographic areas or addresses to facilitate mail sorting and delivery.
- DataType: String
- Example: 12345
County
- Name: County Name
- Definition: The name of a county or similar administrative division within a country that is used for geographic, administrative, and statistical purposes.
- DataType: String
- Example: Madison County
These examples aim to be compliant with ISO 11179 by ensuring that each definition is clear, specific, and capable of being applied consistently across different contexts.
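If you keep definitions like these in code, a small record type is enough to hold the four fields used above. This is a minimal Python sketch. The `AttributeDefinition` class and its helper functions are hypothetical and not part of any TigerGraph API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttributeDefinition:
    """One attribute definition, mirroring the fields used in the examples above."""
    vertex_type: str
    name: str
    definition: str
    data_type: str
    example: str


CITY_NAME = AttributeDefinition(
    vertex_type="City",
    name="City Name",
    definition="The official name of a city or town as recognized by local or national government authorities.",
    data_type="String",
    example="Springfield",
)


def to_prompt_line(attr: AttributeDefinition) -> str:
    """Render a definition as a single compact line suitable for an LLM prompt."""
    return f"{attr.name} ({attr.vertex_type}, {attr.data_type}): {attr.definition}"


def to_json(attrs: list) -> str:
    """Serialize definitions to JSON so they can be stored or shared."""
    return json.dumps([asdict(attr) for attr in attrs], indent=2)
```

Keeping the definition separate from the data type and the example makes it easy to send only the definition to the LLM when tokens are tight.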
Definitions for Enumerated Values
The ISO/IEC 11179 Metadata Registry (MDR) standards provide a structured approach for defining data elements, including those that use enumerated values or code lists. Enumerated values, also known as controlled vocabularies or code lists, are predefined lists of permissible values that a data element can take. These are particularly useful for ensuring consistency, reliability, and interoperability of data, especially when sharing across different systems or domains. Here’s how ISO 11179 suggests creating precise definitions for data elements with enumerated values:
1. Define Each Value Clearly and Precisely
For each value in the enumeration or code list, provide a clear and unambiguous definition. This definition should convey the meaning of the value in a way that is easily understood by users, without requiring additional context or reference to external documents.
2. Assign Unique Identifiers
Each enumerated value should have a unique identifier or code that distinguishes it from other values in the list. These identifiers should be consistent and follow a logical naming convention that reflects the nature of the data element.
3. Provide a Definition for the Code List
Beyond defining individual values, ISO 11179 standards recommend providing a definition for the entire code list or enumeration. This description should explain the purpose of the code list, its scope, and any specific considerations relevant to its use.
4. Specify Applicability and Constraints
Clearly state any constraints on the use of the code list, including applicability in certain contexts or restrictions on combining with other data elements. This ensures that the enumerated values are used appropriately and maintain their intended meaning across different implementations.
5. Ensure Non-Ambiguity
Make sure that the definitions of enumerated values are non-ambiguous, distinct, and mutually exclusive. This clarity prevents overlap or confusion about how and when to use specific values.
6. Maintain Stability and Version Control
Enumerated values and their definitions should be stable over time to ensure consistency of data. If changes are necessary, maintain version control and provide documentation about the changes, including the rationale and impact on existing data.
7. Document the Source or Authority
If the enumerated values are derived from a standard or authoritative source, document this source within the definition. This can lend credibility and ensure alignment with industry or domain-specific practices.
Example: Employment Status
- Name: EmploymentStatus
- Definition: The category describing whether, and to what extent, an individual currently holds paid work.
- DataType: Code
- Enumeration:
  - FullTime: Engaged in employment for the majority of working hours in a week.
  - PartTime: Engaged in employment for less than the majority of working hours in a week.
  - Unemployed: Not engaged in employment but available for work.
  - Retired: Not seeking employment due to retirement.
By following these guidelines, organizations can create precise, reliable, and interoperable definitions for data elements with enumerated values, enhancing data quality and utility across diverse applications.
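The Employment Status example maps naturally onto a Python `Enum`. In this sketch, the `@unique` decorator enforces that no two members share a code, and a helper reports any value that lacks a definition. The member names and helper functions are assumptions made for illustration.

```python
from enum import Enum, unique


@unique  # raises ValueError at class creation if two members share the same code
class EmploymentStatus(Enum):
    FULL_TIME = "FullTime"
    PART_TIME = "PartTime"
    UNEMPLOYED = "Unemployed"
    RETIRED = "Retired"


# One definition per enumerated value, taken from the example above.
VALUE_DEFINITIONS = {
    EmploymentStatus.FULL_TIME: "Engaged in employment for the majority of working hours in a week.",
    EmploymentStatus.PART_TIME: "Engaged in employment for less than the majority of working hours in a week.",
    EmploymentStatus.UNEMPLOYED: "Not engaged in employment but available for work.",
    EmploymentStatus.RETIRED: "Not seeking employment due to retirement.",
}


def missing_definitions() -> list:
    """Return the enumerated values that have no (or an empty) definition."""
    return [s for s in EmploymentStatus if not VALUE_DEFINITIONS.get(s, "").strip()]


def describe_code_list() -> str:
    """Render the whole code list as 'Code: definition' lines for a prompt."""
    return "\n".join(f"{s.value}: {VALUE_DEFINITIONS[s]}" for s in EmploymentStatus)
```

Running `missing_definitions()` as part of a build or review step is one way to make sure a new code never ships without its definition.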
Final Note
“A little semantics goes a long way” is a phrase from Jim Hendler. It was originally coined back in 1997, in the context of helping intelligent agents find insights on the web. The same principles apply today.
If you want to learn more about creating LLM-friendly definitions for TigerGraph CoPilot, contact us at info@localhost.