I Tried to Automate Knowledge Graph Schema and It Blew My Mind
In this Story, I have a super quick tutorial showing you how to automate the knowledge graph schema to build a powerful agent chatbot for your business or personal use.
If you’ve worked in AI agent development for long enough, you eventually stop thinking about your nodes, tables, edges, and normalised schemas — they become second nature. That’s where I was.
Until one weekend, I got curious.
I have worked with a couple of clients for a while now. I know knowledge graphs (KGs) can already organize massive amounts of complex information into structured, machine-readable knowledge. But one big problem when building a knowledge graph is that it usually needs a fixed structure, called a schema, before you even start.
Think of it like trying to build a Lego castle, but someone tells you exactly where every brick must go before you begin. That might work for one type of castle, but what if you want to build a spaceship next? You’d have to start all over again with a new plan.
In the old way, experts had to design these schemas up front, often by carefully prompting an LLM, which limits scalability and adaptability: the schema only works well for one topic or domain. If new data comes in or the topic changes, the whole graph might stop working or need major updates. It’s not very flexible.
But a new method I discovered to solve this problem automatically induces schemas directly from unstructured text using large language models, enabling fully autonomous, large-scale knowledge graph construction that can dynamically adapt to diverse and evolving domains without redesigning the schema.
AutoSchemaKG has significantly improved construction efficiency. According to experimental data, compared with traditional methods, this framework can shorten the construction time of knowledge graphs by about 70% while maintaining a high accuracy rate.
This achievement not only reduces costs but also makes it possible to update large-scale knowledge graphs in real time, truly realizing the combination of “intelligence” and “efficiency”.
So, let me give you a quick demo of a live chatbot to show you what I mean.
I will ask the chatbot a question: “Who is Alex?” If you take a look at how the chatbot generates the output, you’ll see that the agent searches through its internal knowledge graph. It made sure all the nodes in the graph have the required attributes like 'type', 'id', and 'file_id', so every part of the graph is well-structured and ready for retrieval. If any of these were missing, the agent automatically assigned sensible defaults, marking a node as "text" if its ID matched known text entries, or "entity" otherwise.
Once the graph is ready, the agent uses a sentence encoder to turn the question, the graph’s nodes, edges, and text content into vector embeddings. These embeddings are then stored in FAISS indexes, which makes the retrieval process super fast.
After that, I used the HippoRAG2Retriever to combine the LLM generator and the graph data. When I asked “Who is Alex?”, the retriever scanned the graph’s text, nodes, and edges for the most relevant matches based on similarity scores. It picked the top 2 most relevant pieces of context, sorted them, and passed them into the LLMGenerator, which then used that context to generate a final answer.
So, by the end of this Story, you will understand what AutoSchemaKG is, how it works, and how we are going to automate the Knowledge Graph Schema to create a powerful Agentic chatbot.
This code will be on my Patreon because it takes me a lot of time to build, and if you could support me, I would really appreciate it.
Before we start! 🦸🏻♀️
If you like this topic and you want to support me:
Like my article, which will really help me out.👏
Follow me on my YouTube channel
Subscribe to me to get the latest articles.
What is AutoSchemaKG?
AutoSchemaKG is a framework for building a knowledge graph (KG) completely autonomously, eliminating the need for predefined schemas. The system uses large language models (LLMs) to perform knowledge triple extraction and schema induction simultaneously, directly from text data in a web-scale corpus.
At the core of AutoSchemaKG is the conceptualization process that drives schema induction. It generalizes concrete entities, events, and relations into broader conceptual categories through abstraction mechanisms.
This conceptualization includes: building semantic bridges between different information, supporting zero-shot cross-domain reasoning, reducing sparsity in KGs, and providing a hierarchical organization structure that supports both concrete and abstract reasoning.
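To make that concrete, here is a tiny sketch with made-up toy data (not AutoSchemaKG’s actual API or output) showing how a shared, induced concept acts as a semantic bridge between entities from different domains:

# Minimal sketch of conceptualization as a semantic bridge (toy data, assumed for illustration).
import networkx as nx

G = nx.DiGraph()
G.add_edge("Einstein", "Princeton", relation="worked at")                 # from a physics article
G.add_edge("Marie Curie", "University of Paris", relation="worked at")    # from a chemistry article

# Schema induction abstracts both people into the same broader concept...
for person in ("Einstein", "Marie Curie"):
    G.add_edge(person, "scientist", relation="is a")

# ...so a path now exists between entities that never co-occur in the text,
# which is what reduces sparsity and enables zero-shot cross-domain reasoning.
print(nx.has_path(G.to_undirected(), "Princeton", "University of Paris"))  # True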
How it Works
AutoSchemaKG converts unstructured text into a structured knowledge graph through a two-part process. In the first part, it uses a large language model to extract three types of relationships in stages: entity-entity relations, such as identifying that “Einstein” worked at “Princeton”; entity-event relations, such as linking “Einstein” to the “discovery of the theory of relativity”; and event-event relations, such as connecting “World War I” to “World War II.”
Each relationship is turned into a triple — two elements connected by a relation — and stored with the original text and metadata. In the second part, called schema induction, the system abstracts specific entities, events, and relations into higher-level concept types using the language model. For example, “Einstein” might be labelled as a “scientist,” and “Theory of Relativity” as a “scientific theory.”
It uses information from neighboring nodes to add more context, processes everything in batches for speed, and saves the results in a CSV file. This allows the final knowledge graph to be flexible, scalable, and usable across different domains without manual schema design.
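Before we get to the real pipeline below, here is a rough, hand-written sketch of the shape of data flowing through those two stages; the field names are purely illustrative assumptions, not AutoSchemaKG’s exact output format:

# Stage 1 - triple extraction: the LLM pulls three kinds of relations out of raw text.
# (Illustrative structures only; field names here are assumptions.)
triples = [
    {"head": "Einstein", "relation": "worked at", "tail": "Princeton", "kind": "entity-entity"},
    {"head": "Einstein", "relation": "contributed to", "tail": "discovery of relativity", "kind": "entity-event"},
    {"head": "World War I", "relation": "preceded", "tail": "World War II", "kind": "event-event"},
]

# Stage 2 - schema induction: the same LLM abstracts nodes into higher-level
# concept types, using neighbouring nodes for extra context.
induced_concepts = {
    "Einstein": ["scientist", "person"],
    "Princeton": ["university", "organization"],
    "discovery of relativity": ["scientific discovery", "event"],
    "World War I": ["war", "historical event"],
}

# The triples plus the induced concepts are what get written out to CSV in batches.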
GraphRAG vs AutoSchemaKG
GraphRAG and AutoSchemaKG are not really competitors. Each approach has its unique advantages and is suited to different stages.
GraphRAG excels at leveraging existing or manually curated knowledge graphs to enhance retrieval and reasoning tasks, especially when high-quality, domain-specific graphs are available.
AutoSchemaKG focuses on automatically constructing large, flexible, and comprehensive knowledge graphs from unstructured data without manual schemas, enabling scalability and extensive domain coverage.
Together, these approaches can be integrated: AutoSchemaKG can automatically generate knowledge graphs that can later be used by GraphRAG to improve performance in various tasks.
Let’s Start Coding
Let us now explore, step by step, how to automate the knowledge graph schema. First, we will install the libraries that support the model by running pip install against the requirements file:
pip install -r requirements.txt
The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.
Atlas-RAG is a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas.
I set the environment to use GPU device 1 with os.environ['CUDA_VISIBLE_DEVICES'] = '1' to control which GPU is used during processing. Then, I imported key components like TripleGenerator, KnowledgeGraphExtractor, and ProcessingConfig from atlas_rag to work with knowledge graphs, and also brought in the OpenAI class to connect with a model API.
I created an OpenAI client using a custom base URL from DeepInfra and an API key to connect with the model. I set the keyword to 'Dulce' and created an output directory path based on that keyword. Finally, I initialized the TripleGenerator with the OpenAI client, customising it with parameters like max_new_tokens = 4096, a low temperature = 0.1 for more deterministic results, and frequency_penalty = 1.1 to reduce repetition in the output.
import os
from atlas_rag import TripleGenerator, KnowledgeGraphExtractor, ProcessingConfig
from openai import OpenAI
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
model_name = "meta-llama/Llama-3.3-70B-Instruct"
client = OpenAI(
base_url="https://api.deepinfra.com/v1/openai",
api_key="",
)
keyword = 'Dulce'
output_directory = f'import/{keyword}'
triple_generator = TripleGenerator(client, model_name=model_name,
max_new_tokens = 4096,
temperature = 0.1,
frequency_penalty = 1.1)
Then I built the knowledge graph extraction pipeline. I created a ProcessingConfig called kg_extraction_config, where I set the model path to the LLaMA 3.3 model, pointed the data source to the "example_data" folder, filtered files using the keyword as the filename pattern, set a small batch size of 4 for manageable processing, and defined the output directory.
Then I created a KnowledgeGraphExtractor using the triple_generator and my custom config, and started the extraction with run_extraction(), which automatically reads the input data, generates triples, and writes them to JSON. Lastly, I added a step to convert the extracted JSON data into a structured CSV file using convert_json_to_csv() to make the results easy to view and analyse.
kg_extraction_config = ProcessingConfig(
model_path=model_name,
data_directory="example_data",
filename_pattern=keyword,
batch_size=4,
output_directory=f"{output_directory}",
)
kg_extractor = KnowledgeGraphExtractor(model=triple_generator, config=kg_extraction_config)
kg_extractor.run_extraction()
kg_extractor.convert_json_to_csv()
After that, I wrote a script to manually generate the concept CSV files and then build a complete directed knowledge graph in GraphML format. First, I made sure the concept_csv directory exists inside the output folder. I read the original nodes and edges from the triples_csv directory and saved exact copies of them as concept_nodes and triple_edges files in the new concept folder.
Since there were no explicit concept-to-concept links, I created an empty concept_edges CSV with the correct column structure. Then, I used NetworkX to build a directed graph (DiGraph). I added each node from the original node file with detailed attributes like id, type, concepts, and synsets, and also added text nodes by reading a separate text_nodes CSV. Next, I added edges to the graph by linking entities/events using the data from the original edges file, plus additional “mentions” edges from the text_edges file, if present.
Finally, I created a kg_graphml directory and exported the full graph to a .graphml file, summarizing the result with a print statement showing the total number of nodes and edges created.
import pandas as pd
import os
# Create concept_csv directory
os.makedirs(f"{output_directory}/concept_csv", exist_ok=True)
# Read original files
nodes_df = pd.read_csv(f"{output_directory}/triples_csv/triple_nodes_{keyword}_from_json_without_emb.csv")
edges_df = pd.read_csv(f"{output_directory}/triples_csv/triple_edges_{keyword}_from_json_without_emb.csv")
# Manually create what create_concept_csv should have created:
# 1. concept_nodes file (copy of original nodes)
nodes_df.to_csv(f"{output_directory}/concept_csv/concept_nodes_{keyword}_from_json_with_concept.csv", index=False)
# 2. triple_edges file (copy of original edges)
edges_df.to_csv(f"{output_directory}/concept_csv/triple_edges_{keyword}_from_json_with_concept.csv", index=False)
# 3. concept_edges file (empty since no concepts)
concept_edges_df = pd.DataFrame(columns=[':START_ID', ':END_ID', 'relation', ':TYPE'])
concept_edges_df.to_csv(f"{output_directory}/concept_csv/concept_edges_{keyword}_from_json_with_concept.csv", index=False)
print("Concept CSV files created manually")
# Now create the GraphML with proper 'id' attributes
import networkx as nx
G = nx.DiGraph()
# Add entity/event nodes
for _, row in nodes_df.iterrows():
node_id = str(row['name:ID'])
G.add_node(node_id,
id=node_id,
file_id=node_id,
type=str(row['type']),
concepts=str(row['concepts']),
synsets=str(row['synsets']),
label=str(row[':LABEL']))
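# Add text nodes by reading the separate text_nodes CSV, so the source passages
# themselves become part of the graph. NOTE: the filename below is an assumption;
# it mirrors the naming convention of the text_edges file produced by the extractor.
text_nodes_file = f"{output_directory}/triples_csv/text_nodes_{keyword}_from_json.csv"
text_nodes_df = pd.read_csv(text_nodes_file)
for _, row in text_nodes_df.iterrows():
    text_id = str(row['text_id:ID'])
    G.add_node(text_id,
               id=text_id,
               file_id=text_id,
               type='text',
               label='Text')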
# Add entity/event edges
for _, row in edges_df.iterrows():
G.add_edge(str(row[':START_ID']), str(row[':END_ID']),
relation=str(row['relation']),
type=str(row[':TYPE']))
# Add text edges if they exist
text_edges_file = f"{output_directory}/triples_csv/text_edges_{keyword}_from_json.csv"
if os.path.exists(text_edges_file):
text_edges_df = pd.read_csv(text_edges_file)
for _, row in text_edges_df.iterrows():
G.add_edge(str(row[':START_ID']), str(row[':END_ID']),
relation="mentions",
relation="mentions",
nx.write_graphml(G, f"{output_directory}/kg_graphml/{keyword}_graph.graphml")
print(f"GraphML created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
I built the RAG pipeline to integrate the key components needed for semantic retrieval and answer generation. I started by selecting the "sentence-transformers/all-MiniLM-L6-v2" model, known for its speed and accuracy, and loaded it using SentenceTransformer with trust_remote_code=True and device_map="auto" to ensure it runs efficiently on the available hardware.
I wrapped this model with SentenceEmbedding to transform user queries and documents into dense vectors for similarity-based retrieval. Then, I connected my previously configured OpenAI client and LLaMA 3.3 model to the LLMGenerator, which I used to generate natural-language answers based on the retrieved information.
# Continue with the RAG setup from the documentation
from sentence_transformers import SentenceTransformer
from atlas_rag.retrieval import SentenceEmbedding
from atlas_rag.reader import LLMGenerator
from atlas_rag import create_embeddings_and_index
# Step 4: Setup RAG components
encoder_model_name = "sentence-transformers/all-MiniLM-L6-v2"
sentence_model = SentenceTransformer(encoder_model_name, trust_remote_code=True, model_kwargs={'device_map': "auto"})
sentence_encoder = SentenceEmbedding(sentence_model)
llm_generator = LLMGenerator(client=client, model_name=model_name)
Then I built a complete embedding and indexing pipeline to prepare the knowledge graph data for efficient semantic retrieval. I extracted the original texts and their IDs from the text_nodes_df dataframe and built a dictionary mapping each text ID to its corresponding text. Then, I gathered the full list of nodes and edges from the graph G, converting them into strings to use as input for embeddings.
I used compute_text_embeddings along with the sentence_encoder to compute vector embeddings for three types of elements: original texts, graph nodes, and edges. For each of these, I printed progress updates to track the embedding process. After that, I built FAISS indexes using a helper function create_faiss_index, which normalises and indexes the embeddings using the IndexHNSWFlat structure with inner product similarity.
I created separate indexes for text, node, and edge embeddings, and also built a combined graph index using both node and edge embeddings. Finally, I wrapped all of this into a unified data dictionary that includes the graph, lists, embeddings, FAISS indexes, and mappings, setting the stage for fast and flexible retrieval during the RAG workflow.
from atlas_rag.retrieval.indexer import compute_text_embeddings
import faiss
import numpy as np
# Prepare data for embeddings
original_text_list = text_nodes_df['original_text'].tolist()
text_id_list = text_nodes_df['text_id:ID'].tolist()
# Create text dictionary
text_dict = {text_id: text for text_id, text in zip(text_id_list, original_text_list)}
# Get node and edge lists from the updated graph G
node_list = list(G.nodes())
node_list_string = [str(node) for node in node_list]  # stringified nodes used as embedding input
edge_list = list(G.edges())
edge_list_string = [f"{edge[0]} -> {edge[1]}" for edge in edge_list]
print(f"Computing embeddings for {len(node_list)} nodes, {len(edge_list)} edges, {len(original_text_list)} texts")
# Compute embeddings
print("Computing text embeddings...")
text_embeddings = compute_text_embeddings(original_text_list, sentence_encoder, 64, True)
print("Computing node embeddings...")
node_embeddings = compute_text_embeddings(node_list_string, sentence_encoder, 64, True)
print("Computing edge embeddings...")
edge_embeddings = compute_text_embeddings(edge_list_string, sentence_encoder, 64, True)
# Create FAISS indexes
def create_faiss_index(embeddings):
    if len(embeddings) == 0:
        return None
    dimension = len(embeddings[0])
    # HNSW index with inner product similarity
    index = faiss.IndexHNSWFlat(dimension, 64, faiss.METRIC_INNER_PRODUCT)
    X = np.array(embeddings).astype('float32')
    faiss.normalize_L2(X)  # normalise so inner product behaves like cosine similarity
    index.add(X)
    return index
text_faiss_index = create_faiss_index(text_embeddings)
node_faiss_index = create_faiss_index(node_embeddings)
edge_faiss_index = create_faiss_index(edge_embeddings)
graph_faiss_index = create_faiss_index(node_embeddings + edge_embeddings)
print(f"Created {len(text_embeddings)} text, {len(node_embeddings)} node, {len(edge_embeddings)} edge embeddings")
# Create comprehensive data structure with updated graph G
data = {
# Graph data (G now includes both entity/event nodes AND text nodes)
'graph': G,
'KG': G,
# Node data
'node_list': node_list,
'node_embeddings': node_embeddings,
'node_list_string': node_list_string,
'node_faiss_index': node_faiss_index,
# Edge data
'edge_list': edge_list,
'edge_embeddings': edge_embeddings,
'edge_list_string': edge_list_string,
'edge_faiss_index': edge_faiss_index,
# Combined
'node_and_edge_embeddings': node_embeddings + edge_embeddings,
    # Text data (key names are assumed to mirror the node/edge entries;
    # adjust them if the retriever expects a different schema)
    'text_dict': text_dict,
    'text_id_list': text_id_list,
    'original_text_list': original_text_list,
    'text_embeddings': text_embeddings,
    'text_faiss_index': text_faiss_index,
    # Combined graph index
'graph_faiss_index': graph_faiss_index,
}
Next, I ensured all nodes in the graph had the necessary attributes by checking each one for 'type', 'id', and 'file_id'. If a node was missing 'type', I set it to "text" if its ID matched a known text ID, or defaulted it to "entity". Missing 'id' and 'file_id' fields were both filled in with the node's own ID.
I verified the fix by printing the attributes of a few nodes, then updated the data structure to include the corrected graph. With that in place, I recreated the RAG system using HippoRAG2Retriever, connecting it to the llm_generator, the sentence_encoder, and the full knowledge graph data.
Then I tested a sample query — “Who is Alex?” — retrieving the top 2 most relevant pieces of context and passing them to the LLM to generate a final answer, confirming everything works as expected.
# Fix: Ensure ALL nodes have required attributes
print("Ensuring all nodes have required attributes...")
for node_id in G.nodes():
node_data = G.nodes[node_id]
# Ensure 'type' attribute exists
if 'type' not in node_data:
# Determine type based on node characteristics
if node_id in text_id_list:
G.nodes[node_id]['type'] = 'text'
else:
G.nodes[node_id]['type'] = 'entity' # Default for missing type
# Ensure 'id' attribute exists
if 'id' not in node_data:
G.nodes[node_id]['id'] = node_id
# Ensure 'file_id' attribute exists
if 'file_id' not in node_data:
G.nodes[node_id]['file_id'] = node_id
print("All nodes now have required attributes")
# Verify by checking a few nodes
print("Verification - checking node attributes:")
for node in list(G.nodes())[:3]:
attrs = G.nodes[node]
print(f"Node {node}: type='{attrs.get('type')}', id='{attrs.get('id')}', file_id='{attrs.get('file_id')}'")
# Update the data structure
data['graph'] = G
data['KG'] = G
# Recreate the retriever
from atlas_rag.retrieval import HippoRAG2Retriever
hipporag2_retriever = HippoRAG2Retriever(
llm_generator=llm_generator,
sentence_encoder=sentence_encoder,
data=data,
)
print("RAG system recreated successfully!")
# Test the system
content, sorted_context_ids = hipporag2_retriever.retrieve("Who is Alex?", topN=2)
print(f"Retrieved content: {content}")
sorted_context = "\n".join(content)
generate_with_context("Who is Alex?", sorted_context, max_new_tokens=2048, temperature=0.5)
print(f"Answer: {answer}")
Conclusion:
AutoSchemaKG not only demonstrates the cutting-edge progress of knowledge graph construction technology but also opens up a new direction for future intelligent information processing and knowledge management.
Through automated schema induction and knowledge extraction, knowledge graphs will become more flexible and efficient, and better able to adapt to a rapidly changing information environment.