Utilities API Reference¶
The Heritage Data Processor Utilities API provides helper endpoints for file parsing, metadata extraction, intelligent field mapping, and integration with external services like Ollama LLM.
Base URL¶
All endpoints are prefixed with /utils.
File Parsing¶
Get YAML Keys¶
Parses a YAML file and returns its top-level or nested keys, with automatic exclusion of common system keys.
Endpoint: POST /utils/get_yaml_keys
Request Body:
Request Parameters:
- file_path (string, required): Absolute path to the YAML file
- parent_key (string, optional): Parent key to retrieve nested keys from. If omitted, returns top-level keys
Response (Top-Level Keys):
Response (Nested Keys):
Key Exclusion:
When retrieving top-level keys, the following system keys are automatically excluded:
- settings
- version
- sources
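The exclusion rule can be sketched as follows. This is a minimal illustration that works on an already-parsed mapping (such as the result of loading the YAML file); the function name `list_keys` and the empty-list behaviour for a missing parent key are assumptions, not the module's confirmed implementation.

```python
# System keys that the documentation says are hidden from top-level results.
EXCLUDED_TOP_LEVEL_KEYS = {"settings", "version", "sources"}


def list_keys(data, parent_key=None):
    """Return top-level keys minus system keys, or the keys under parent_key."""
    if parent_key is not None:
        nested = data.get(parent_key, {})
        # Assumption: a missing or non-mapping parent yields an empty list.
        return list(nested) if isinstance(nested, dict) else []
    return [key for key in data if key not in EXCLUDED_TOP_LEVEL_KEYS]


parsed = {
    "version": 1,
    "settings": {"debug": False},
    "metadata_extraction": {"default": {}, "detailed": {}},
}
print(list_keys(parsed))                         # ['metadata_extraction']
print(list_keys(parsed, "metadata_extraction"))  # ['default', 'detailed']
```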
Status Codes:
- 200 OK: Keys retrieved successfully
- 404 Not Found: File not found at specified path
- 500 Internal Server Error: YAML parsing error or unexpected exception
Get Table Headers¶
Reads a CSV, TSV, or Excel file and returns its column headers.
Endpoint: POST /utils/get_table_headers
Request Body:
Request Parameters:
- file_path (string, required): Absolute path to the table file
Response:
Supported File Types:
- Excel: .xls, .xlsx (uses openpyxl engine)
- TSV: .tsv (tab-separated values)
- CSV: All other extensions, defaults to comma-separated
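The extension-based dispatch described above could look roughly like this. The helper name `classify_table_file` and the case-insensitive suffix comparison are assumptions made for illustration; the actual endpoint reads the file with pandas, which is not shown here.

```python
from pathlib import Path


def classify_table_file(file_path):
    """Return (kind, delimiter) for a table file based on its extension."""
    suffix = Path(file_path).suffix.lower()  # assumption: case-insensitive match
    if suffix in {".xls", ".xlsx"}:
        return ("excel", None)       # Excel files have no text delimiter
    if suffix == ".tsv":
        return ("delimited", "\t")   # tab-separated values
    return ("delimited", ",")        # everything else is treated as CSV


print(classify_table_file("/data/metadata.tsv"))
print(classify_table_file("/data/metadata.txt"))
```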
Status Codes:
- 200 OK: Headers retrieved successfully
- 404 Not Found: File not found at specified path
- 500 Internal Server Error: Failed to read file (corrupt file, encoding issues, etc.)
Get JSON Keys¶
Parses a JSON file and returns its top-level keys in sorted order.
Endpoint: POST /utils/get_json_keys
Request Body:
Request Parameters:
- file_path (string, required): Absolute path to the JSON file
Response:
Validation:
The endpoint validates that the JSON file contains a dictionary (object) at the root level. Array-based JSON files will return an error.
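A minimal sketch of the root-level check, using only the standard library. The function name `json_top_level_keys` and the choice of `ValueError` are illustrative assumptions; the endpoint itself reports this case as 400 Bad Request.

```python
import json


def json_top_level_keys(text):
    """Parse JSON text and return its top-level keys in sorted order."""
    data = json.loads(text)
    if not isinstance(data, dict):
        # Arrays, strings, and numbers at the root have no keys to report.
        raise ValueError("JSON root is not a dictionary")
    return sorted(data)


print(json_top_level_keys('{"title": "Medieval Manuscript", "creator": "Unknown"}'))
# ['creator', 'title']
```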
Status Codes:
- 200 OK: Keys retrieved successfully
- 400 Bad Request: Missing file_path parameter or JSON root is not a dictionary
- 404 Not Found: File not found at specified path
- 500 Internal Server Error: Invalid JSON format or file read error
Preview Spreadsheet¶
Reads and returns the first 5 rows of a CSV or Excel file for preview purposes.
Endpoint: POST /utils/preview_spreadsheet
Request Body:
Request Parameters:
- filePath (string, required): Absolute path to the spreadsheet file
Response:
{
"success": true,
"columns": [
"ID",
"Title",
"Creator"
],
"previewData": [
{
"ID": 1,
"Title": "Medieval Manuscript",
"Creator": "Unknown"
},
{
"ID": 2,
"Title": "Ancient Artifact",
"Creator": "Smith, J."
}
]
}
Response Fields:
- success (boolean): Operation success status
- columns (array): List of column headers
- previewData (array): Array of row objects (maximum 5 rows), with column headers as keys
Data Format:
The previewData uses orient="records" format from pandas, where each row is a dictionary with column names as keys.
Status Codes:
- 200 OK: Preview generated successfully
- 404 Not Found: File not found at specified path
- 500 Internal Server Error: Error reading file
Template Processing¶
Get Template Variables¶
Extracts all placeholder variables from a template file using pattern matching.
Endpoint: POST /utils/get_template_variables
Request Body:
Request Parameters:
- file_path (string, required): Absolute path to the template file
Response:
Pattern Matching:
The endpoint searches for placeholders in the format {variable}, extracting alphanumeric variable names including underscores. Variables are returned in sorted, deduplicated order.
Supported Formats:
- {variable_name}
- {VariableName123}
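The extraction can be approximated with a single regular expression. The pattern below is an assumption that matches the description (letters, digits, and underscores inside braces); the function name `template_variables` is hypothetical.

```python
import re

# Assumed pattern: one or more word characters between curly braces.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def template_variables(text):
    """Return the unique placeholder names found in text, sorted."""
    return sorted(set(PLACEHOLDER.findall(text)))


template = "Record {item_id}: {title} by {creator}. See also {title}."
print(template_variables(template))  # ['creator', 'item_id', 'title']
```

Python's default string sort is case-sensitive, so uppercase names sort before lowercase ones.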
Status Codes:
- 200 OK: Variables extracted successfully
- 404 Not Found: File not found or path not provided
- 500 Internal Server Error: File read error
Get Prompt Placeholders¶
Parses a prompts YAML file and finds all unique {placeholder} strings within a specific prompt configuration.
Endpoint: POST /utils/get_prompt_placeholders
Request Body:
Request Parameters:
- file_path (string, required): Absolute path to the prompts YAML file
- prompt_id (string, required): Identifier of the prompt section to analyze
Response:
Search Scope:
The endpoint searches for placeholders within the system and user fields of all prompt variations under the specified prompt_id. Placeholders are extracted using regex pattern \{(\w+)\} and returned in sorted order.
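The search scope can be illustrated on an already-parsed prompts mapping. This sketch assumes the YAML structure is prompt_id, then variation key, then system and user fields; the function name `prompt_placeholders` is hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")  # the pattern quoted in the text above


def prompt_placeholders(prompts, prompt_id):
    """Collect placeholders from system/user fields of every variation."""
    found = set()
    for variation in prompts.get(prompt_id, {}).values():
        if not isinstance(variation, dict):
            continue  # assumption: non-mapping entries are ignored
        for field in ("system", "user"):
            found.update(PLACEHOLDER.findall(str(variation.get(field, ""))))
    return sorted(found)


prompts = {
    "metadata_extraction": {
        "default": {
            "system": "You curate {domain} data.",
            "user": "File: {filename}. Title: {title}.",
        },
        "concise": {"user": "Summarize {title}."},
    }
}
print(prompt_placeholders(prompts, "metadata_extraction"))
# ['domain', 'filename', 'title']
```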
Status Codes:
- 200 OK: Placeholders extracted successfully
- 400 Bad Request: Missing file_path or prompt_id, or file not found
- 500 Internal Server Error: YAML parsing error or unexpected exception
Get Prompt Content¶
Retrieves the system prompt, user prompt, and suggested model from a prompts YAML file.
Endpoint: POST /utils/get_prompt_content
Request Body:
{
"file_path": "/config/prompts.yaml",
"prompt_id": "metadata_extraction",
"prompt_key": "default"
}
Request Parameters:
- file_path (string, required): Absolute path to the prompts YAML file
- prompt_id (string, required): Top-level prompt identifier
- prompt_key (string, required): Sub-key within the prompt ID (e.g., default, detailed, concise)
Response:
{
"success": true,
"system": "You are an expert metadata curator specializing in cultural heritage data.",
"user": "Extract metadata from the following file: {filename}. The file contains information about {title}.",
"suggested_model": "llama3.1:8b"
}
Response Fields:
- success (boolean): Operation success status
- system (string): System prompt text, or "Not defined." if missing
- user (string): User prompt template text, or "Not defined." if missing
- suggested_model (string, nullable): Recommended model identifier, or null if not specified
Status Codes:
- 200 OK: Prompt content retrieved successfully
- 400 Bad Request: Missing required fields
- 500 Internal Server Error: File read or YAML parsing error
Schema Processing¶
Get Template Mapping Info¶
Analyzes a template file against XML Schema Definition (XSD) files to extract controlled vocabularies and element type mappings.
Endpoint: POST /utils/get_template_mapping_info
Request Body:
Request Parameters:
- schema_dir (string, required): Path to directory containing .xsd schema files
- template_file (string, required): Path to template file containing ${variable} placeholders
Response:
{
"success": true,
"variables": [
{
"name": "dc_type",
"has_vocab": true,
"vocab_values": [
"TEXT",
"IMAGE",
"VIDEO",
"SOUND",
"3D"
]
},
{
"name": "dc_creator",
"has_vocab": false,
"vocab_values": []
}
]
}
Response Fields:
- success (boolean): Operation success status
- variables (array): List of template variable objects
  - name (string): Variable name extracted from template
  - has_vocab (boolean): Whether a controlled vocabulary exists for this variable
  - vocab_values (array): List of allowed values if controlled vocabulary exists
Processing Logic:
The endpoint performs the following steps:
- Schema Parsing: Parses all .xsd files in the schema directory to build type definitions and element-to-type mappings
- Variable Extraction: Extracts variables from template using pattern ${variable_name}
- Namespace Resolution: Splits variables like dc_creator into prefix (dc) and local name (creator)
- Type Lookup: Maps element names to their schema types
- Vocabulary Extraction: Retrieves enumeration values for simple types with controlled vocabularies
Supported Namespace Prefixes:
- dc: Dublin Core Elements (http://purl.org/dc/elements/1.1/)
- edm: Europeana Data Model (http://www.europeana.eu/schemas/edm/)
- dcterms: Dublin Core Terms (http://purl.org/dc/terms/)
- skos: Simple Knowledge Organization System (http://www.w3.org/2004/02/skos/core#)
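The variable extraction and namespace resolution steps might be sketched like this. Splitting on the first underscore and returning None for unknown prefixes are assumptions; the function name `resolve_variables` is hypothetical, and the schema type lookup (which needs lxml) is left out.

```python
import re

# Prefix-to-namespace table taken from the list above.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "dcterms": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
VARIABLE = re.compile(r"\$\{(\w+)\}")  # matches ${variable_name}


def resolve_variables(template_text):
    """Return (variable, namespace URI or None, local name) tuples."""
    resolved = []
    for name in sorted(set(VARIABLE.findall(template_text))):
        # Assumption: the prefix is everything before the first underscore.
        prefix, separator, local = name.partition("_")
        if separator and prefix in NAMESPACES:
            resolved.append((name, NAMESPACES[prefix], local))
        else:
            resolved.append((name, None, name))
    return resolved


template = "<dc:creator>${dc_creator}</dc:creator> <edm:type>${edm_type}</edm:type>"
for entry in resolve_variables(template):
    print(entry)
```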
Dependency:
Requires lxml library to be installed. If not available, logs a warning and returns empty mappings.
Status Codes:
- 200 OK: Template analyzed successfully
- 400 Bad Request: Missing schema_dir or template_file
- 500 Internal Server Error: Schema parsing error or template read error
Intelligent Mapping¶
Automap Zenodo Fields¶
Automatically suggests mappings from table column headers to Zenodo metadata fields using keyword matching and similarity algorithms.
Endpoint: POST /utils/automap_zenodo_fields
Request Body:
Request Parameters:
- headers (array, required): List of column header names from a table
Response:
{
"success": true,
"mapping": {
"title": "Title",
"creators": "Author Name",
"publication_date": "Publication Date",
"description": "Summary",
"keywords": "Tags"
}
}
Response Fields:
- success (boolean): Operation success status
- mapping (object): Dictionary mapping Zenodo field names to table column headers
Mapping Algorithm:
The endpoint uses a two-level matching strategy:
Level 1 - Keyword Matching: Exact match of normalized header against predefined keywords for each Zenodo field
Level 2 - Similarity Matching: For remaining unmapped fields, uses difflib.SequenceMatcher with threshold of 0.8 to find best matches
Zenodo Field Keywords:
- title: ["title", "headline", "name"]
- description: ["description", "summary", "abstract"]
- creators: ["author", "creator", "artist", "writer"]
- publication_date: ["date", "publicationdate", "pub_date", "year"]
- keywords: ["keywords", "tags", "subjects"]
Normalization:
Headers and keywords are normalized by converting to lowercase and removing spaces and underscores.
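The two-level strategy can be sketched with the keyword table and normalization rule given above. Several details are assumptions: Level 2 compares headers against each field's keywords (rather than the field name), the threshold is inclusive, and each header is used at most once. The function names are hypothetical.

```python
from difflib import SequenceMatcher

ZENODO_KEYWORDS = {
    "title": ["title", "headline", "name"],
    "description": ["description", "summary", "abstract"],
    "creators": ["author", "creator", "artist", "writer"],
    "publication_date": ["date", "publicationdate", "pub_date", "year"],
    "keywords": ["keywords", "tags", "subjects"],
}


def normalize(text):
    """Lowercase and strip spaces and underscores."""
    return text.lower().replace(" ", "").replace("_", "")


def automap_zenodo(headers, threshold=0.8):
    normalized = {header: normalize(header) for header in headers}
    mapping, used = {}, set()
    # Level 1: exact match of the normalized header against a keyword.
    for field, keywords in ZENODO_KEYWORDS.items():
        wanted = {normalize(keyword) for keyword in keywords}
        for header, norm in normalized.items():
            if header not in used and norm in wanted:
                mapping[field] = header
                used.add(header)
                break
    # Level 2: best similarity score for fields that are still unmapped.
    for field, keywords in ZENODO_KEYWORDS.items():
        if field in mapping:
            continue
        best, best_score = None, threshold
        for header, norm in normalized.items():
            if header in used:
                continue
            score = max(
                SequenceMatcher(None, norm, normalize(keyword)).ratio()
                for keyword in keywords
            )
            if score >= best_score:
                best, best_score = header, score
        if best is not None:
            mapping[field] = best
            used.add(best)
    return mapping


print(automap_zenodo(["Title", "Authors", "Pub Date", "Abstract", "Tags"]))
```

In this sketch "Authors" has no exact keyword match and is picked up by Level 2 through its similarity to "author".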
Status Codes:
- 200 OK: Mapping generated successfully
- 400 Bad Request: No headers provided
Automap Columns¶
Maps table columns to prompt placeholders using a four-level matching algorithm.
Endpoint: POST /utils/automap_columns
Request Body:
{
"table_path": "/data/metadata.csv",
"prompts_path": "/config/prompts.yaml",
"prompt_id": "metadata_extraction"
}
Request Parameters:
- table_path (string, required): Path to CSV or Excel file containing data
- prompts_path (string, required): Path to YAML file containing prompts
- prompt_id (string, required): Identifier of the prompt to analyze
Response:
{
"success": true,
"mapping": {
"title": "Title",
"creator": "Author",
"date": "Creation_Date",
"description": "Summary"
}
}
Response Fields:
- success (boolean): Operation success status
- mapping (object): Dictionary mapping placeholder names to table column headers
Four-Level Matching Algorithm:
The endpoint employs an optimized, layered matching strategy:
Level 1 - Exact Match: Direct string equality between placeholder and header (most efficient)
Level 2 - Normalized Match: Comparison after converting to lowercase and removing spaces/underscores
Level 3 - Plural/Singular Match: Handles simple plural forms by adding/removing trailing 's'
Level 4 - Similarity Match: Uses difflib.SequenceMatcher with threshold of 0.8 for fuzzy matching
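A compact sketch of the four levels, written as scoring functions applied in order. The order of traversal (each level is tried for all unmapped placeholders before moving to the next), the inclusive threshold, and the function names are assumptions rather than the module's confirmed behaviour.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip spaces and underscores."""
    return text.lower().replace(" ", "").replace("_", "")


def automap_columns(placeholders, headers, threshold=0.8):
    # Normalize every string once up front.
    norm = {name: normalize(name) for name in set(placeholders) | set(headers)}
    levels = (
        # Level 1: exact string equality.
        lambda p, h: 1.0 if p == h else 0.0,
        # Level 2: equality after normalization.
        lambda p, h: 1.0 if norm[p] == norm[h] else 0.0,
        # Level 3: simple plural/singular forms (trailing "s").
        lambda p, h: 1.0 if norm[p] + "s" == norm[h] or norm[h] + "s" == norm[p] else 0.0,
        # Level 4: fuzzy similarity ratio.
        lambda p, h: SequenceMatcher(None, norm[p], norm[h]).ratio(),
    )
    mapping, used = {}, set()
    for score in levels:
        for placeholder in placeholders:
            if placeholder in mapping:
                continue
            candidates = [h for h in headers if h not in used]
            if not candidates:
                break
            best = max(candidates, key=lambda h: score(placeholder, h))
            if score(placeholder, best) >= threshold:
                mapping[placeholder] = best
                used.add(best)  # each header is mapped at most once
    return mapping


print(automap_columns(
    ["title", "creator", "keyword", "identifier"],
    ["Title", "creator", "Key_Words", "Identifer", "Notes"],
))
```

Running the levels in order across all placeholders means a fuzzy match can never claim a header that another placeholder matches exactly.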
Performance Optimizations:
- Pre-normalizes all strings once for efficiency
- Uses compiled regex pattern cached at function level
- Optimizes text concatenation using join()
- Early returns when no placeholders or headers found
- Tracks used headers to prevent duplicate mappings
Status Codes:
- 200 OK: Mapping completed successfully
- 400 Bad Request: Missing required paths, prompt ID, or invalid YAML format
- 404 Not Found: File not found
- 500 Internal Server Error: Unexpected error during mapping
External Service Integration¶
List Ollama Models¶
Lists available Large Language Models from the Ollama service.
Endpoint: GET /utils/list_ollama_models
Response:
Response Fields:
- success (boolean): Operation success status
- models (array): List of model names available in Ollama
Error Responses:
Library Not Installed:
Service Not Reachable:
{
"success": false,
"error": "Ollama not reachable.",
"details": "Connection refused: localhost:11434"
}
Dependencies:
- Requires ollama Python library to be installed
- Requires Ollama service to be running and accessible
Status Codes:
200 OK: Models retrieved successfully or error message returned
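Because this endpoint returns 200 OK even when Ollama is unavailable, a client has to inspect the success flag rather than the HTTP status. A minimal sketch of that check on a decoded response body follows; the function name `extract_models` and the use of `RuntimeError` are illustrative assumptions.

```python
def extract_models(payload):
    """Return the model list from a decoded response, or raise on failure."""
    if payload.get("success"):
        return list(payload.get("models", []))
    message = payload.get("error", "Unknown error")
    details = payload.get("details")
    # Include the details field when the service provides one.
    raise RuntimeError(f"{message} ({details})" if details else message)


failure = {
    "success": False,
    "error": "Ollama not reachable.",
    "details": "Connection refused: localhost:11434",
}
try:
    extract_models(failure)
except RuntimeError as exc:
    print(exc)
```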
Error Handling¶
Common Error Scenarios¶
File Not Found:
Response: 404 Not Found
Missing Required Parameters:
Response: 400 Bad Request
Invalid JSON Structure:
Where array.json contains a JSON array, rather than an object, at the root level.
Response: 400 Bad Request
YAML Parsing Error:
Response: 500 Internal Server Error
Usage Examples¶
Example 1: Extract Placeholders and Auto-Map Columns¶
Step 1: Get placeholders from prompts
POST /utils/get_prompt_placeholders
Content-Type: application/json
{
"file_path": "/config/prompts.yaml",
"prompt_id": "description_generation"
}
Step 2: Auto-map to table columns
POST /utils/automap_columns
Content-Type: application/json
{
"table_path": "/data/artifacts.csv",
"prompts_path": "/config/prompts.yaml",
"prompt_id": "description_generation"
}
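The two requests above could be prepared from Python with the standard library as sketched here. The host and port in `BASE_URL` are hypothetical, as is the helper name `build_utils_request`; sending the request (for example with `urllib.request.urlopen`) is left to the caller.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # hypothetical host and port


def build_utils_request(endpoint, payload):
    """Build a JSON POST request for a /utils endpoint without sending it."""
    return Request(
        url=f"{BASE_URL}/utils/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_utils_request(
    "automap_columns",
    {
        "table_path": "/data/artifacts.csv",
        "prompts_path": "/config/prompts.yaml",
        "prompt_id": "description_generation",
    },
)
print(request.get_method(), request.full_url)
```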
Example 2: Preview Spreadsheet and Map to Zenodo¶
Then use the returned headers:
POST /utils/automap_zenodo_fields
Content-Type: application/json
{
"headers": ["Title", "Author", "Date", "Abstract"]
}
Example 3: Analyze Template with Schema Vocabularies¶
POST /utils/get_template_mapping_info
Content-Type: application/json
{
"schema_dir": "/schemas/europeana",
"template_file": "/templates/edm_record.xml"
}
Example 4: List Available LLM Models¶
Logging¶
Application Logger¶
The module uses Python's standard logging with logger name utils_bp:
Error Level:
- "Error in get_yaml_keys: {exception}"
- "Error in get_table_headers: {exception}"
- "Error in get_prompt_placeholders: {exception}"
- "Error in preview_spreadsheet: {exception}"
- "Failed to get JSON keys from {path}: {exception}"
- "Failed to get template mapping info: {exception}"
- "Automapping failed: {exception}"
- "Failed to get template variables: {exception}"
Warning Level:
- "lxml is not installed. Cannot parse XML schemas."
- "Could not parse schema file {filename}: {exception}"
All error logs include full stack traces with exc_info=True.
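The logging convention can be illustrated with the standard logging module. The logger name and message text come from the lists above; the helper function `read_text_or_none` is a hypothetical stand-in for an endpoint body.

```python
import logging

logger = logging.getLogger("utils_bp")


def read_text_or_none(path):
    """Read a file, logging any failure with a full stack trace."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # exc_info=True attaches the traceback to the log record.
        logger.error("Failed to get template variables: %s", exc, exc_info=True)
        return None
```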
Dependencies¶
Required Libraries¶
- pandas: For reading CSV/Excel files
- pyyaml: For parsing YAML configuration files
- pathlib: For cross-platform path handling
- difflib: For similarity-based string matching
- re: For regex pattern matching
- json: For JSON file parsing
Optional Libraries¶
- ollama: For LLM model integration (gracefully degrades if not installed)
- lxml: For XML schema parsing (falls back to empty results if not installed)