Utilities API Reference

The Heritage Data Processor Utilities API provides helper endpoints for file parsing, metadata extraction, intelligent field mapping, and integration with external services such as Ollama for local LLM inference.

Base URL

All endpoints are prefixed with /utils.


File Parsing

Get YAML Keys

Parses a YAML file and returns its top-level or nested keys, with automatic exclusion of common system keys.

Endpoint: POST /utils/get_yaml_keys

Request Body:

{
  "file_path": "/config/metadata_mapping.yaml",
  "parent_key": "zenodo_mappings"
}

Request Parameters:

  • file_path (string, required): Absolute path to the YAML file
  • parent_key (string, optional): Parent key to retrieve nested keys from. If omitted, returns top-level keys

Response (Top-Level Keys):

{
  "keys": [
    "metadata_extraction",
    "description_generation",
    "keyword_suggestion"
  ]
}

Response (Nested Keys):

{
  "keys": [
    "title_mapping",
    "creator_mapping",
    "date_mapping"
  ]
}

Key Exclusion:

When retrieving top-level keys, the following system keys are automatically excluded:

  • settings
  • version
  • sources

Status Codes:

  • 200 OK: Keys retrieved successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: YAML parsing error or unexpected exception
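The key-selection behavior can be sketched as follows. This is a minimal illustration, not the server code; it assumes the YAML file has already been loaded into a dict (e.g. via yaml.safe_load), and the helper name select_keys is hypothetical:

```python
# Illustrative sketch of the key-selection logic described above.
EXCLUDED_KEYS = {"settings", "version", "sources"}

def select_keys(data, parent_key=None):
    """Return nested keys under parent_key, or filtered top-level keys."""
    if parent_key is not None:
        node = data.get(parent_key, {})
        # Only dict nodes have nested keys
        return list(node) if isinstance(node, dict) else []
    return [k for k in data if k not in EXCLUDED_KEYS]
```

With no parent_key, system keys are filtered out of the top level; with a parent_key, the nested keys under it are returned unfiltered.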

Get Table Headers

Reads a CSV, TSV, or Excel file and returns its column headers.

Endpoint: POST /utils/get_table_headers

Request Body:

{
  "file_path": "/data/metadata.csv"
}

Request Parameters:

  • file_path (string, required): Absolute path to the table file

Response:

{
  "success": true,
  "headers": [
    "ID",
    "Title",
    "Creator",
    "Date",
    "Description",
    "Keywords"
  ]
}

Supported File Types:

  • Excel: .xls, .xlsx (uses openpyxl engine)
  • TSV: .tsv (tab-separated values)
  • CSV: All other extensions, defaults to comma-separated

Status Codes:

  • 200 OK: Headers retrieved successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Failed to read file (corrupt file, encoding issues, etc.)
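The extension-based dispatch can be sketched with the standard library alone. Note this is an approximation: the real endpoint reads files with pandas (and openpyxl for Excel); the sketch below covers only the delimiter choice for text files, and the helper name read_headers is hypothetical:

```python
import csv
from pathlib import Path

def read_headers(file_path):
    """Return the header row of a CSV/TSV file.

    Sketch only: the Excel branch (pandas + openpyxl) is omitted.
    """
    path = Path(file_path)
    # .tsv is tab-separated; everything else defaults to comma-separated
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with path.open(newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter))
```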

Get JSON Keys

Parses a JSON file and returns its top-level keys in sorted order.

Endpoint: POST /utils/get_json_keys

Request Body:

{
  "file_path": "/output/pipeline_results.json"
}

Request Parameters:

  • file_path (string, required): Absolute path to the JSON file

Response:

{
  "success": true,
  "keys": [
    "metadata",
    "results",
    "summary",
    "timestamp",
    "validation_status"
  ]
}

Validation:

The endpoint validates that the JSON file contains a dictionary (object) at the root level. Array-based JSON files will return an error.
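A minimal sketch of this validation (illustrative; the helper name get_json_keys mirrors the endpoint but is not the server code):

```python
import json

def get_json_keys(file_path):
    """Return sorted top-level keys; reject non-object roots."""
    with open(file_path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        # Array-based (or scalar) JSON roots are rejected, as documented
        raise ValueError("The provided file is not a JSON object (dictionary).")
    return sorted(data)
```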

Status Codes:

  • 200 OK: Keys retrieved successfully
  • 400 Bad Request: Missing file_path parameter or JSON root is not a dictionary
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Invalid JSON format or file read error

Preview Spreadsheet

Reads and returns the first 5 rows of a CSV or Excel file for preview purposes.

Endpoint: POST /utils/preview_spreadsheet

Request Body:

{
  "filePath": "/data/metadata.xlsx"
}

Request Parameters:

  • filePath (string, required): Absolute path to the spreadsheet file

Response:

{
  "success": true,
  "columns": [
    "ID",
    "Title",
    "Creator"
  ],
  "previewData": [
    {
      "ID": 1,
      "Title": "Medieval Manuscript",
      "Creator": "Unknown"
    },
    {
      "ID": 2,
      "Title": "Ancient Artifact",
      "Creator": "Smith, J."
    }
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • columns (array): List of column headers
  • previewData (array): Array of row objects (maximum 5 rows), with column headers as keys

Data Format:

The previewData field is produced with pandas' to_dict(orient="records"), so each row is a dictionary keyed by column name.
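The records shape can be reproduced without pandas; this hypothetical to_records helper shows the structure of the response:

```python
def to_records(columns, rows, limit=5):
    """Mimic DataFrame.head(limit).to_dict(orient="records"):
    each row becomes a dict keyed by column name."""
    return [dict(zip(columns, row)) for row in rows[:limit]]
```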

Status Codes:

  • 200 OK: Preview generated successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Error reading file

Template Processing

Get Template Variables

Extracts all placeholder variables from a template file using pattern matching.

Endpoint: POST /utils/get_template_variables

Request Body:

{
  "file_path": "/templates/description_template.txt"
}

Request Parameters:

  • file_path (string, required): Absolute path to the template file

Response:

{
  "success": true,
  "variables": [
    "creator",
    "date",
    "description",
    "title"
  ]
}

Pattern Matching:

The endpoint searches for placeholders in the format {variable}, extracting alphanumeric variable names including underscores. Variables are returned in sorted, deduplicated order.

Supported Formats:

  • {variable_name}
  • {VariableName123}
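The extraction described above amounts to a single regex pass; a minimal sketch:

```python
import re

def extract_variables(template_text):
    """Find {word} placeholders; return them deduplicated and sorted."""
    return sorted(set(re.findall(r"\{(\w+)\}", template_text)))
```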

Status Codes:

  • 200 OK: Variables extracted successfully
  • 404 Not Found: File not found or path not provided
  • 500 Internal Server Error: File read error

Get Prompt Placeholders

Parses a prompts YAML file and finds all unique {placeholder} strings within a specific prompt configuration.

Endpoint: POST /utils/get_prompt_placeholders

Request Body:

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction"
}

Request Parameters:

  • file_path (string, required): Absolute path to the prompts YAML file
  • prompt_id (string, required): Identifier of the prompt section to analyze

Response:

{
  "success": true,
  "placeholders": [
    "creator",
    "filename",
    "format",
    "title"
  ]
}

Search Scope:

The endpoint searches for placeholders within the system and user fields of all prompt variations under the specified prompt_id. Placeholders are extracted using regex pattern \{(\w+)\} and returned in sorted order.
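The search scope can be sketched as follows, assuming the prompt section has already been loaded from YAML into a dict of variations (the exact server-side data access is an assumption):

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def collect_placeholders(prompt_section):
    """Scan the system/user fields of every variation under a prompt ID."""
    found = set()
    for variation in prompt_section.values():
        if isinstance(variation, dict):
            for field in ("system", "user"):
                found.update(PLACEHOLDER.findall(variation.get(field, "") or ""))
    return sorted(found)
```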

Status Codes:

  • 200 OK: Placeholders extracted successfully
  • 400 Bad Request: Missing file_path or prompt_id, or file not found
  • 500 Internal Server Error: YAML parsing error or unexpected exception

Get Prompt Content

Retrieves the system prompt, user prompt, and suggested model from a prompts YAML file.

Endpoint: POST /utils/get_prompt_content

Request Body:

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction",
  "prompt_key": "default"
}

Request Parameters:

  • file_path (string, required): Absolute path to the prompts YAML file
  • prompt_id (string, required): Top-level prompt identifier
  • prompt_key (string, required): Sub-key within the prompt ID (e.g., default, detailed, concise)

Response:

{
  "success": true,
  "system": "You are an expert metadata curator specializing in cultural heritage data.",
  "user": "Extract metadata from the following file: {filename}. The file contains information about {title}.",
  "suggested_model": "llama3.1:8b"
}

Response Fields:

  • success (boolean): Operation success status
  • system (string): System prompt text, or "Not defined." if missing
  • user (string): User prompt template text, or "Not defined." if missing
  • suggested_model (string, nullable): Recommended model identifier, or null if not specified

Status Codes:

  • 200 OK: Prompt content retrieved successfully
  • 400 Bad Request: Missing required fields
  • 500 Internal Server Error: File read or YAML parsing error

Schema Processing

Get Template Mapping Info

Analyzes a template file against XML Schema Definition (XSD) files to extract controlled vocabularies and element type mappings.

Endpoint: POST /utils/get_template_mapping_info

Request Body:

{
  "schema_dir": "/schemas/edm",
  "template_file": "/templates/edm_template.xml"
}

Request Parameters:

  • schema_dir (string, required): Path to directory containing .xsd schema files
  • template_file (string, required): Path to template file containing ${variable} placeholders

Response:

{
  "success": true,
  "variables": [
    {
      "name": "dc_type",
      "has_vocab": true,
      "vocab_values": [
        "TEXT",
        "IMAGE",
        "VIDEO",
        "SOUND",
        "3D"
      ]
    },
    {
      "name": "dc_creator",
      "has_vocab": false,
      "vocab_values": []
    }
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • variables (array): List of template variable objects
  • name (string): Variable name extracted from template
  • has_vocab (boolean): Whether a controlled vocabulary exists for this variable
  • vocab_values (array): List of allowed values if controlled vocabulary exists

Processing Logic:

The endpoint performs the following steps:

  1. Schema Parsing: Parses all .xsd files in the schema directory to build type definitions and element-to-type mappings
  2. Variable Extraction: Extracts variables from template using pattern ${variable_name}
  3. Namespace Resolution: Splits variables like dc_creator into prefix (dc) and local name (creator)
  4. Type Lookup: Maps element names to their schema types
  5. Vocabulary Extraction: Retrieves enumeration values for simple types with controlled vocabularies
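Steps 2 and 3 can be sketched in isolation (illustrative only; the real endpoint additionally resolves types and vocabularies from the lxml-parsed schemas):

```python
import re

def split_template_variables(template_text):
    """Extract ${name} variables and split them into (prefix, local_name).

    Variables without an underscore get a None prefix.
    """
    results = []
    for name in sorted(set(re.findall(r"\$\{(\w+)\}", template_text))):
        prefix, _, local = name.partition("_")
        results.append((prefix, local) if local else (None, name))
    return results
```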

Supported Namespace Prefixes:

  • dc: Dublin Core Elements (http://purl.org/dc/elements/1.1/)
  • edm: Europeana Data Model (http://www.europeana.eu/schemas/edm/)
  • dcterms: Dublin Core Terms (http://purl.org/dc/terms/)
  • skos: Simple Knowledge Organization System (http://www.w3.org/2004/02/skos/core#)

Dependency:

Requires the lxml library. If it is not installed, the endpoint logs a warning and returns empty mappings.

Status Codes:

  • 200 OK: Template analyzed successfully
  • 400 Bad Request: Missing schema_dir or template_file
  • 500 Internal Server Error: Schema parsing error or template read error

Intelligent Mapping

Automap Zenodo Fields

Automatically suggests mappings from table column headers to Zenodo metadata fields using keyword matching and similarity algorithms.

Endpoint: POST /utils/automap_zenodo_fields

Request Body:

{
  "headers": [
    "Title",
    "Author Name",
    "Publication Date",
    "Summary",
    "Tags"
  ]
}

Request Parameters:

  • headers (array, required): List of column header names from a table

Response:

{
  "success": true,
  "mapping": {
    "title": "Title",
    "creators": "Author Name",
    "publication_date": "Publication Date",
    "description": "Summary",
    "keywords": "Tags"
  }
}

Response Fields:

  • success (boolean): Operation success status
  • mapping (object): Dictionary mapping Zenodo field names to table column headers

Mapping Algorithm:

The endpoint uses a two-level matching strategy:

Level 1 - Keyword Matching: Exact match of normalized header against predefined keywords for each Zenodo field

Level 2 - Similarity Matching: For remaining unmapped fields, uses difflib.SequenceMatcher with threshold of 0.8 to find best matches

Zenodo Field Keywords:

  • title: ["title", "headline", "name"]
  • description: ["description", "summary", "abstract"]
  • creators: ["author", "creator", "artist", "writer"]
  • publication_date: ["date", "publicationdate", "pub_date", "year"]
  • keywords: ["keywords", "tags", "subjects"]

Normalization:

Headers and keywords are normalized by converting to lowercase and removing spaces and underscores.
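The two-level strategy can be sketched as below. This is a simplified reconstruction from the documented behavior, not the server code; in particular, the assumption that Level 2 fuzzy-matches headers against the same keyword lists is ours:

```python
from difflib import SequenceMatcher

FIELD_KEYWORDS = {
    "title": ["title", "headline", "name"],
    "description": ["description", "summary", "abstract"],
    "creators": ["author", "creator", "artist", "writer"],
    "publication_date": ["date", "publicationdate", "pub_date", "year"],
    "keywords": ["keywords", "tags", "subjects"],
}

def normalize(s):
    # Lowercase, strip spaces and underscores (as documented)
    return s.lower().replace(" ", "").replace("_", "")

def automap_zenodo(headers):
    mapping, used = {}, set()
    norm = {h: normalize(h) for h in headers}
    # Level 1: exact match of normalized header against keywords
    for field, kws in FIELD_KEYWORDS.items():
        kwset = {normalize(k) for k in kws}
        for h in headers:
            if h not in used and norm[h] in kwset:
                mapping[field] = h
                used.add(h)
                break
    # Level 2: SequenceMatcher >= 0.8 against the keywords (assumed comparand)
    for field, kws in FIELD_KEYWORDS.items():
        if field in mapping:
            continue
        best, best_score = None, 0.8
        for h in headers:
            if h in used:
                continue
            score = max(SequenceMatcher(None, normalize(k), norm[h]).ratio() for k in kws)
            if score >= best_score:
                best, best_score = h, score
        if best:
            mapping[field] = best
            used.add(best)
    return mapping
```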

Status Codes:

  • 200 OK: Mapping generated successfully
  • 400 Bad Request: No headers provided

Automap Columns

Intelligently maps table columns to prompt placeholders using a four-level matching algorithm.

Endpoint: POST /utils/automap_columns

Request Body:

{
  "table_path": "/data/metadata.csv",
  "prompts_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction"
}

Request Parameters:

  • table_path (string, required): Path to CSV or Excel file containing data
  • prompts_path (string, required): Path to YAML file containing prompts
  • prompt_id (string, required): Identifier of the prompt to analyze

Response:

{
  "success": true,
  "mapping": {
    "title": "Title",
    "creator": "Author",
    "date": "Creation_Date",
    "description": "Summary"
  }
}

Response Fields:

  • success (boolean): Operation success status
  • mapping (object): Dictionary mapping placeholder names to table column headers

Four-Level Matching Algorithm:

The endpoint employs an optimized, layered matching strategy:

Level 1 - Exact Match: Direct string equality between placeholder and header (most efficient)

Level 2 - Normalized Match: Comparison after converting to lowercase and removing spaces/underscores

Level 3 - Plural/Singular Match: Handles simple plural forms by adding/removing trailing 's'

Level 4 - Similarity Match: Uses difflib.SequenceMatcher with threshold of 0.8 for fuzzy matching
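The four levels above can be sketched as follows; this is a simplified reconstruction of the documented strategy (without the caching and pre-normalization optimizations), not the server implementation:

```python
from difflib import SequenceMatcher

def normalize(s):
    return s.lower().replace(" ", "").replace("_", "")

def match_columns(placeholders, headers):
    mapping, used = {}, set()
    norm_h = {h: normalize(h) for h in headers}
    for p in placeholders:
        np = normalize(p)
        candidate = None
        # Level 1: exact string equality
        for h in headers:
            if h not in used and h == p:
                candidate = h
                break
        # Level 2: normalized equality
        if candidate is None:
            for h in headers:
                if h not in used and norm_h[h] == np:
                    candidate = h
                    break
        # Level 3: simple plural/singular (trailing 's')
        if candidate is None:
            for h in headers:
                if h not in used and (norm_h[h] == np + "s" or norm_h[h] + "s" == np):
                    candidate = h
                    break
        # Level 4: SequenceMatcher similarity >= 0.8
        if candidate is None:
            best, best_score = None, 0.8
            for h in headers:
                if h in used:
                    continue
                score = SequenceMatcher(None, np, norm_h[h]).ratio()
                if score >= best_score:
                    best, best_score = h, score
            candidate = best
        if candidate:
            mapping[p] = candidate
            used.add(candidate)  # prevent duplicate mappings
    return mapping
```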

Performance Optimizations:

  • Pre-normalizes all strings once for efficiency
  • Uses compiled regex pattern cached at function level
  • Optimizes text concatenation using join()
  • Early returns when no placeholders or headers found
  • Tracks used headers to prevent duplicate mappings

Status Codes:

  • 200 OK: Mapping completed successfully
  • 400 Bad Request: Missing required paths, prompt ID, or invalid YAML format
  • 404 Not Found: File not found
  • 500 Internal Server Error: Unexpected error during mapping

External Service Integration

List Ollama Models

Lists available Large Language Models from the Ollama service.

Endpoint: GET /utils/list_ollama_models

Response:

{
  "success": true,
  "models": [
    "llama3.1:8b",
    "llama3.1:70b",
    "mistral:7b",
    "codellama:13b"
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • models (array): List of model names available in Ollama

Error Responses:

Library Not Installed:

{
  "success": false,
  "error": "Ollama library not installed on the server."
}

Service Not Reachable:

{
  "success": false,
  "error": "Ollama not reachable.",
  "details": "Connection refused: localhost:11434"
}

Dependencies:

  • Requires ollama Python library to be installed
  • Requires Ollama service to be running and accessible

Status Codes:

  • 200 OK: Always returned; check the success field to distinguish successful listings from error responses

Error Handling

Common Error Scenarios

File Not Found:

POST /utils/get_yaml_keys
{"file_path": "/nonexistent/file.yaml"}

Response: 404 Not Found

{
  "error": "File not found"
}

Missing Required Parameters:

POST /utils/get_prompt_content
{"file_path": "/config/prompts.yaml"}

Response: 400 Bad Request

{
  "error": "Missing required fields"
}

Invalid JSON Structure:

POST /utils/get_json_keys
{"file_path": "/data/array.json"}

Where array.json contains a JSON array rather than an object at the root.

Response: 400 Bad Request

{
  "success": false,
  "error": "The provided file is not a JSON object (dictionary)."
}

YAML Parsing Error:

Response: 500 Internal Server Error

{
  "success": false,
  "error": "Invalid YAML format: ..."
}

Usage Examples

Example 1: Extract Placeholders and Auto-Map Columns

Step 1: Get placeholders from prompts

POST /utils/get_prompt_placeholders
Content-Type: application/json

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "description_generation"
}

Step 2: Auto-map to table columns

POST /utils/automap_columns
Content-Type: application/json

{
  "table_path": "/data/artifacts.csv",
  "prompts_path": "/config/prompts.yaml",
  "prompt_id": "description_generation"
}

Example 2: Preview Spreadsheet and Map to Zenodo

POST /utils/preview_spreadsheet
Content-Type: application/json

{"filePath": "/data/metadata.xlsx"}

Then use the returned headers:

POST /utils/automap_zenodo_fields
Content-Type: application/json

{
  "headers": ["Title", "Author", "Date", "Abstract"]
}

Example 3: Analyze Template with Schema Vocabularies

POST /utils/get_template_mapping_info
Content-Type: application/json

{
  "schema_dir": "/schemas/europeana",
  "template_file": "/templates/edm_record.xml"
}

Example 4: List Available LLM Models

GET /utils/list_ollama_models

Logging

Application Logger

The module uses Python's standard logging with logger name utils_bp:

Error Level:

  • "Error in get_yaml_keys: {exception}"
  • "Error in get_table_headers: {exception}"
  • "Error in get_prompt_placeholders: {exception}"
  • "Error in preview_spreadsheet: {exception}"
  • "Failed to get JSON keys from {path}: {exception}"
  • "Failed to get template mapping info: {exception}"
  • "Automapping failed: {exception}"
  • "Failed to get template variables: {exception}"

Warning Level:

  • "lxml is not installed. Cannot parse XML schemas."
  • "Could not parse schema file {path}: {exception}"

All error logs include full stack traces with exc_info=True.


Dependencies

Required Libraries

  • pandas: For reading CSV/Excel files (with the openpyxl engine for Excel)
  • pyyaml: For parsing YAML configuration files

Standard Library Modules

  • pathlib: For cross-platform path handling
  • difflib: For similarity-based string matching
  • re: For regex pattern matching
  • json: For JSON file parsing

Optional Libraries

  • ollama: For LLM model integration (gracefully degrades if not installed)
  • lxml: For XML schema parsing (falls back to empty results if not installed)