Utilities API Reference

The Heritage Data Processor Utilities API provides helper endpoints for file parsing, metadata extraction, intelligent field mapping, and integration with external services such as Ollama for local LLM inference.

Base URL

All endpoints are prefixed with /utils.


File Parsing

Get YAML Keys

Parses a YAML file and returns its top-level or nested keys, with automatic exclusion of common system keys.

Endpoint: POST /utils/get_yaml_keys

Request Body:

{
  "file_path": "/config/metadata_mapping.yaml",
  "parent_key": "zenodo_mappings"
}

Request Parameters:

  • file_path (string, required): Absolute path to the YAML file
  • parent_key (string, optional): Parent key to retrieve nested keys from. If omitted, returns top-level keys

Response (Top-Level Keys):

{
  "keys": [
    "metadata_extraction",
    "description_generation",
    "keyword_suggestion"
  ]
}

Response (Nested Keys):

{
  "keys": [
    "title_mapping",
    "creator_mapping",
    "date_mapping"
  ]
}

Key Exclusion:

When retrieving top-level keys, the following system keys are automatically excluded:

  • settings
  • version
  • sources

Status Codes:

  • 200 OK: Keys retrieved successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: YAML parsing error or unexpected exception
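The key-selection behavior can be sketched as follows. This is a minimal illustration, not the server code; it assumes the YAML file has already been loaded into a dict (e.g. via yaml.safe_load), and the helper name select_keys is hypothetical:

```python
# Illustrative sketch of the key-selection logic described above.
EXCLUDED_KEYS = {"settings", "version", "sources"}

def select_keys(data, parent_key=None):
    """Return nested keys under parent_key, or filtered top-level keys."""
    if parent_key is not None:
        node = data.get(parent_key, {})
        # Only dict nodes have nested keys
        return list(node) if isinstance(node, dict) else []
    return [k for k in data if k not in EXCLUDED_KEYS]
```

With no parent_key, system keys are filtered out of the top level; with a parent_key, the nested keys under it are returned unfiltered.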

Get Table Headers

Reads a CSV, TSV, or Excel file and returns its column headers.

Endpoint: POST /utils/get_table_headers

Request Body:

{
  "file_path": "/data/metadata.csv"
}

Request Parameters:

  • file_path (string, required): Absolute path to the table file

Response:

{
  "success": true,
  "headers": [
    "ID",
    "Title",
    "Creator",
    "Date",
    "Description",
    "Keywords"
  ]
}

Supported File Types:

  • Excel: .xls, .xlsx (uses openpyxl engine)
  • TSV: .tsv (tab-separated values)
  • CSV: All other extensions, defaults to comma-separated

Status Codes:

  • 200 OK: Headers retrieved successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Failed to read file (corrupt file, encoding issues, etc.)
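The extension-based dispatch can be sketched with the standard library alone. Note this is an approximation: the real endpoint reads files with pandas (and openpyxl for Excel); the sketch below covers only the delimiter choice for text files, and the helper name read_headers is hypothetical:

```python
import csv
from pathlib import Path

def read_headers(file_path):
    """Return the header row of a CSV/TSV file.

    Sketch only: the Excel branch (pandas + openpyxl) is omitted.
    """
    path = Path(file_path)
    # .tsv is tab-separated; everything else defaults to comma-separated
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with path.open(newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter))
```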

Get JSON Keys

Parses a JSON file and returns its top-level keys in sorted order.

Endpoint: POST /utils/get_json_keys

Request Body:

{
  "file_path": "/output/pipeline_results.json"
}

Request Parameters:

  • file_path (string, required): Absolute path to the JSON file

Response:

{
  "success": true,
  "keys": [
    "metadata",
    "results",
    "summary",
    "timestamp",
    "validation_status"
  ]
}

Validation:

The endpoint validates that the JSON file contains a dictionary (object) at the root level. Array-based JSON files will return an error.
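A minimal sketch of this validation (illustrative; the helper name get_json_keys mirrors the endpoint but is not the server code):

```python
import json

def get_json_keys(file_path):
    """Return sorted top-level keys; reject non-object roots."""
    with open(file_path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        # Array-based (or scalar) JSON roots are rejected, as documented
        raise ValueError("The provided file is not a JSON object (dictionary).")
    return sorted(data)
```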

Status Codes:

  • 200 OK: Keys retrieved successfully
  • 400 Bad Request: Missing file_path parameter or JSON root is not a dictionary
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Invalid JSON format or file read error

Preview Spreadsheet

Reads and returns the first 5 rows of a CSV or Excel file for preview purposes.

Endpoint: POST /utils/preview_spreadsheet

Request Body:

{
  "filePath": "/data/metadata.xlsx"
}

Request Parameters:

  • filePath (string, required): Absolute path to the spreadsheet file

Response:

{
  "success": true,
  "columns": [
    "ID",
    "Title",
    "Creator"
  ],
  "previewData": [
    {
      "ID": 1,
      "Title": "Medieval Manuscript",
      "Creator": "Unknown"
    },
    {
      "ID": 2,
      "Title": "Ancient Artifact",
      "Creator": "Smith, J."
    }
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • columns (array): List of column headers
  • previewData (array): Array of row objects (maximum 5 rows), with column headers as keys

Data Format:

The previewData field is produced with pandas' to_dict(orient="records"), so each row is a dictionary keyed by column name.
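The records shape can be reproduced without pandas; this hypothetical to_records helper shows the structure of the response:

```python
def to_records(columns, rows, limit=5):
    """Mimic DataFrame.head(limit).to_dict(orient="records"):
    each row becomes a dict keyed by column name."""
    return [dict(zip(columns, row)) for row in rows[:limit]]
```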

Status Codes:

  • 200 OK: Preview generated successfully
  • 404 Not Found: File not found at specified path
  • 500 Internal Server Error: Error reading file

Template Processing

Get Template Variables

Extracts all placeholder variables from a template file using pattern matching.

Endpoint: POST /utils/get_template_variables

Request Body:

{
  "file_path": "/templates/description_template.txt"
}

Request Parameters:

  • file_path (string, required): Absolute path to the template file

Response:

{
  "success": true,
  "variables": [
    "creator",
    "date",
    "description",
    "title"
  ]
}

Pattern Matching:

The endpoint searches for placeholders in the format {variable}, extracting alphanumeric variable names including underscores. Variables are returned in sorted, deduplicated order.

Supported Formats:

  • {variable_name}
  • {VariableName123}
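The extraction described above amounts to a single regex pass; a minimal sketch:

```python
import re

def extract_variables(template_text):
    """Find {word} placeholders; return them deduplicated and sorted."""
    return sorted(set(re.findall(r"\{(\w+)\}", template_text)))
```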

Status Codes:

  • 200 OK: Variables extracted successfully
  • 404 Not Found: File not found or path not provided
  • 500 Internal Server Error: File read error

Get Prompt Placeholders

Parses a prompts YAML file and finds all unique {placeholder} strings within a specific prompt configuration.

Endpoint: POST /utils/get_prompt_placeholders

Request Body:

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction"
}

Request Parameters:

  • file_path (string, required): Absolute path to the prompts YAML file
  • prompt_id (string, required): Identifier of the prompt section to analyze

Response:

{
  "success": true,
  "placeholders": [
    "creator",
    "filename",
    "format",
    "title"
  ]
}

Search Scope:

The endpoint searches for placeholders within the system and user fields of all prompt variations under the specified prompt_id. Placeholders are extracted using regex pattern \{(\w+)\} and returned in sorted order.
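The search scope can be sketched as follows, assuming the prompt section has already been loaded from YAML into a dict of variations (the exact server-side data access is an assumption):

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def collect_placeholders(prompt_section):
    """Scan the system/user fields of every variation under a prompt ID."""
    found = set()
    for variation in prompt_section.values():
        if isinstance(variation, dict):
            for field in ("system", "user"):
                found.update(PLACEHOLDER.findall(variation.get(field, "") or ""))
    return sorted(found)
```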

Status Codes:

  • 200 OK: Placeholders extracted successfully
  • 400 Bad Request: Missing file_path or prompt_id, or file not found
  • 500 Internal Server Error: YAML parsing error or unexpected exception

Get Prompt Content

Retrieves the system prompt, user prompt, and suggested model from a prompts YAML file.

Endpoint: POST /utils/get_prompt_content

Request Body:

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction",
  "prompt_key": "default"
}

Request Parameters:

  • file_path (string, required): Absolute path to the prompts YAML file
  • prompt_id (string, required): Top-level prompt identifier
  • prompt_key (string, required): Sub-key within the prompt ID (e.g., default, detailed, concise)

Response:

{
  "success": true,
  "system": "You are an expert metadata curator specializing in cultural heritage data.",
  "user": "Extract metadata from the following file: {filename}. The file contains information about {title}.",
  "suggested_model": "llama3.1:8b"
}

Response Fields:

  • success (boolean): Operation success status
  • system (string): System prompt text, or "Not defined." if missing
  • user (string): User prompt template text, or "Not defined." if missing
  • suggested_model (string, nullable): Recommended model identifier, or null if not specified

Status Codes:

  • 200 OK: Prompt content retrieved successfully
  • 400 Bad Request: Missing required fields
  • 500 Internal Server Error: File read or YAML parsing error

Schema Processing

Get Template Mapping Info

Analyzes a template file against XML Schema Definition (XSD) files to extract controlled vocabularies and element type mappings.

Endpoint: POST /utils/get_template_mapping_info

Request Body:

{
  "schema_dir": "/schemas/edm",
  "template_file": "/templates/edm_template.xml"
}

Request Parameters:

  • schema_dir (string, required): Path to directory containing .xsd schema files
  • template_file (string, required): Path to template file containing ${variable} placeholders

Response:

{
  "success": true,
  "variables": [
    {
      "name": "dc_type",
      "has_vocab": true,
      "vocab_values": [
        "TEXT",
        "IMAGE",
        "VIDEO",
        "SOUND",
        "3D"
      ]
    },
    {
      "name": "dc_creator",
      "has_vocab": false,
      "vocab_values": []
    }
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • variables (array): List of template variable objects
  • name (string): Variable name extracted from template
  • has_vocab (boolean): Whether a controlled vocabulary exists for this variable
  • vocab_values (array): List of allowed values if controlled vocabulary exists

Processing Logic:

The endpoint performs the following steps:

  1. Schema Parsing: Parses all .xsd files in the schema directory to build type definitions and element-to-type mappings
  2. Variable Extraction: Extracts variables from template using pattern ${variable_name}
  3. Namespace Resolution: Splits variables like dc_creator into prefix (dc) and local name (creator)
  4. Type Lookup: Maps element names to their schema types
  5. Vocabulary Extraction: Retrieves enumeration values for simple types with controlled vocabularies
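Steps 2 and 3 can be sketched in isolation (illustrative only; the real endpoint additionally resolves types and vocabularies from the lxml-parsed schemas):

```python
import re

def split_template_variables(template_text):
    """Extract ${name} variables and split them into (prefix, local_name).

    Variables without an underscore get a None prefix.
    """
    results = []
    for name in sorted(set(re.findall(r"\$\{(\w+)\}", template_text))):
        prefix, _, local = name.partition("_")
        results.append((prefix, local) if local else (None, name))
    return results
```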

Supported Namespace Prefixes:

  • dc: Dublin Core Elements (http://purl.org/dc/elements/1.1/)
  • edm: Europeana Data Model (http://www.europeana.eu/schemas/edm/)
  • dcterms: Dublin Core Terms (http://purl.org/dc/terms/)
  • skos: Simple Knowledge Organization System (http://www.w3.org/2004/02/skos/core#)

Dependency:

Requires the lxml library. If it is not installed, the endpoint logs a warning and returns empty mappings.

Status Codes:

  • 200 OK: Template analyzed successfully
  • 400 Bad Request: Missing schema_dir or template_file
  • 500 Internal Server Error: Schema parsing error or template read error

Intelligent Mapping

Automap Zenodo Fields

Automatically suggests mappings from table column headers to Zenodo metadata fields using keyword matching and similarity algorithms.

Endpoint: POST /utils/automap_zenodo_fields

Request Body:

{
  "headers": [
    "Title",
    "Author Name",
    "Publication Date",
    "Summary",
    "Tags"
  ]
}

Request Parameters:

  • headers (array, required): List of column header names from a table

Response:

{
  "success": true,
  "mapping": {
    "title": "Title",
    "creators": "Author Name",
    "publication_date": "Publication Date",
    "description": "Summary",
    "keywords": "Tags"
  }
}

Response Fields:

  • success (boolean): Operation success status
  • mapping (object): Dictionary mapping Zenodo field names to table column headers

Mapping Algorithm:

The endpoint uses a two-level matching strategy:

Level 1 - Keyword Matching: Exact match of normalized header against predefined keywords for each Zenodo field

Level 2 - Similarity Matching: For remaining unmapped fields, uses difflib.SequenceMatcher with threshold of 0.8 to find best matches

Zenodo Field Keywords:

  • title: ["title", "headline", "name"]
  • description: ["description", "summary", "abstract"]
  • creators: ["author", "creator", "artist", "writer"]
  • publication_date: ["date", "publicationdate", "pub_date", "year"]
  • keywords: ["keywords", "tags", "subjects"]

Normalization:

Headers and keywords are normalized by converting to lowercase and removing spaces and underscores.
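The two-level strategy can be sketched as below. This is a simplified reconstruction from the documented behavior, not the server code; in particular, the assumption that Level 2 fuzzy-matches headers against the same keyword lists is ours:

```python
from difflib import SequenceMatcher

FIELD_KEYWORDS = {
    "title": ["title", "headline", "name"],
    "description": ["description", "summary", "abstract"],
    "creators": ["author", "creator", "artist", "writer"],
    "publication_date": ["date", "publicationdate", "pub_date", "year"],
    "keywords": ["keywords", "tags", "subjects"],
}

def normalize(s):
    # Lowercase, strip spaces and underscores (as documented)
    return s.lower().replace(" ", "").replace("_", "")

def automap_zenodo(headers):
    mapping, used = {}, set()
    norm = {h: normalize(h) for h in headers}
    # Level 1: exact match of normalized header against keywords
    for field, kws in FIELD_KEYWORDS.items():
        kwset = {normalize(k) for k in kws}
        for h in headers:
            if h not in used and norm[h] in kwset:
                mapping[field] = h
                used.add(h)
                break
    # Level 2: SequenceMatcher >= 0.8 against the keywords (assumed comparand)
    for field, kws in FIELD_KEYWORDS.items():
        if field in mapping:
            continue
        best, best_score = None, 0.8
        for h in headers:
            if h in used:
                continue
            score = max(SequenceMatcher(None, normalize(k), norm[h]).ratio() for k in kws)
            if score >= best_score:
                best, best_score = h, score
        if best:
            mapping[field] = best
            used.add(best)
    return mapping
```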

Status Codes:

  • 200 OK: Mapping generated successfully
  • 400 Bad Request: No headers provided

Automap Columns

Intelligently maps table columns to prompt placeholders using a four-level matching algorithm.

Endpoint: POST /utils/automap_columns

Request Body:

{
  "table_path": "/data/metadata.csv",
  "prompts_path": "/config/prompts.yaml",
  "prompt_id": "metadata_extraction"
}

Request Parameters:

  • table_path (string, required): Path to CSV or Excel file containing data
  • prompts_path (string, required): Path to YAML file containing prompts
  • prompt_id (string, required): Identifier of the prompt to analyze

Response:

{
  "success": true,
  "mapping": {
    "title": "Title",
    "creator": "Author",
    "date": "Creation_Date",
    "description": "Summary"
  }
}

Response Fields:

  • success (boolean): Operation success status
  • mapping (object): Dictionary mapping placeholder names to table column headers

Four-Level Matching Algorithm:

The endpoint employs an optimized, layered matching strategy:

Level 1 - Exact Match: Direct string equality between placeholder and header (most efficient)

Level 2 - Normalized Match: Comparison after converting to lowercase and removing spaces/underscores

Level 3 - Plural/Singular Match: Handles simple plural forms by adding/removing trailing 's'

Level 4 - Similarity Match: Uses difflib.SequenceMatcher with threshold of 0.8 for fuzzy matching
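The four levels above can be sketched as follows; this is a simplified reconstruction of the documented strategy (without the caching and pre-normalization optimizations), not the server implementation:

```python
from difflib import SequenceMatcher

def normalize(s):
    return s.lower().replace(" ", "").replace("_", "")

def match_columns(placeholders, headers):
    mapping, used = {}, set()
    norm_h = {h: normalize(h) for h in headers}
    for p in placeholders:
        np = normalize(p)
        candidate = None
        # Level 1: exact string equality
        for h in headers:
            if h not in used and h == p:
                candidate = h
                break
        # Level 2: normalized equality
        if candidate is None:
            for h in headers:
                if h not in used and norm_h[h] == np:
                    candidate = h
                    break
        # Level 3: simple plural/singular (trailing 's')
        if candidate is None:
            for h in headers:
                if h not in used and (norm_h[h] == np + "s" or norm_h[h] + "s" == np):
                    candidate = h
                    break
        # Level 4: SequenceMatcher similarity >= 0.8
        if candidate is None:
            best, best_score = None, 0.8
            for h in headers:
                if h in used:
                    continue
                score = SequenceMatcher(None, np, norm_h[h]).ratio()
                if score >= best_score:
                    best, best_score = h, score
            candidate = best
        if candidate:
            mapping[p] = candidate
            used.add(candidate)  # prevent duplicate mappings
    return mapping
```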

Performance Optimizations:

  • Pre-normalizes all strings once for efficiency
  • Uses compiled regex pattern cached at function level
  • Optimizes text concatenation using join()
  • Early returns when no placeholders or headers found
  • Tracks used headers to prevent duplicate mappings

Status Codes:

  • 200 OK: Mapping completed successfully
  • 400 Bad Request: Missing required paths, prompt ID, or invalid YAML format
  • 404 Not Found: File not found
  • 500 Internal Server Error: Unexpected error during mapping

External Service Integration

List Ollama Models

Lists available Large Language Models from the Ollama service.

Endpoint: GET /utils/list_ollama_models

Response:

{
  "success": true,
  "models": [
    "llama3.1:8b",
    "llama3.1:70b",
    "mistral:7b",
    "codellama:13b"
  ]
}

Response Fields:

  • success (boolean): Operation success status
  • models (array): List of model names available in Ollama

Error Responses:

Library Not Installed:

{
  "success": false,
  "error": "Ollama library not installed on the server."
}

Service Not Reachable:

{
  "success": false,
  "error": "Ollama not reachable.",
  "details": "Connection refused: localhost:11434"
}

Dependencies:

  • Requires ollama Python library to be installed
  • Requires Ollama service to be running and accessible

Status Codes:

  • 200 OK: Always returned; check the success field to distinguish successful listings from error responses

Error Handling

Common Error Scenarios

File Not Found:

POST /utils/get_yaml_keys
{"file_path": "/nonexistent/file.yaml"}

Response: 404 Not Found

{
  "error": "File not found"
}

Missing Required Parameters:

POST /utils/get_prompt_content
{"file_path": "/config/prompts.yaml"}

Response: 400 Bad Request

{
  "error": "Missing required fields"
}

Invalid JSON Structure:

POST /utils/get_json_keys
{"file_path": "/data/array.json"}

Where array.json contains a JSON array rather than an object at the root.

Response: 400 Bad Request

{
  "success": false,
  "error": "The provided file is not a JSON object (dictionary)."
}

YAML Parsing Error:

Response: 500 Internal Server Error

{
  "success": false,
  "error": "Invalid YAML format: ..."
}

Usage Examples

Example 1: Extract Placeholders and Auto-Map Columns

Step 1: Get placeholders from prompts

POST /utils/get_prompt_placeholders
Content-Type: application/json

{
  "file_path": "/config/prompts.yaml",
  "prompt_id": "description_generation"
}

Step 2: Auto-map to table columns

POST /utils/automap_columns
Content-Type: application/json

{
  "table_path": "/data/artifacts.csv",
  "prompts_path": "/config/prompts.yaml",
  "prompt_id": "description_generation"
}

Example 2: Preview Spreadsheet and Map to Zenodo

POST /utils/preview_spreadsheet
Content-Type: application/json

{"filePath": "/data/metadata.xlsx"}

Then use the returned headers:

POST /utils/automap_zenodo_fields
Content-Type: application/json

{
  "headers": ["Title", "Author", "Date", "Abstract"]
}

Example 3: Analyze Template with Schema Vocabularies

POST /utils/get_template_mapping_info
Content-Type: application/json

{
  "schema_dir": "/schemas/europeana",
  "template_file": "/templates/edm_record.xml"
}

Example 4: List Available LLM Models

GET /utils/list_ollama_models

Logging

Application Logger

The module uses Python's standard logging with logger name utils_bp:

Error Level:

  • "Error in get_yaml_keys: {exception}"
  • "Error in get_table_headers: {exception}"
  • "Error in get_prompt_placeholders: {exception}"
  • "Error in preview_spreadsheet: {exception}"
  • "Failed to get JSON keys from {path}: {exception}"
  • "Failed to get template mapping info: {exception}"
  • "Automapping failed: {exception}"
  • "Failed to get template variables: {exception}"

Warning Level:

  • "lxml is not installed. Cannot parse XML schemas."
  • "Could not parse schema file {path}: {exception}"

All error logs include full stack traces with exc_info=True.


Dependencies

Required Libraries

  • pandas: For reading CSV/Excel files (with the openpyxl engine for Excel)
  • pyyaml: For parsing YAML configuration files

Standard Library Modules

  • pathlib: For cross-platform path handling
  • difflib: For similarity-based string matching
  • re: For regex pattern matching
  • json: For JSON file parsing

Optional Libraries

  • ollama: For LLM model integration (gracefully degrades if not installed)
  • lxml: For XML schema parsing (falls back to empty results if not installed)