Data Query API Reference

The Heritage Data Processor Data Query API provides read-only endpoints for retrieving project data, including files, Zenodo records, pipeline configurations, batches, API logs, credentials, and file hierarchies.

Base URL

All endpoint paths in this reference are relative to the URL prefix at which the API's Blueprint is mounted.


Project Requirement

Project Context Required

All endpoints in this API require an active HDPC project to be loaded. They use the @project_required decorator, which returns a 400 Bad Request error if no project is loaded.
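
The behaviour described above can be sketched as a plain Python decorator. This is an illustration only, not the project's actual implementation: the ProjectManager class, its is_loaded property, the error message text, and the (body, status) return convention are all assumptions made for the example.

```python
from functools import wraps


class ProjectManager:
    """Hypothetical stand-in for the application's project state."""

    def __init__(self):
        self.db_path = None  # set once a project has been loaded

    @property
    def is_loaded(self):
        return self.db_path is not None


project_manager = ProjectManager()


def project_required(view):
    """Short-circuit with a 400-style response when no project is loaded."""

    @wraps(view)
    def wrapper(*args, **kwargs):
        if not project_manager.is_loaded:
            # Assumed error body; the real message text may differ.
            return {"error": "No project loaded"}, 400
        return view(*args, **kwargs)

    return wrapper


@project_required
def list_batches():
    # Placeholder view that would normally query the project database.
    return [], 200
```

With no project loaded, calling list_batches() returns the 400 pair; once project_manager.db_path is set, the wrapped view runs normally.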


File Management

List Source Files

Retrieves a paginated list of source files from the project database with optional search filtering.

Endpoint: GET /files

Query Parameters:

  • page (integer, optional): Page number for pagination. Defaults to 1
  • limit (integer, optional): Number of items per page. Defaults to 25
  • search (string, optional): Search term to filter files by filename using partial matching

Response:

{
  "items": [
    {
      "filename": "manuscript_001.xml",
      "relative_path": "data/manuscripts/manuscript_001.xml",
      "size_bytes": 45120,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "processed"
    },
    {
      "filename": "manuscript_002.xml",
      "relative_path": "data/manuscripts/manuscript_002.xml",
      "size_bytes": 38945,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "pending"
    }
  ],
  "totalItems": 150,
  "page": 1,
  "totalPages": 6
}

Response Fields:

  • items (array): List of source file objects for the current page
  • filename (string): Name of the file
  • relative_path (string): Path relative to the project root
  • size_bytes (integer): File size in bytes
  • mime_type (string): MIME type of the file
  • file_type (string): Classification of the file type
  • status (string): Processing status of the file
  • totalItems (integer): Total number of files matching the search criteria
  • page (integer): Current page number
  • totalPages (integer): Total number of pages based on the limit

Search Behavior:

The search parameter performs partial matching on the filename field using SQL LIKE with wildcards (%search%). SQLite's LIKE operator is case-insensitive by default, but only for ASCII characters.
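
As a rough sketch of this kind of filter, the snippet below runs a parameterized LIKE query against an in-memory SQLite database. The reduced table layout and the search_files helper are assumptions for illustration; the real query may select more columns and add pagination.

```python
import sqlite3

# Reduced, hypothetical schema: only the columns needed for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_files (file_id INTEGER PRIMARY KEY, filename TEXT)"
)
conn.executemany(
    "INSERT INTO source_files (filename) VALUES (?)",
    [("Manuscript_001.xml",), ("manuscript_002.xml",), ("model.obj",)],
)


def search_files(conn, search):
    """Return filenames containing the search term (ASCII case-insensitive)."""
    pattern = f"%{search}%"  # wrap the term in wildcards, bind it as a parameter
    rows = conn.execute(
        "SELECT filename FROM source_files WHERE filename LIKE ? ORDER BY filename",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]


matches = search_files(conn, "manuscript")
```

Binding the pattern as a parameter rather than formatting it into the SQL string keeps the search term from being interpreted as SQL.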

Status Codes:

  • 200 OK: Files retrieved successfully
  • 400 Bad Request: No project loaded

Get File Hierarchy

Recursively retrieves a file and all its child files (dependencies, associated files) to build a complete hierarchical tree structure.

Endpoint: GET /files/<file_id>/hierarchy

URL Parameters:

  • file_id (integer, required): Database ID of the root file

Response:

{
  "file_id": 42,
  "filename": "model.obj",
  "absolute_path": "/project/models/model.obj",
  "file_type": "obj",
  "status": "processed",
  "error_message": null,
  "children": [
    {
      "file_id": 43,
      "filename": "model.mtl",
      "absolute_path": "/project/models/model.mtl",
      "file_type": "mtl",
      "status": "processed",
      "error_message": null,
      "children": [
        {
          "file_id": 44,
          "filename": "texture.png",
          "absolute_path": "/project/textures/texture.png",
          "file_type": "texture",
          "status": "processed",
          "error_message": null,
          "children": []
        }
      ]
    }
  ]
}

Response Fields:

  • file_id (integer): Database identifier for the file
  • filename (string): Name of the file
  • absolute_path (string): Full filesystem path to the file
  • file_type (string): Classification of the file type
  • status (string): Processing status of the file
  • error_message (string, nullable): Error message if processing failed, otherwise null
  • children (array): Recursive array of child file objects with the same structure

Hierarchy Logic:

The endpoint uses the parent_file_id foreign key relationship in the source_files table to build the tree. Children are sorted alphabetically by filename.
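
The tree construction can be sketched as a recursive lookup over parent_file_id. The schema below is a reduced, hypothetical version of source_files, and build_hierarchy is an illustrative name rather than the endpoint's actual function.

```python
import sqlite3

# Hypothetical, reduced schema: the real source_files table has more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_files (
        file_id INTEGER PRIMARY KEY,
        filename TEXT NOT NULL,
        parent_file_id INTEGER REFERENCES source_files(file_id)
    );
    INSERT INTO source_files VALUES
        (42, 'model.obj', NULL),
        (43, 'model.mtl', 42),
        (44, 'texture.png', 43),
        (45, 'alt.mtl', 42);
    """
)


def build_hierarchy(conn, file_id):
    """Return the file and its descendants as nested dicts, or None if absent."""
    row = conn.execute(
        "SELECT file_id, filename FROM source_files WHERE file_id = ?",
        (file_id,),
    ).fetchone()
    if row is None:
        return None
    # Children are fetched in filename order, matching the documented sorting.
    children = conn.execute(
        "SELECT file_id FROM source_files WHERE parent_file_id = ? ORDER BY filename",
        (file_id,),
    ).fetchall()
    return {
        "file_id": row[0],
        "filename": row[1],
        "children": [build_hierarchy(conn, child_id) for (child_id,) in children],
    }


tree = build_hierarchy(conn, 42)
```

A None result corresponds to the 404 case. This sketch assumes the parent links contain no cycles; a cyclic parent_file_id chain would recurse without end.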

Status Codes:

  • 200 OK: Hierarchy retrieved successfully
  • 400 Bad Request: No project loaded
  • 404 Not Found: File with the specified ID does not exist
  • 500 Internal Server Error: Database error during recursive fetch

Use Case:

This endpoint is designed for displaying file dependencies in modals or tree views, such as 3D models with their material files and textures.


Zenodo Integration

Get Latest Zenodo Record

Retrieves the most recently updated Zenodo record metadata from the project database.

Endpoint: GET /zenodo_record

Response:

{
  "record_title": "Medieval Manuscript Collection - Volume 1",
  "zenodo_doi": "10.5281/zenodo.1234567",
  "record_status": "published",
  "record_metadata_json": "{\"title\": \"Medieval Manuscript Collection\", \"creators\": [...]}",
  "version": "1.2.0"
}

Response Fields:

  • record_title (string): Title of the Zenodo record
  • zenodo_doi (string): Digital Object Identifier assigned by Zenodo
  • record_status (string): Current status of the record (e.g., draft, published)
  • record_metadata_json (string): JSON-serialized metadata object containing full Zenodo metadata
  • version (string): Version number of the record

Empty Response:

If no Zenodo records exist in the project, the endpoint returns an empty object {}.

Ordering:

Records are ordered by last_updated_timestamp in descending order, ensuring the most recent record is returned.

Status Codes:

  • 200 OK: Record retrieved successfully (or empty object if no records exist)
  • 400 Bad Request: No project loaded

List Record Files

Retrieves all files associated with a specific Zenodo record, including their upload status and metadata.

Endpoint: GET /records/<record_id>/files

URL Parameters:

  • record_id (integer, required): Database ID of the Zenodo record

Response:

[
  {
    "file_id": 101,
    "filename": "manuscript_001.xml",
    "absolute_path": "/project/data/manuscript_001.xml",
    "file_type": "xml",
    "pipeline_source": "xml_processor",
    "step_source": "validation",
    "upload_status": "uploaded"
  },
  {
    "file_id": 102,
    "filename": "metadata.json",
    "absolute_path": "/project/output/metadata.json",
    "file_type": "json",
    "pipeline_source": "metadata_generator",
    "step_source": "generation",
    "upload_status": "pending"
  }
]

Response Fields:

  • file_id (integer): Database identifier for the file
  • filename (string): Name of the file
  • absolute_path (string): Full filesystem path to the file
  • file_type (string): Classification of the file type
  • pipeline_source (string): Name of the pipeline component that produced this file
  • step_source (string): Specific processing step that generated the file
  • upload_status (string): Current upload status (e.g., pending, uploaded, failed)

Ordering:

Files are sorted by file_type in descending order, then by filename in ascending order.

Database Join:

This endpoint joins the record_files_map table with the source_files table to retrieve comprehensive file information.
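
A join of this shape, combined with the ordering described above, might look like the following sketch. The column names in record_files_map (record_id, file_id, upload_status) and the reduced source_files layout are assumptions; only the two table names come from this reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dicts
conn.executescript(
    """
    -- Hypothetical, reduced schemas for illustration only.
    CREATE TABLE source_files (
        file_id INTEGER PRIMARY KEY, filename TEXT, file_type TEXT
    );
    CREATE TABLE record_files_map (
        record_id INTEGER, file_id INTEGER, upload_status TEXT
    );
    INSERT INTO source_files VALUES
        (101, 'manuscript_001.xml', 'xml'),
        (102, 'metadata.json', 'json'),
        (103, 'appendix.xml', 'xml');
    INSERT INTO record_files_map VALUES
        (7, 101, 'uploaded'), (7, 102, 'pending'),
        (7, 103, 'uploaded'), (8, 101, 'pending');
    """
)


def record_files(conn, record_id):
    """List a record's files: file_type descending, then filename ascending."""
    rows = conn.execute(
        """
        SELECT sf.file_id, sf.filename, sf.file_type, m.upload_status
        FROM record_files_map AS m
        JOIN source_files AS sf ON sf.file_id = m.file_id
        WHERE m.record_id = ?
        ORDER BY sf.file_type DESC, sf.filename ASC
        """,
        (record_id,),
    ).fetchall()
    return [dict(row) for row in rows]


files = record_files(conn, 7)
```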

Status Codes:

  • 200 OK: Files retrieved successfully
  • 400 Bad Request: No project loaded
  • 500 Internal Server Error: Database query failed

Pipeline Configuration

Get Pipeline Steps

Retrieves all configured pipeline steps for the project, ordered by modality and execution order.

Endpoint: GET /pipeline_steps

Response:

[
  {
    "modality": "text",
    "component_name": "xml_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": "{\"schema\": \"tei_all\", \"strict\": true}"
  },
  {
    "modality": "text",
    "component_name": "metadata_extractor",
    "component_order": 2,
    "is_active": 1,
    "parameters": "{\"format\": \"json\"}"
  },
  {
    "modality": "image",
    "component_name": "image_processor",
    "component_order": 1,
    "is_active": 0,
    "parameters": "{\"resize\": true, \"quality\": 95}"
  }
]

Response Fields:

  • modality (string): Data modality or category (e.g., text, image, 3d_model)
  • component_name (string): Name of the pipeline component
  • component_order (integer): Execution order within the modality
  • is_active (integer): Boolean flag indicating whether the component is active (1) or disabled (0)
  • parameters (string): JSON-serialized parameters for the component

Ordering:

Pipeline steps are ordered first by modality, then by component_order to reflect the execution sequence.
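
On the client side, the ordered list can be grouped into one execution sequence per modality. The sketch below is one possible approach; active_components_by_modality is a hypothetical helper, and the sample data is a trimmed copy of the response above.

```python
from itertools import groupby

# Trimmed sample of a /pipeline_steps response (parameters omitted).
steps = [
    {"modality": "text", "component_name": "xml_validator",
     "component_order": 1, "is_active": 1},
    {"modality": "text", "component_name": "metadata_extractor",
     "component_order": 2, "is_active": 1},
    {"modality": "image", "component_name": "image_processor",
     "component_order": 1, "is_active": 0},
]


def active_components_by_modality(steps):
    """Map each modality to its active component names in execution order."""
    # Re-sort defensively so groupby sees each modality as one contiguous run.
    ordered = sorted(steps, key=lambda s: (s["modality"], s["component_order"]))
    return {
        modality: [s["component_name"] for s in group if s["is_active"] == 1]
        for modality, group in groupby(ordered, key=lambda s: s["modality"])
    }


plan = active_components_by_modality(steps)
```

A modality whose components are all disabled still appears in the result, with an empty list.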

Empty Response:

If no pipeline steps are configured, the endpoint returns an empty array [].

Status Codes:

  • 200 OK: Pipeline steps retrieved successfully
  • 400 Bad Request: No project loaded

Project Configuration

Get Configuration Settings

Retrieves all project-level configuration key-value pairs.

Endpoint: GET /configuration

Response:

[
  {
    "config_key": "project_name",
    "config_value": "Medieval Manuscripts Archive"
  },
  {
    "config_key": "default_output_format",
    "config_value": "xml"
  },
  {
    "config_key": "enable_auto_backup",
    "config_value": "true"
  }
]

Response Fields:

  • config_key (string): Configuration parameter name
  • config_value (string): Configuration parameter value (stored as string regardless of actual data type)
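
Because every value arrives as a string, clients typically convert the list into a lookup table and coerce values themselves. The sketch below shows one way to do that; the helper names and the set of strings treated as true are assumptions, not part of the API.

```python
# Sample entries in the shape returned by /configuration.
entries = [
    {"config_key": "project_name", "config_value": "Medieval Manuscripts Archive"},
    {"config_key": "default_output_format", "config_value": "xml"},
    {"config_key": "enable_auto_backup", "config_value": "true"},
]


def config_to_dict(entries):
    """Turn the list of key-value objects into a plain dict."""
    return {entry["config_key"]: entry["config_value"] for entry in entries}


def as_bool(value):
    """Interpret a string flag; the accepted spellings are an assumption."""
    return value.strip().lower() in {"true", "1", "yes"}


config = config_to_dict(entries)
auto_backup = as_bool(config["enable_auto_backup"])
```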

Empty Response:

If no configuration entries exist, the endpoint returns an empty array [].

Status Codes:

  • 200 OK: Configuration retrieved successfully
  • 400 Bad Request: No project loaded

Batch Processing

List Batches

Retrieves all processing batches created in the project, ordered by creation timestamp.

Endpoint: GET /batches

Response:

[
  {
    "batch_name": "Manuscript Validation - January 2025",
    "batch_description": "Validation of all manuscript files received in January",
    "status": "completed",
    "created_timestamp": "2025-01-15T10:30:00Z"
  },
  {
    "batch_name": "Image Processing - February 2025",
    "batch_description": "Batch processing of scanned images",
    "status": "in_progress",
    "created_timestamp": "2025-02-01T08:00:00Z"
  }
]

Response Fields:

  • batch_name (string): Descriptive name of the batch
  • batch_description (string): Detailed description of the batch purpose
  • status (string): Current processing status (e.g., pending, in_progress, completed, failed)
  • created_timestamp (string): ISO 8601 timestamp of batch creation

Ordering:

Batches are sorted by created_timestamp in descending order, with the most recent batches first.

Empty Response:

If no batches exist, the endpoint returns an empty array [].

Status Codes:

  • 200 OK: Batches retrieved successfully
  • 400 Bad Request: No project loaded

API Logging

List API Log Entries

Retrieves paginated API activity logs showing HTTP requests made to external services (e.g., Zenodo).

Endpoint: GET /apilog

Query Parameters:

  • page (integer, optional): Page number for pagination. Defaults to 1
  • limit (integer, optional): Number of items per page. Defaults to 25

Response:

{
  "items": [
    {
      "timestamp": "2025-10-21T11:45:23Z",
      "http_method": "POST",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions",
      "response_status_code": 201,
      "status": "success"
    },
    {
      "timestamp": "2025-10-21T11:40:15Z",
      "http_method": "GET",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions/1234567",
      "response_status_code": 200,
      "status": "success"
    },
    {
      "timestamp": "2025-10-21T11:35:08Z",
      "http_method": "PUT",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions/1234567/files",
      "response_status_code": 500,
      "status": "failed"
    }
  ],
  "totalItems": 487,
  "page": 1,
  "totalPages": 20
}

Response Fields:

  • items (array): List of API log entries for the current page
  • timestamp (string): ISO 8601 timestamp of the API request
  • http_method (string): HTTP method used (e.g., GET, POST, PUT, DELETE)
  • endpoint_url (string): Full URL of the external API endpoint
  • response_status_code (integer): HTTP status code returned by the external API
  • status (string): Interpreted status of the request (e.g., success, failed)
  • totalItems (integer): Total number of log entries
  • page (integer): Current page number
  • totalPages (integer): Total number of pages based on the limit

Ordering:

Log entries are sorted by timestamp in descending order, with the most recent entries first.
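
To scan the whole log, for example to collect failed requests, a client can walk the pages until totalPages is reached. In the sketch below, fetch_page is a hypothetical callable standing in for an HTTP GET to /apilog; fake_fetch simulates it so the example runs without a server.

```python
def iter_items(fetch_page, limit=25):
    """Yield every item across all pages of a paginated endpoint."""
    page = 1
    while True:
        data = fetch_page(page=page, limit=limit)
        yield from data["items"]
        if page >= data["totalPages"]:
            break
        page += 1


def fake_fetch(page, limit):
    """Simulated /apilog response built from a small in-memory log."""
    log = [{"status": "success"}, {"status": "failed"}, {"status": "success"}]
    start = (page - 1) * limit
    return {
        "items": log[start:start + limit],
        "totalItems": len(log),
        "page": page,
        "totalPages": (len(log) + limit - 1) // limit,
    }


failed = [entry for entry in iter_items(fake_fetch, limit=2)
          if entry["status"] == "failed"]
```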

Status Codes:

  • 200 OK: Log entries retrieved successfully
  • 400 Bad Request: No project loaded

Credentials Management

List API Credentials

Retrieves all stored API credentials for external services, without exposing sensitive credential values.

Endpoint: GET /credentials

Response:

[
  {
    "credential_name": "Zenodo Production",
    "credential_type": "zenodo_api_token",
    "is_sandbox": 0
  },
  {
    "credential_name": "Zenodo Sandbox",
    "credential_type": "zenodo_api_token",
    "is_sandbox": 1
  }
]

Response Fields:

  • credential_name (string): Human-readable name for the credential
  • credential_type (string): Type or category of the credential (e.g., zenodo_api_token, oauth_token)
  • is_sandbox (integer): Boolean flag indicating whether this credential is for a sandbox environment (1) or production (0)

Security:

This endpoint does not return actual credential values (API keys, tokens, passwords) for security reasons.

Ordering:

Credentials are sorted alphabetically by credential_name.

Empty Response:

If no credentials are stored, the endpoint returns an empty array [].

Status Codes:

  • 200 OK: Credentials retrieved successfully
  • 400 Bad Request: No project loaded

Pagination

Pagination Format

Endpoints that support pagination (/files and /apilog) use a consistent pagination structure:

Query Parameters:

  • page: 1-indexed page number (defaults to 1)
  • limit: Number of items per page (defaults to 25)

Response Format:

{
  "items": [...],
  "totalItems": 150,
  "page": 2,
  "totalPages": 6
}

Page Calculation:

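The totalPages value is calculated using ceiling division: (totalItems + limit - 1) // limit. On its own this expression yields 0 when totalItems is 0, so a minimum of 1 page for empty result sets holds only if the implementation also clamps the result (for example with max(1, ...)).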

Offset Calculation:

The database offset is calculated as: (page - 1) * limit.
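
The two formulas can be checked with a few lines of Python. The function names are illustrative; the arithmetic is taken directly from the expressions above.

```python
def total_pages(total_items, limit=25):
    """Ceiling division: the number of pages needed to hold total_items."""
    return (total_items + limit - 1) // limit


def offset_for(page, limit=25):
    """Number of rows to skip for a 1-indexed page."""
    return (page - 1) * limit


pages = total_pages(150)        # 150 items at 25 per page
second_page_offset = offset_for(2)
empty_pages = total_pages(0)    # bare formula, no clamp applied
```

Note that the bare ceiling-division expression evaluates to 0 for an empty result set; reporting at least one page in that case would need an extra clamp such as max(1, total_pages(...)).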


Error Responses

Standard Error Format

All error responses follow a consistent JSON format:

{
  "error": "Descriptive error message"
}

Common Error Scenarios

No Project Loaded:

Because every endpoint is wrapped by the @project_required decorator, all of them return 400 Bad Request when no HDPC project is loaded.

File Not Found:

GET /files/99999/hierarchy

Response: 404 Not Found with error message "File not found"

Database Query Failed:

GET /records/123/files

Response: 500 Internal Server Error with error message "Database query failed"

General Exception:

The /files/<file_id>/hierarchy endpoint catches all exceptions and returns 500 Internal Server Error with the exception message.
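
A client can handle these cases uniformly by checking the status code and reading the error key. The sketch below works on a status code and raw body text, so it is independent of any particular HTTP library; ApiError and unwrap are hypothetical names.

```python
import json


class ApiError(Exception):
    """Raised when the API answers with a 4xx or 5xx status."""

    def __init__(self, status_code, message):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message


def unwrap(status_code, body_text):
    """Return the decoded JSON payload, or raise ApiError on an error status."""
    payload = json.loads(body_text) if body_text else None
    if status_code >= 400:
        message = "Unknown error"
        if isinstance(payload, dict):
            # Error bodies follow the {"error": "..."} format.
            message = payload.get("error", message)
        raise ApiError(status_code, message)
    return payload


batches = unwrap(200, "[]")
```

Calling unwrap(404, '{"error": "File not found"}') raises an ApiError whose status_code is 404 and whose message is the server's error text.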


Database Integration

Query Service

Most endpoints use the query_db function from the database service to execute SQL queries against the project SQLite database.

Connection Management:

Most endpoints use query_db, which handles connection lifecycle automatically. The /files/<file_id>/hierarchy endpoint uses get_db_connection directly for manual connection management due to its recursive nature.

Database Path:

All queries execute against project_manager.db_path, which points to the currently loaded project's database file.

Row Factory

Query results are returned as dictionaries with column names as keys, making them directly serializable to JSON.
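
One common way to get dictionary-style rows from SQLite in Python is sqlite3.Row combined with dict(). The sketch below shows the idea; fetch_all_as_dicts and the project_configuration table name are assumptions and do not reflect the actual query_db signature or schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows expose column names via keys()
# Hypothetical table name used only for this example.
conn.execute(
    "CREATE TABLE project_configuration (config_key TEXT, config_value TEXT)"
)
conn.execute(
    "INSERT INTO project_configuration VALUES (?, ?)",
    ("project_name", "Medieval Manuscripts Archive"),
)


def fetch_all_as_dicts(conn, sql, params=()):
    """Run a query and return each row as a plain dict keyed by column name."""
    return [dict(row) for row in conn.execute(sql, params).fetchall()]


rows = fetch_all_as_dicts(
    conn, "SELECT config_key, config_value FROM project_configuration"
)
payload = json.dumps(rows)  # plain dicts serialize without extra handling
```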


Usage Examples

Example 1: Search Files with Pagination

GET /files?page=2&limit=10&search=manuscript

Response (items list abbreviated to one entry):

{
  "items": [
    {
      "filename": "manuscript_011.xml",
      "relative_path": "data/manuscripts/manuscript_011.xml",
      "size_bytes": 42300,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "processed"
    }
  ],
  "totalItems": 35,
  "page": 2,
  "totalPages": 4
}

Example 2: Get Complete File Hierarchy

GET /files/42/hierarchy

Response:

{
  "file_id": 42,
  "filename": "scene.obj",
  "absolute_path": "/project/3d/scene.obj",
  "file_type": "obj",
  "status": "processed",
  "error_message": null,
  "children": [
    {
      "file_id": 43,
      "filename": "scene.mtl",
      "absolute_path": "/project/3d/scene.mtl",
      "file_type": "mtl",
      "status": "processed",
      "error_message": null,
      "children": [
        {
          "file_id": 44,
          "filename": "diffuse.jpg",
          "absolute_path": "/project/3d/textures/diffuse.jpg",
          "file_type": "texture",
          "status": "processed",
          "error_message": null,
          "children": []
        }
      ]
    }
  ]
}

Example 3: Monitor API Activity

GET /apilog?page=1&limit=5

Response (items list abbreviated to one entry):

{
  "items": [
    {
      "timestamp": "2025-10-21T13:30:00Z",
      "http_method": "POST",
      "endpoint_url": "https://sandbox.zenodo.org/api/deposit/depositions",
      "response_status_code": 201,
      "status": "success"
    }
  ],
  "totalItems": 142,
  "page": 1,
  "totalPages": 29
}

Example 4: View Pipeline Configuration

GET /pipeline_steps

Response:

[
  {
    "modality": "text",
    "component_name": "tei_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": "{\"encoding\": \"utf-8\", \"validate_schema\": true}"
  },
  {
    "modality": "text",
    "component_name": "metadata_enricher",
    "component_order": 2,
    "is_active": 1,
    "parameters": "{\"add_timestamps\": true}"
  }
]

Data Types & Constraints

Integer Fields

Integer fields returned from the database (file_id, size_bytes, etc.) follow their SQL column types. The request parameters file_id, record_id, page, and limit are parsed as integers.

Boolean Fields

Boolean values are stored as integers in SQLite: 0 for false, 1 for true (e.g., is_active, is_sandbox).

Timestamps

All timestamp fields follow ISO 8601 format (e.g., 2025-10-21T13:30:00Z).

JSON Fields

Parameters and metadata fields are stored as JSON-serialized strings and must be parsed by the client.
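
The conventions in this section translate into a small amount of client-side decoding. The sketch below converts a pipeline step and a timestamp into native Python values; the helper names are illustrative.

```python
import json
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2025-10-21T13:30:00Z."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def decode_pipeline_step(step):
    """Return a copy with is_active as bool and parameters as a dict."""
    decoded = dict(step)
    decoded["is_active"] = bool(step["is_active"])
    decoded["parameters"] = json.loads(step["parameters"])
    return decoded


step = decode_pipeline_step({
    "modality": "text",
    "component_name": "xml_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": '{"schema": "tei_all", "strict": true}',
})
created = parse_timestamp("2025-01-15T10:30:00Z")
```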