Data Query API Reference¶

The Heritage Data Processor Data Query API provides read-only endpoints for retrieving project data, including files, Zenodo records, pipeline configurations, batches, API logs, credentials, and file hierarchies.

Base URL¶

All endpoint paths in this reference are relative to the URL prefix at which the application mounts this API's Blueprint.

Project Requirement¶

Project Context Required

All endpoints in this API require an active HDPC project to be loaded. They use the @project_required decorator, which returns a 400 Bad Request error if no project is loaded.
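The guard described above can be pictured with a minimal, framework-free sketch. The `ProjectManager` class, its `is_loaded` property, the error text, and the `list_batches` view are hypothetical stand-ins for illustration; only the decorator name and the 400 status come from this reference, and the real decorator may differ.

```python
from functools import wraps

# Hypothetical stand-in for the application's project state.
class ProjectManager:
    def __init__(self):
        self.db_path = None  # assumed to be set once a project is loaded

    @property
    def is_loaded(self):
        return self.db_path is not None

project_manager = ProjectManager()

def project_required(view):
    """Short-circuit with a 400-style response when no project is loaded."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        if not project_manager.is_loaded:
            # (body, status) mirrors the tuple form many Python web frameworks accept.
            return {"error": "No project loaded"}, 400
        return view(*args, **kwargs)
    return wrapper

@project_required
def list_batches():
    # Placeholder view body; a real view would query the project database.
    return [], 200
```

Calling `list_batches()` before a project is loaded returns the error tuple without running the view body; once `project_manager.db_path` is set, the view runs normally.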

File Management¶

List Source Files¶

Retrieves a paginated list of source files from the project database with optional search filtering.

Endpoint: GET /files

Query Parameters:

page (integer, optional): Page number for pagination. Defaults to 1
limit (integer, optional): Number of items per page. Defaults to 25
search (string, optional): Search term to filter files by filename using partial matching

Response:

{
  "items": [
    {
      "filename": "manuscript_001.xml",
      "relative_path": "data/manuscripts/manuscript_001.xml",
      "size_bytes": 45120,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "processed"
    },
    {
      "filename": "manuscript_002.xml",
      "relative_path": "data/manuscripts/manuscript_002.xml",
      "size_bytes": 38945,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "pending"
    }
  ],
  "totalItems": 150,
  "page": 1,
  "totalPages": 6
}

Response Fields:

items (array): List of source file objects for the current page
filename (string): Name of the file
relative_path (string): Path relative to the project root
size_bytes (integer): File size in bytes
mime_type (string): MIME type of the file
file_type (string): Classification of the file type
status (string): Processing status of the file
totalItems (integer): Total number of files matching the search criteria
page (integer): Current page number
totalPages (integer): Total number of pages based on the limit

Search Behavior:

The search parameter performs case-insensitive partial matching on the filename field using SQL LIKE with wildcards (%search%).
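As a rough illustration of this matching behavior, the sketch below runs a parameterized LIKE query against an in-memory SQLite database. The single-column table layout and the `search_filenames` helper are assumptions for the example, not the project's actual schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_files (filename TEXT)")
conn.executemany(
    "INSERT INTO source_files (filename) VALUES (?)",
    [("Manuscript_001.xml",), ("manuscript_002.xml",), ("notes.txt",)],
)

def search_filenames(conn, term):
    # The pattern is bound as a parameter; the % wildcards are added in Python.
    rows = conn.execute(
        "SELECT filename FROM source_files WHERE filename LIKE ? ORDER BY filename",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]

matches = search_filenames(conn, "MANU")
print(matches)  # both manuscript files match, regardless of case
```

SQLite's LIKE is case-insensitive for ASCII characters by default. Any `%` or `_` inside the search term itself also acts as a wildcard unless it is escaped.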

Status Codes:

200 OK: Files retrieved successfully
400 Bad Request: No project loaded

Get File Hierarchy¶

Recursively retrieves a file and all its child files (dependencies, associated files) to build a complete hierarchical tree structure.

Endpoint: GET /files/<file_id>/hierarchy

URL Parameters:

file_id (integer, required): Database ID of the root file

Response:

{
  "file_id": 42,
  "filename": "model.obj",
  "absolute_path": "/project/models/model.obj",
  "file_type": "obj",
  "status": "processed",
  "error_message": null,
  "children": [
    {
      "file_id": 43,
      "filename": "model.mtl",
      "absolute_path": "/project/models/model.mtl",
      "file_type": "mtl",
      "status": "processed",
      "error_message": null,
      "children": [
        {
          "file_id": 44,
          "filename": "texture.png",
          "absolute_path": "/project/textures/texture.png",
          "file_type": "texture",
          "status": "processed",
          "error_message": null,
          "children": []
        }
      ]
    }
  ]
}

Response Fields:

file_id (integer): Database identifier for the file
filename (string): Name of the file
absolute_path (string): Full filesystem path to the file
file_type (string): Classification of the file type
status (string): Processing status of the file
error_message (string, nullable): Error message if processing failed, otherwise null
children (array): Recursive array of child file objects with the same structure

Hierarchy Logic:

The endpoint uses the parent_file_id foreign key relationship in the source_files table to build the tree. Children are sorted alphabetically by filename.
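The tree-building logic might look roughly like the following sketch. The function name `build_hierarchy` and the reduced three-column table are assumptions for illustration; the actual implementation returns more fields and may be structured differently.

```python
import sqlite3

def build_hierarchy(conn, file_id):
    """Return the file as a nested dict, or None if the ID is unknown."""
    row = conn.execute(
        "SELECT file_id, filename FROM source_files WHERE file_id = ?", (file_id,)
    ).fetchone()
    if row is None:
        return None
    # Children are the rows whose parent_file_id points at this file.
    children = conn.execute(
        "SELECT file_id FROM source_files WHERE parent_file_id = ? ORDER BY filename",
        (file_id,),
    ).fetchall()
    return {
        "file_id": row[0],
        "filename": row[1],
        "children": [build_hierarchy(conn, child[0]) for child in children],
    }

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_files "
    "(file_id INTEGER PRIMARY KEY, filename TEXT, parent_file_id INTEGER)"
)
conn.executemany(
    "INSERT INTO source_files VALUES (?, ?, ?)",
    [(42, "model.obj", None), (43, "model.mtl", 42), (44, "texture.png", 43)],
)
tree = build_hierarchy(conn, 42)
```

This sketch assumes the parent links contain no cycles; a row that is its own ancestor would make the recursion loop indefinitely.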

Status Codes:

200 OK: Hierarchy retrieved successfully
400 Bad Request: No project loaded
404 Not Found: File with the specified ID does not exist
500 Internal Server Error: Database error during recursive fetch

Use Case:

This endpoint is designed for displaying file dependencies in modals or tree views, such as 3D models with their material files and textures.
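On the client side, a response of this shape can be flattened into indented lines for a simple tree view. The `render_tree` helper below is a hypothetical illustration that relies only on the filename, status, and children fields documented above.

```python
def render_tree(node, depth=0):
    """Return one indented line per file, visiting the tree depth-first."""
    lines = ["  " * depth + f"{node['filename']} [{node['status']}]"]
    for child in node.get("children", []):
        lines.extend(render_tree(child, depth + 1))
    return lines

hierarchy = {
    "filename": "model.obj",
    "status": "processed",
    "children": [
        {
            "filename": "model.mtl",
            "status": "processed",
            "children": [
                {"filename": "texture.png", "status": "processed", "children": []}
            ],
        }
    ],
}
print("\n".join(render_tree(hierarchy)))
```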

Zenodo Integration¶

Get Latest Zenodo Record¶

Retrieves the most recently updated Zenodo record metadata from the project database.

Endpoint: GET /zenodo_record

Response:

{
  "record_title": "Medieval Manuscript Collection - Volume 1",
  "zenodo_doi": "10.5281/zenodo.1234567",
  "record_status": "published",
  "record_metadata_json": "{\"title\": \"Medieval Manuscript Collection\", \"creators\": [...]}",
  "version": "1.2.0"
}

Response Fields:

record_title (string): Title of the Zenodo record
zenodo_doi (string): Digital Object Identifier assigned by Zenodo
record_status (string): Current status of the record (e.g., draft, published)
record_metadata_json (string): JSON-serialized metadata object containing full Zenodo metadata
version (string): Version number of the record

Empty Response:

If no Zenodo records exist in the project, the endpoint returns an empty object {}.

Ordering:

Records are ordered by last_updated_timestamp in descending order, ensuring the most recent record is returned.
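A minimal sketch of this "latest record or empty object" behavior follows. The table name `zenodo_records`, the reduced column set, and the `latest_record` helper are assumptions; only the ordering column and the empty-object fallback come from this reference.

```python
import sqlite3

def latest_record(conn):
    # Newest record first; LIMIT 1 keeps only the most recently updated row.
    row = conn.execute(
        "SELECT record_title, version FROM zenodo_records "
        "ORDER BY last_updated_timestamp DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return {}  # no records yet: mirror the documented empty object
    return {"record_title": row[0], "version": row[1]}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zenodo_records "
    "(record_title TEXT, version TEXT, last_updated_timestamp TEXT)"
)
empty = latest_record(conn)

conn.executemany(
    "INSERT INTO zenodo_records VALUES (?, ?, ?)",
    [
        ("Volume 1", "1.0.0", "2025-01-10T09:00:00Z"),
        ("Volume 1 (revised)", "1.2.0", "2025-03-02T14:00:00Z"),
    ],
)
latest = latest_record(conn)
```

Sorting on a text column works here because ISO 8601 timestamps in a uniform format sort chronologically when compared as strings.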

Status Codes:

200 OK: Record retrieved successfully (or empty object if no records exist)
400 Bad Request: No project loaded

List Record Files¶

Retrieves all files associated with a specific Zenodo record, including their upload status and metadata.

Endpoint: GET /records/<record_id>/files

URL Parameters:

record_id (integer, required): Database ID of the Zenodo record

Response:

[
  {
    "file_id": 101,
    "filename": "manuscript_001.xml",
    "absolute_path": "/project/data/manuscript_001.xml",
    "file_type": "xml",
    "pipeline_source": "xml_processor",
    "step_source": "validation",
    "upload_status": "uploaded"
  },
  {
    "file_id": 102,
    "filename": "metadata.json",
    "absolute_path": "/project/output/metadata.json",
    "file_type": "json",
    "pipeline_source": "metadata_generator",
    "step_source": "generation",
    "upload_status": "pending"
  }
]

Response Fields:

file_id (integer): Database identifier for the file
filename (string): Name of the file
absolute_path (string): Full filesystem path to the file
file_type (string): Classification of the file type
pipeline_source (string): Name of the pipeline component that produced this file
step_source (string): Specific processing step that generated the file
upload_status (string): Current upload status (e.g., pending, uploaded, failed)

Ordering:

Files are sorted by file_type in descending order, then by filename in ascending order.

Database Join:

This endpoint joins the record_files_map table with the source_files table to retrieve comprehensive file information.
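The join and the mixed-direction ordering can be sketched as below. The two table names are taken from this reference, but the column layout shown is an assumption chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_files (file_id INTEGER PRIMARY KEY, filename TEXT, file_type TEXT);
    CREATE TABLE record_files_map (record_id INTEGER, file_id INTEGER, upload_status TEXT);
    INSERT INTO source_files VALUES
        (101, 'b.xml', 'xml'), (102, 'meta.json', 'json'), (103, 'a.xml', 'xml');
    INSERT INTO record_files_map VALUES
        (7, 101, 'uploaded'), (7, 102, 'pending'), (7, 103, 'pending'), (8, 101, 'pending');
""")

rows = conn.execute(
    """
    SELECT sf.filename, sf.file_type, m.upload_status
    FROM record_files_map AS m
    JOIN source_files AS sf ON sf.file_id = m.file_id
    WHERE m.record_id = ?
    ORDER BY sf.file_type DESC, sf.filename ASC
    """,
    (7,),
).fetchall()

for row in rows:
    print(row)
```

For record 7, the xml files come first (descending file_type), sorted by filename among themselves, followed by the json file.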

Status Codes:

200 OK: Files retrieved successfully
400 Bad Request: No project loaded
500 Internal Server Error: Database query failed

Pipeline Configuration¶

Get Pipeline Steps¶

Retrieves all configured pipeline steps for the project, ordered by modality and execution order.

Endpoint: GET /pipeline_steps

Response:

[
  {
    "modality": "text",
    "component_name": "xml_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": "{\"schema\": \"tei_all\", \"strict\": true}"
  },
  {
    "modality": "text",
    "component_name": "metadata_extractor",
    "component_order": 2,
    "is_active": 1,
    "parameters": "{\"format\": \"json\"}"
  },
  {
    "modality": "image",
    "component_name": "image_processor",
    "component_order": 1,
    "is_active": 0,
    "parameters": "{\"resize\": true, \"quality\": 95}"
  }
]

Response Fields:

modality (string): Data modality or category (e.g., text, image, 3d_model)
component_name (string): Name of the pipeline component
component_order (integer): Execution order within the modality
is_active (integer): Boolean flag indicating whether the component is active (1) or disabled (0)
parameters (string): JSON-serialized parameters for the component

Ordering:

Pipeline steps are ordered first by modality, then by component_order to reflect the execution sequence.

Empty Response:

If no pipeline steps are configured, the endpoint returns an empty array [].

Status Codes:

200 OK: Pipeline steps retrieved successfully
400 Bad Request: No project loaded

Project Configuration¶

Get Configuration Settings¶

Retrieves all project-level configuration key-value pairs.

Endpoint: GET /configuration

Response:

[
  {
    "config_key": "project_name",
    "config_value": "Medieval Manuscripts Archive"
  },
  {
    "config_key": "default_output_format",
    "config_value": "xml"
  },
  {
    "config_key": "enable_auto_backup",
    "config_value": "true"
  }
]

Response Fields:

config_key (string): Configuration parameter name
config_value (string): Configuration parameter value (stored as string regardless of actual data type)
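Because every value arrives as a string, clients typically convert values themselves. The helpers below are a hypothetical sketch; in particular, the set of strings treated as true is an assumption and should be matched to what the project actually stores.

```python
def config_to_dict(entries):
    """Collapse the list of key-value objects into a plain dict."""
    return {entry["config_key"]: entry["config_value"] for entry in entries}

def as_bool(value):
    # Accepted spellings are an assumption; adjust to the project's conventions.
    return value.strip().lower() in {"true", "1", "yes"}

response = [
    {"config_key": "project_name", "config_value": "Medieval Manuscripts Archive"},
    {"config_key": "enable_auto_backup", "config_value": "true"},
]
config = config_to_dict(response)
auto_backup = as_bool(config["enable_auto_backup"])
```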

Empty Response:

If no configuration entries exist, the endpoint returns an empty array [].

Status Codes:

200 OK: Configuration retrieved successfully
400 Bad Request: No project loaded

Batch Processing¶

List Batches¶

Retrieves all processing batches created in the project, ordered by creation timestamp with the most recent first.

Endpoint: GET /batches

Response:

[
  {
    "batch_name": "Manuscript Validation - January 2025",
    "batch_description": "Validation of all manuscript files received in January",
    "status": "completed",
    "created_timestamp": "2025-01-15T10:30:00Z"
  },
  {
    "batch_name": "Image Processing - February 2025",
    "batch_description": "Batch processing of scanned images",
    "status": "in_progress",
    "created_timestamp": "2025-02-01T08:00:00Z"
  }
]

Response Fields:

batch_name (string): Descriptive name of the batch
batch_description (string): Detailed description of the batch purpose
status (string): Current processing status (e.g., pending, in_progress, completed, failed)
created_timestamp (string): ISO 8601 timestamp of batch creation

Ordering:

Batches are sorted by created_timestamp in descending order, with the most recent batches first.

Empty Response:

If no batches exist, the endpoint returns an empty array [].

Status Codes:

200 OK: Batches retrieved successfully
400 Bad Request: No project loaded

API Logging¶

List API Log Entries¶

Retrieves paginated API activity logs showing HTTP requests made to external services (e.g., Zenodo).

Endpoint: GET /apilog

Query Parameters:

page (integer, optional): Page number for pagination. Defaults to 1
limit (integer, optional): Number of items per page. Defaults to 25

Response:

{
  "items": [
    {
      "timestamp": "2025-10-21T11:45:23Z",
      "http_method": "POST",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions",
      "response_status_code": 201,
      "status": "success"
    },
    {
      "timestamp": "2025-10-21T11:40:15Z",
      "http_method": "GET",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions/1234567",
      "response_status_code": 200,
      "status": "success"
    },
    {
      "timestamp": "2025-10-21T11:35:08Z",
      "http_method": "PUT",
      "endpoint_url": "https://zenodo.org/api/deposit/depositions/1234567/files",
      "response_status_code": 500,
      "status": "failed"
    }
  ],
  "totalItems": 487,
  "page": 1,
  "totalPages": 20
}

Response Fields:

items (array): List of API log entries for the current page
timestamp (string): ISO 8601 timestamp of the API request
http_method (string): HTTP method used (e.g., GET, POST, PUT, DELETE)
endpoint_url (string): Full URL of the external API endpoint
response_status_code (integer): HTTP status code returned by the external API
status (string): Interpreted status of the request (e.g., success, failed)
totalItems (integer): Total number of log entries
page (integer): Current page number
totalPages (integer): Total number of pages based on the limit

Ordering:

Log entries are sorted by timestamp in descending order, with the most recent entries first.

Status Codes:

200 OK: Log entries retrieved successfully
400 Bad Request: No project loaded

Credentials Management¶

List API Credentials¶

Retrieves all stored API credentials for external services, without exposing sensitive credential values.

Endpoint: GET /credentials

Response:

[
  {
    "credential_name": "Zenodo Production",
    "credential_type": "zenodo_api_token",
    "is_sandbox": 0
  },
  {
    "credential_name": "Zenodo Sandbox",
    "credential_type": "zenodo_api_token",
    "is_sandbox": 1
  }
]

Response Fields:

credential_name (string): Human-readable name for the credential
credential_type (string): Type or category of the credential (e.g., zenodo_api_token, oauth_token)
is_sandbox (integer): Boolean flag indicating whether this credential is for a sandbox environment (1) or production (0)

Security:

This endpoint does not return actual credential values (API keys, tokens, passwords) for security reasons.

Ordering:

Credentials are sorted alphabetically by credential_name.

Empty Response:

If no credentials are stored, the endpoint returns an empty array [].

Status Codes:

200 OK: Credentials retrieved successfully
400 Bad Request: No project loaded

Pagination¶

Pagination Format¶

Endpoints that support pagination (/files and /apilog) use a consistent pagination structure:

Query Parameters:

page: 1-indexed page number (defaults to 1)
limit: Number of items per page (defaults to 25)

Response Format:

{
  "items": [...],
  "totalItems": 150,
  "page": 2,
  "totalPages": 6
}

Page Calculation:

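The totalPages value is calculated using ceiling division: (totalItems + limit - 1) // limit. Note that this expression on its own evaluates to 0 when totalItems is 0, so a minimum of 1 page holds only if the result is additionally clamped, for example with max(1, ...).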

Offset Calculation:

The database offset is calculated as: (page - 1) * limit.
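The two formulas above can be checked with a few lines of arithmetic, using the totals that appear in this reference's sample responses. The function names are illustrative, and `total_pages_clamped` is a hypothetical variant rather than confirmed behavior.

```python
def total_pages(total_items, limit):
    # Ceiling division without floating point.
    return (total_items + limit - 1) // limit

def total_pages_clamped(total_items, limit):
    # Hypothetical variant that never reports fewer than one page.
    return max(1, total_pages(total_items, limit))

def offset(page, limit):
    # Pages are 1-indexed, so page 1 starts at row 0.
    return (page - 1) * limit

print(total_pages(150, 25))        # 6
print(total_pages(487, 25))        # 20
print(total_pages(0, 25))          # 0
print(total_pages_clamped(0, 25))  # 1
print(offset(2, 10))               # 10
```

The bare formula evaluates to 0 for an empty result set; the clamp is one way to guarantee at least one page if that is the desired behavior.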

Error Responses¶

Standard Error Format¶

All error responses follow a consistent JSON format:

{
  "error": "Descriptive error message"
}

Common Error Scenarios¶

No Project Loaded:

Because every endpoint is wrapped in the @project_required decorator, all endpoints return 400 Bad Request when no HDPC project is loaded.

File Not Found:

GET /files/99999/hierarchy

Response: 404 Not Found with error message "File not found"

Database Query Failed:

GET /records/123/files

Response: 500 Internal Server Error with error message "Database query failed"

General Exception:

The /files/<file_id>/hierarchy endpoint catches all exceptions and returns 500 Internal Server Error with the exception message.

Database Integration¶

Query Service¶

Endpoints execute SQL queries against the project SQLite database through the database service, in most cases via its query_db function.

Connection Management:

Most endpoints use query_db, which handles connection lifecycle automatically. The /files/<file_id>/hierarchy endpoint uses get_db_connection directly for manual connection management due to its recursive nature.

Database Path:

All queries execute against project_manager.db_path, which points to the currently loaded project's database file.

Row Factory¶

Query results are returned as dictionaries with column names as keys, making them directly serializable to JSON.
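One common way to get dictionary rows from Python's sqlite3 module is a custom row factory, sketched below. Whether the database service uses this exact approach is an assumption; `dict_factory` and the two-column table are illustrative only.

```python
import json
import sqlite3

def dict_factory(cursor, row):
    # cursor.description holds one 7-tuple per column; index 0 is the column name.
    return {col[0]: value for col, value in zip(cursor.description, row)}

conn = sqlite3.connect(":memory:")
conn.row_factory = dict_factory
conn.execute("CREATE TABLE batches (batch_name TEXT, status TEXT)")
conn.execute("INSERT INTO batches VALUES ('Image Processing', 'in_progress')")

rows = conn.execute("SELECT batch_name, status FROM batches").fetchall()
payload = json.dumps(rows)  # plain dicts serialize without extra conversion
print(payload)
```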

Usage Examples¶

Example 1: Search Files with Pagination¶

GET /files?page=2&limit=10&search=manuscript

Response:

{
  "items": [
    {
      "filename": "manuscript_011.xml",
      "relative_path": "data/manuscripts/manuscript_011.xml",
      "size_bytes": 42300,
      "mime_type": "application/xml",
      "file_type": "xml",
      "status": "processed"
    }
  ],
  "totalItems": 35,
  "page": 2,
  "totalPages": 4
}
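A client that needs every matching file can follow totalPages until the last page. The sketch below is hypothetical: `BASE_URL` is a placeholder for the actual mounting point, and `fake_fetch` stands in for a real HTTP call so the example runs without a server.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "http://localhost:5000/api"  # hypothetical mounting point

def files_url(page=1, limit=25, search=None):
    params = {"page": page, "limit": limit}
    if search:
        params["search"] = search
    return f"{BASE_URL}/files?{urlencode(params)}"

def iter_all_files(fetch_json, limit=25, search=None):
    """Yield every item across pages; fetch_json(url) returns the parsed body."""
    page = 1
    while True:
        body = fetch_json(files_url(page, limit, search))
        yield from body["items"]
        if page >= body["totalPages"]:
            break
        page += 1

def fake_fetch(url):
    # Stand-in for an HTTP request: serves three items in pages of `limit`.
    query = parse_qs(urlparse(url).query)
    page, limit = int(query["page"][0]), int(query["limit"][0])
    data = ["a.xml", "b.xml", "c.xml"]
    start = (page - 1) * limit
    return {
        "items": data[start:start + limit],
        "totalItems": len(data),
        "page": page,
        "totalPages": (len(data) + limit - 1) // limit,
    }

all_items = list(iter_all_files(fake_fetch, limit=2))
```

Injecting the fetch function keeps the paging logic independent of any particular HTTP library.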

Example 2: Get Complete File Hierarchy¶

GET /files/42/hierarchy

Response:

{
  "file_id": 42,
  "filename": "scene.obj",
  "absolute_path": "/project/3d/scene.obj",
  "file_type": "obj",
  "status": "processed",
  "error_message": null,
  "children": [
    {
      "file_id": 43,
      "filename": "scene.mtl",
      "absolute_path": "/project/3d/scene.mtl",
      "file_type": "mtl",
      "status": "processed",
      "error_message": null,
      "children": [
        {
          "file_id": 44,
          "filename": "diffuse.jpg",
          "absolute_path": "/project/3d/textures/diffuse.jpg",
          "file_type": "texture",
          "status": "processed",
          "error_message": null,
          "children": []
        }
      ]
    }
  ]
}

Example 3: Monitor API Activity¶

GET /apilog?page=1&limit=5

Response:

{
  "items": [
    {
      "timestamp": "2025-10-21T13:30:00Z",
      "http_method": "POST",
      "endpoint_url": "https://sandbox.zenodo.org/api/deposit/depositions",
      "response_status_code": 201,
      "status": "success"
    }
  ],
  "totalItems": 142,
  "page": 1,
  "totalPages": 29
}

Example 4: View Pipeline Configuration¶

GET /pipeline_steps

Response:

[
  {
    "modality": "text",
    "component_name": "tei_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": "{\"encoding\": \"utf-8\", \"validate_schema\": true}"
  },
  {
    "modality": "text",
    "component_name": "metadata_enricher",
    "component_order": 2,
    "is_active": 1,
    "parameters": "{\"add_timestamps\": true}"
  }
]

Data Types & Constraints¶

Integer Fields¶

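Integer response fields (file_id, size_bytes, component_order, etc.) reflect their SQL column types and are returned as JSON numbers. The URL parameters file_id and record_id and the query parameters page and limit are parsed as integers.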

Boolean Fields¶

Boolean values are stored as integers in SQLite and returned unchanged in responses: 0 for false, 1 for true (e.g., is_active, is_sandbox).

Timestamps¶

All timestamp fields follow ISO 8601 format (e.g., 2025-10-21T13:30:00Z).
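In Python clients, such values can be turned into timezone-aware datetime objects. The `parse_timestamp` helper below is an illustrative sketch; it rewrites the trailing "Z" because `datetime.fromisoformat` only accepts that suffix from Python 3.11 onward.

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    # Normalize the "Z" suffix to an explicit UTC offset for older Python versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

ts = parse_timestamp("2025-10-21T13:30:00Z")
print(ts.isoformat())
```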

JSON Fields¶

Parameters and metadata fields are stored as JSON-serialized strings and must be parsed by the client.
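For example, a client might decode a pipeline step as sketched below. The `normalize_step` helper is hypothetical, and treating a missing or empty parameters value as an empty dict is an assumption of this sketch.

```python
import json

def normalize_step(step):
    """Return a copy with parameters decoded and is_active as a bool."""
    decoded = dict(step)
    raw_params = step.get("parameters")
    decoded["parameters"] = json.loads(raw_params) if raw_params else {}
    decoded["is_active"] = bool(step["is_active"])
    return decoded

raw = {
    "modality": "text",
    "component_name": "xml_validator",
    "component_order": 1,
    "is_active": 1,
    "parameters": "{\"schema\": \"tei_all\", \"strict\": true}",
}
step = normalize_step(raw)
```

The same `json.loads` call applies to other JSON-serialized fields such as record_metadata_json.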