Import batch data
Deephaven's batch import feature lets you efficiently load large datasets from a variety of sources (CSV, JSON, and XML files, as well as relational databases via JDBC) into optimized, partitioned tables. Use batch imports when you need to process historical data, perform bulk updates, or work with datasets that don't require real-time streaming.
Deephaven offers specialized utilities and functions for importing a variety of common data formats, including CSV, XML, JSON, and traditional relational data sources via the JDBC importer. For other formats, you can either transform the data into a supported format or implement a Custom Importer.
The basic steps of the batch import process include:
- Prepare your source data (CSV, JSON, XML, JDBC, etc.).
- Define a schema (by hand or using the schema inference tool).
- Deploy the schema.
- Import data into Deephaven tables (partitioned on disk).
- Query and analyze your data.
These importers and functions share many common features, which are discussed collectively here, with specific requirements for each format detailed as needed. There are convenience functions such as `read_csv` (sketched after the list below) and similar methods for other formats, but use the dedicated importer classes and functions when you need more fine-grained control, are working with formats other than CSV, or need any of the following:
- Control over the output data type of a column (e.g., a field that is really text but looks numeric when a convenience function reads it, such as various types of order IDs).
- Auto-mapping or custom mapping of source column names to target column names.
- The ability to handle invalid rows (such as extraneous header rows).
- Automatic multi-file handling: given a directory name and a file prefix, the importer finds all matching files and processes them in order.
- The ability to import data from different sources and formats into a single table.
- Imported data that lands in the Deephaven database, where it can be queried, analyzed, and joined with other in-system data using Deephaven's tools.
- The ability to use custom Java code to transform source data during import.
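For quick, ad-hoc loads where none of the above applies, the `read_csv` convenience function is usually sufficient. A minimal sketch, assuming a Python session where Deephaven's `read_csv` is available (the file path is hypothetical):

```python
from deephaven import read_csv

# Load a single CSV file into an in-memory table; column types are
# inferred from the data. Use the dedicated importers instead when you
# need explicit control over types, mappings, or partitions.
orders = read_csv("/data/orders.csv")
```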
Schema creation and requirements
To import data into a system table from any import source, you must first create and deploy a schema. Deploying a schema is a required step for system tables: data cannot be imported into a system table without a deployed schema. The schema definition must either match the source structure or include additional metadata describing how to map columns from the source to the target table. Schemas can be created by hand or with Deephaven's schema inference tool; see the schema inference overview documentation for details.
Every schema must include:
- The table name, the namespace, and `storageType="NestedPartitionedOnDisk"` (required for batch imports).
- Exactly one string column defined as the partitioning column.
- Other columns as needed for your data.
Storage type and partitioning
The `storageType` listed in the schema for CSV, JDBC, JSON, and XML table imports must be `NestedPartitionedOnDisk`. This storage type organizes data in a hierarchical directory structure based on partition values, enabling efficient data retrieval and management.
The partitioning column determines how data is organized on disk: each unique value creates a subdirectory, and the data for that partition is stored within it. This enables efficient organization and retrieval, especially for large datasets. If an import would replace an existing partition, the importer aborts by default unless instructed to append or replace.
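Purely as an illustration of the nested layout (actual root paths and internal partition names vary by installation), a table partitioned on `Date` might look roughly like this on disk:

```text
Partitions/
├── 2023-01-01/        # one subdirectory per partition value
│   └── CsvNames/      # data files for that partition
└── 2023-01-02/
    └── CsvNames/
```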
Column name matching: When mapping source data to your schema, column names are matched in a case-sensitive and whitespace-sensitive manner. Be careful to ensure your schema column names exactly match your source data.
Partitioning column values: For the partitioning column (e.g., `Date`), the values are handled differently depending on the import type:
- For single-partition imports: You explicitly provide the partition value as an argument to the importer (see the sketch after this list).
- For multi-partition imports: The values are automatically extracted from the source data.
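The following sketch shows the single-partition case. It is modeled on Deephaven's batch importer builders, but the import path, class name, methods, and partition syntax here are assumptions; consult the import documentation for your version before relying on it.

```python
# Sketch only: names and signatures are assumptions, not a definitive API.
# The key point is that a single-partition import receives its partition
# value ("2023-01-01") explicitly instead of reading it from the source.
from deephaven import CsvImport  # hypothetical import path

CsvImport.builder("Import", "CsvNames") \
    .setSourceFile("/data/csv_names.csv") \
    .setDestinationPartitions("localhost/2023-01-01") \
    .build() \
    .run()
```

A multi-partition import of the same data would omit the explicit partition value and instead take it from a column of the source data.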
Example schema
A simple schema file generated from a CSV data sample might look like the following:
<Table name="CsvNames" namespace="Import" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning"/>
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Department" dataType="String" columnType="Normal" />
</Table>
Note: `columnType="Normal"` is the default setting and can be omitted.
Column types
When defining columns in your schema, you'll specify a `columnType` attribute that determines how the column behaves:

- Normal (`columnType="Normal"`): Standard columns with no special behavior. This is the default setting and can be omitted.
- Partitioning (`columnType="Partitioning"`): Designates the column used for partitioning data on disk. Every schema must have exactly one partitioning column of type String.
- Grouping (`columnType="Grouping"`): Indicates columns used for grouping operations, which can improve query performance for columns frequently used in groupBy operations.
The example schema above specifies `Date` as the sole partitioning column. Deephaven creates a directory for each unique date value and stores the corresponding data files within those directories.
Once defined, the schema(s) must be deployed before they can be used.
Caution: If a schema is not deployed, importers will not recognize it and batch imports will fail. Always deploy your schema after creating it or making changes.
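A minimal sketch of the deploy step, built around the `deploy_schema()` function referenced in the troubleshooting section below; the import path and argument form are assumptions, so check your installation's schema documentation:

```python
# Sketch only: the module path and signature are assumptions. The point
# is that deployment must follow any schema creation or edit, and must
# precede the first import against that schema.
from deephaven import deploy_schema  # hypothetical import path

deploy_schema("/path/to/Import.CsvNames.schema")
```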
Advanced example schema
If you need to perform more complex transformations or handle optional/default values, you can use advanced schema features. The following example demonstrates these features:
<Table name="OrderInfo" namespace="Sample" storageType="NestedPartitionedOnDisk">
<ImportSource name="IrisXML" type="XML">
<ImportColumn name="source_Date" sourceName="Date" sourceType="String" formula="source_Date.isEmpty() ? null :DBTimeUtils.parseInstant(source_Date.replace(" ","T").replace("Z"," UTC"))" />
<ImportColumn name="Max_value" sourceName="Max-value" />
<ImportColumn name="Product_code" sourceName="Product-code" />
<ImportColumn name="BackOrder" default="null" />
<ImportColumn name="Macro_value" sourceName="Macro-value" />
</ImportSource>
<!-- Directives for overnight merge operations. Everything in one storage partition daily, chosen round-robin. -->
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__" multiday="0,1,2,3,4,5" />
<Column name="Date" dataType="java.lang.String" columnType="Partitioning" />
<Column name="source_Date" dataType="java.time.Instant" columnType="Normal" />
<Column name="Max_value" dataType="double" columnType="Normal" />
<Column name="Product_code" dataType="java.lang.String" columnType="Grouping" />
<Column name="Quantity" dataType="long" columnType="Normal" />
<Column name="BackOrder" dataType="long" columnType="Normal" />
<Column name="Warranty" dataType="char" columnType="Normal" />
<Column name="Comments" dataType="java.lang.String" columnType="Normal" />
<Column name="Macro_value" dataType="java.lang.String" columnType="Normal" />
</Table>
Let's break down this advanced schema example to understand each component:
Table Definition
<Table name="OrderInfo" namespace="Sample" storageType="NestedPartitionedOnDisk">
: Defines a table named "OrderInfo" in the "Sample" namespace, using partitioned storage for batch imports.
Import Source Configuration
<ImportSource name="IrisXML" type="XML">
: Specifies that data will be imported from XML files, with a source identifier of "IrisXML".
Import Column Mappings
`<ImportColumn>` elements define how source data maps to target columns:

- `name="source_Date" sourceName="Date"`: Maps the "Date" field from the source to "source_Date" in Deephaven.
- `formula="..."`: Transforms the date string into a proper timestamp using `parseInstant`.
- `default="null"`: For the `BackOrder` column, provides a default value when the source field is empty.
- Other entries handle column name legalization (replacing hyphens with underscores).
Partitioning Strategy
`<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__" multiday="0,1,2,3,4,5" />`:

- Uses a special formula to automatically balance data across storage partitions.
- `multiday="0,1,2,3,4,5"` specifies which days of the week (Sunday=0 through Saturday=6) data should be merged into a single partition.
Column Definitions
`<Column>` elements define the schema structure:

- `name="Date" columnType="Partitioning"`: Designates "Date" as the partitioning column.
- `dataType="java.lang.String"`: Specifies the Java data type for each column.
- `columnType="Grouping"`: For "Product_code", indicates the column is used for grouping operations.
- `columnType="Normal"`: Standard columns with no special behavior.
Column name legalization
Deephaven ensures all column names in your schema are valid and automatically maps source names as needed.
The schema generators "legalize" column names when reading details of a data set. This process follows these rules:
- If a source column name is valid for Deephaven and doesn't conflict with the partitioning column name, it's used directly.
- If there's a problem or conflict, the schema generator modifies the name to make it valid.
- When names are modified, mapping instructions are automatically added for the importer to use later.
In the advanced example schema above, you can see this in action. The source column name "Max-value" is not a valid Deephaven column name because hyphens aren't allowed. The schema generator renamed it to "Max_value" and added an `ImportColumn` entry with `sourceName="Max-value"` to handle the mapping.
Using ImportColumn
`ImportColumn` is a schema element used for batch data imports. Each entry describes how to map values for one column from the source data to the target table's schema. It's used within `ImportSource` blocks in schema files to control how data is imported from formats like CSV, JDBC, JSON, and XML.

The advanced example schema above demonstrates three key uses of `ImportColumn`:
- Mapping source to target names: When source column names differ from the desired Deephaven column names (either by choice or due to legalization), `ImportColumn` entries create the mapping, as shown in this example from the schema above:

```xml
<ImportColumn name="Max_value" sourceName="Max-value" />
```
- Setting default values: For sparse primitive columns (e.g., numeric columns that are sometimes empty), `ImportColumn` can specify default values, as demonstrated in the schema above:

```xml
<ImportColumn name="BackOrder" default="null" />
```

Without this default, the import would fail when the importer attempted to parse an empty value into a numeric type like `long`.
- Adding data conversion formulas: `ImportColumn` entries can also include formulas to parse or convert source data to specific Deephaven types:

```xml
<ImportColumn name="source_Date" sourceName="Date" sourceType="String" formula="source_Date.isEmpty() ? null : DBTimeUtils.parseInstant(source_Date.replace(&quot; &quot;, &quot;T&quot;).replace(&quot;Z&quot;, &quot; UTC&quot;))" />
```
This example from the schema shows two operations:

- Renaming from "Date" in the source file to "source_Date".
- Using a formula to reformat a string representation of date/time data for conversion to `java.time.Instant` via `parseInstant`. For example, a source value of `2023-01-01 12:34:56Z` would be rewritten to `2023-01-01T12:34:56 UTC` before parsing.

Note that the type of the `source_Date` column in your schema must match the output type of `parseInstant` (in this case, `java.time.Instant`).
Caching
Caching improves performance by storing transformation results for repeated values, reducing redundant computation, and ensuring consistent results. It's automatically enabled for columns with formulas, complex types, or custom transformations. By default, Deephaven maintains a map of source values to calculated values and reuses results for repeated values.
Consider disabling caching when (a schema sketch follows this list):
- Working with columns that have unique values (e.g., UUIDs, timestamps).
- Memory usage is a concern and the dataset has few repeated values.
- Using custom transformations that are memory-intensive.
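As a sketch of how disabling caching might look in a schema, the following uses a `cacheable` attribute on `ImportColumn`. The attribute name is an assumption here; verify it against the schema reference for your version before relying on it.

```xml
<!-- Assumed attribute: turns off the source-value-to-result cache for a
     column whose values (e.g., UUIDs) rarely repeat. -->
<ImportColumn name="RequestId" sourceName="request-id" cacheable="false" />
```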
Troubleshooting
Schema not found
- Symptom: `Schema '<Namespace>/<TableName>' not found`
- Solution: Ensure you've deployed the schema using `deploy_schema()` before importing data.
Invalid partitioning column
- Symptom: `Partitioning column must be of type String`
- Solution: Verify that your partitioning column is defined as `java.lang.String` in your schema and marked with `columnType="Partitioning"`. As noted in the schema requirements above, every batch import schema must include exactly one string column defined as the partitioning column.
Column name mismatches
- Symptom: `Column 'symbol' not found in schema`
- Solution: Check for case sensitivity and whitespace in column names. Deephaven's matching is exact.
Memory issues during import
- Symptom: `OutOfMemoryError` during large imports
- Solution:
  - See the Caching section above for guidance on disabling caching for columns with unique values.
  - Increase the JVM heap size.
  - Process the data in smaller batches using partitioning.
File not found
- Symptom: `java.io.FileNotFoundException`
- Solution: Verify that the file path is correct and accessible to Deephaven.