When using a tHiveInput component in a standard Job, a query that mixes lowercase and uppercase column names returns all columns, even if the schema contains only lowercase column names. The same query with the same schema, run in a Big Data Batch Job, does not return the column whose name is given in uppercase in the query.
Root Cause
This is a known behavior, and Hive is not at fault here. Hive is not case sensitive and always uses lowercase names, regardless of the case used in the Studio. The difference between a DI Job and a Spark Job is that Spark uses Avro.
With Hive, you can request fields with any case, but Avro, which Spark uses, is case sensitive. Moreover, Avro field names are created with the case used in the Hive query, while the other components retrieve fields using the case defined in the Studio schema.
This means that if the field names in your Hive query do not match the Studio schema, the values are retrieved from Hive but are not found in the Avro payload.
Example:
Studio schema: col1, COL2
Hive stores these columns as: col1, col2
A Hive query "SELECT col1, col2 FROM ..." retrieves the col1 and col2 columns
An Avro payload is created with the fields col1 and col2
The Studio then looks up the fields col1 and COL2
Result: col1, null
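The failed lookup can be sketched outside Talend. The snippet below (a simplified illustration, not Studio code; the record values are invented) models the Avro payload as a Python dictionary, whose key lookup is case sensitive just like Avro field names:

```python
# The Hive query used lowercase names, so the Avro payload
# is created with the fields col1 and col2.
avro_record = {"col1": "a", "col2": "b"}

# The Studio schema asks for its own column names, case-sensitively.
studio_schema = ["col1", "COL2"]

# COL2 is not present in the payload, so the lookup yields None (null).
result = [avro_record.get(name) for name in studio_schema]
print(result)  # ['a', None]
```

The value for col1 comes through, but COL2 resolves to null because no field with that exact spelling exists in the payload.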
Doing it the Avro way (matching the case):
Studio schema: col1, COL2
Hive stores these columns as: col1, col2
A Hive query "SELECT col1, COL2 FROM ..." retrieves the col1 and col2 columns
An Avro payload is created with the fields col1 and COL2
The Studio then looks up the fields col1 and COL2
Result: col1, COL2
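The same dictionary sketch (again a simplified illustration with invented values, not Studio code) shows why matching the query case to the schema fixes the lookup:

```python
# The Hive query used COL2, so the Avro payload is created
# with the fields col1 and COL2 -- matching the Studio schema.
avro_record = {"col1": "a", "COL2": "b"}

studio_schema = ["col1", "COL2"]

# Both names now match exactly, so both values are found.
result = [avro_record.get(name) for name in studio_schema]
print(result)  # ['a', 'b']
```

Because the Avro field names were created from the query and the query used the schema's case, every lookup succeeds.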
Solution
In a Spark Job, the case of the column names in the query must match the case of the column names in the schema.
Workaround
Alternatively, use only lowercase letters for column names in both the schema and the query.