Skip to content
This repository was archived by the owner on Sep 29, 2025. It is now read-only.

Query and Write Excel documents using SQL in Apache Hive

Jörn Franke edited this page Dec 19, 2018 · 15 revisions

Prerequisite

You have deployed the Hive Serde. As described there two SerDe exists:

  • One to represent the data in the Excel sheet as primitive data types, e.g. string, decimal, boolean, date etc. (ExcelSerde)
  • One to represent the cells in the Excel sheet (as of version 1.2.3). Each Hive row corresponds to a cell in Excel according to five columns (formattedValue, comment, formula, address, sheetName) (ExcelSpreadSheetCellDAOSerde)

Optionally you can put the file testsimple.xlsx for the ExcelSerde or the file excel2013test.xlsx for the ExcelSpreadSheetCellDAOSerde on HDFS in the folder /user/office/files.

However, you can try it out with any of your files by adapting the commands below.

Usage (ExcelSerde)

You can create any database in Hive:

create database excel;

You can then create a table containing Excel data as follows:

use excel;

Afterwards you need to define a schema of your Excelfile (note only primitive datatypes, such as boolean, int, byte, string, decimal, float, date etc. are supported). See also Hive Serde to configure detection of date, times, decimal etc. formats. The schema describes the datatypes used in each cell of the Excel file. For instance, the create external table statement for this file would be the following if the file is stored in '/user/office/files':

create external table ExcelTable(decimalsc1 decimal(3,2), booleancolumn boolean, datecolumn date, stringcolumn string, decimalp8sc3 decimal(8,3), bytecolumn tinyint, shortcolumn smallint, intcolumn int, longcolumn bigint) ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde' STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat' OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat' LOCATION '/user/office/files' TBLPROPERTIES("hadoopoffice.read.simple.decimalFormat"="DE","hadoopoffice.read.header.read"="true", "hadoopoffice.read.locale.bcp47"="DE","hadoopoffice.write.locale.bcp47"="DE");

As you can see options are defined in the TBLPROPERTIES

You can then execute some commands on the table:

Show the schema:

describe ExcelTable;

Get number of rows

select count(*) from ExcelTable;

Display the first 10 rows

select * from ExcelTable LIMIT 10;

You can create a new Excel file by using CTAS:

create  table ExcelOut ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde' STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat' OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat' LOCATION '/user/office/output' TBLPROPERTIES("office.hive.write.defaultSheetName"="FirstSheet","hadoopoffice.read.header.read"="true","hadoopoffice.write.header.write"="true", "hadoopoffice.write.mimeType"="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "hadoopoffice.read.simple.decimalFormat"="DE","hadoopoffice.write.simple.decimalFormat"="DE","hadoopoffice.read.locale.bcp47"="DE","hadoopoffice.write.locale.bcp47"="DE")
AS 
select * from ExcelTable;

In this example the source data is coming from ExcelTable (defined above) and the first row contains the header names in Hive. Note that we explicitly specify the decimalFormat to be German (DE), because this is the case for the source table.

You can find more examples in the example folder.

Usage (ExcelSpreadSheetCellDAOSerde, as of release 1.2.3)

You can create any database in Hive:

create database excel;

You can then create a table containing Excel data as follows:

use excel;

Afterwards, you create an external table representing an Excel data stored in /user/office/files

Contrary to the normal Excel Serde each row in this Serde describes one cell in Excel, ie the table has 5 columns: formattedValue, comment, formula, address, sheetName (see here).

Note you can specify all options of the HadoopOffice library.

create external table ExcelDAOTable ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.daoserde.ExcelSpreadSheetCellDAOSerde' STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelCellFileInputFormat' OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelCellFileOutputFormat' LOCATION '/user/office/files' TBLPROPERTIES("hadoopoffice.read.locale.bcp47"="DE","hadoopoffice.write.locale.bcp47"="DE");

As you can see options are defined in the TBLPROPERTIES

You can display the schema as follows

describe ExcelDAOTable;

You can display the number of rows (corresponding to the total number of cells in the Excel!)

select count(*) from ExcelDAOTable;

You can display the first 10 cells as follows:

select * from ExcelDAOTable LIMIT 10;

You can create a new Excel file by using CTAS:

create  table ExcelDAOOut ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.daoserde.ExcelSpreadSheetCellDAOSerde' STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelCellFileInputFormat' OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelCellFileOutputFormat' LOCATION '/user/office/output' TBLPROPERTIES("hadoopoffice.write.mimeType"="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "hadoopoffice.read.locale.bcp47"="DE","hadoopoffice.write.locale.bcp47"="DE")
AS 
select * from ExcelDAOTable;

In this example the source data is coming from ExcelDAOTable (defined above). Please note that the ExcelDAOTable has 5 columns of type String where each row correspond to one cell in Excel, e.g. it has the columns formattedValue,comment,formula,address, sheetName. You can select from any table that returns 5 string columns corresponding to the information in an Excel cell. This means you can also use data from non-Excel tables to be written to the Excel.

You can find more examples in the example folder.

Clone this wiki locally