This article describes Dell EMC ECS's support for Apache Hadoop file formats in terms of disk space utilization. To determine this, we will use the Apache Hive service to create and store tables in different file formats and analyze the disk space used by each table on ECS storage.
Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query data files created by other Hadoop components such as Pig, Spark, MapReduce, etc. In this article, we will examine the Apache Hive file formats TextFile, SequenceFile, RCFile, Avro, ORC, and Parquet. Cloudera Impala also supports these file formats.
To begin with, let us understand a bit about these big data file formats. Different file formats and compression codecs work better for different data sets in Hadoop; the main objective of this article is to verify their support on Dell EMC ECS storage, an S3-compatible object store for Hadoop clusters.
The following are the Hadoop file formats:
Text File: This is the default storage format. You can use the text format to interchange data with other client applications. The text file format is very common for most applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).
The text format is a simple plain file format. You can use compression (for example, BZip2) on text files to reduce the storage space.
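For example, a text-format table can be written with BZip2-compressed output by enabling output compression before a CREATE TABLE AS SELECT. A minimal sketch in HiveQL (the table and source names are illustrative):

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Write the query results as comma-delimited text files compressed with BZip2
CREATE TABLE example_text_bz2
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM example_source_table;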
Sequence File: These are Hadoop flat files that store data as binary key-value pairs. Sequence files are in binary format and are splittable. The main advantage of sequence files is that two or more small files can be merged into one file.
RC File: The RCFile (Record Columnar File) is a columnar file format mainly used in Hive data warehouses and offers a high compression rate. If you have a requirement to process multiple rows at a time, you can use the RCFile format. The RCFile is very much like the sequence file format; it also stores data as key-value pairs.
AVRO File: Avro is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in big data Hadoop-based applications.
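For example, Hive's AvroSerDe can derive a table's columns directly from an Avro schema supplied as a table property. A minimal sketch (the table name and schema are illustrative, not part of the test below):

-- Column names and types come from the Avro schema literal below
CREATE TABLE example_avro
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "FlightRecord",
  "fields": [
    {"name": "flightnum", "type": "int"},
    {"name": "origin",    "type": "string"},
    {"name": "dest",      "type": "string"}
  ]
}');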
ORC File: ORC stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in Hive tables. This file format was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.
More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
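As an illustration of the options described in that manual, ORC-specific settings such as the compression codec and stripe size can be passed as table properties. A minimal sketch (the table name, source table, and property values are illustrative):

-- ORC table compressed with ZLIB and a 64 MB stripe size
CREATE TABLE example_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB", "orc.stripe.size"="67108864")
AS SELECT * FROM example_source_table;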
Parquet File: Parquet is a column-oriented binary file format that is highly efficient for large-scale queries. Parquet is especially good for queries that scan particular columns within a table. Parquet tables support compression codecs such as Snappy and gzip; Snappy is currently the default.
More information on the Parquet file format: https://parquet.apache.org/documentation/latest/
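For instance, the Parquet compression codec can be selected per table through a table property. A minimal sketch (the table and source names are illustrative; "parquet.compression" is the property used by Hive's Parquet support):

-- Parquet table using Snappy compression (the current default)
CREATE TABLE example_parquet
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
AS SELECT * FROM example_source_table;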
Please note that a Cloudera CDP Private Cloud Base 7.1.6 Hadoop cluster is used for the testing below.
Disk Space Utilization on Dell EMC ECS
How much disk space do these formats use in Hadoop on Dell EMC ECS? Saving disk space is always a good thing, but it can be hard to calculate exactly how much space will be used with compression. Every file and data set is different, and the data inside will always be a determining factor in the compression you get. Text will compress better than binary data, repeating values and strings will compress better than pure random data, and so forth.
As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download measures 108.5 MB, and the uncompressed data measures 657.5 MB. We then uploaded the data to Dell EMC ECS via the s3a protocol and created an external table on top of the uploaded data set:
Copy the original data set to the Hadoop cluster:
[root@hop-kiran-n65 ~]# ll
total 111128
-rwxr-xr-x  1 root root 113753229 May 28 02:25 2008.csv.bz2
-rw-------. 1 root root      1273 Oct 31  2020 anaconda-ks.cfg
-rw-r--r--. 1 root root     36392 Dec 15 07:48 docu99139
[root@hop-kiran-n65 ~]# hadoop fs -put ./2008.csv.bz2 s3a://hive.ecs.bucket/diff_file_format_db/bz2/
[root@hop-kiran-n65 ~]# hadoop fs -ls s3a://hive.ecs.bucket/diff_file_format_db/bz2/
Found 1 items
-rw-rw-rw-   1 root root 113753229 2021-05-28 02:00 s3a://hive.ecs.bucket/diff_file_format_db/bz2/2008.csv.bz2
[root@hop-kiran-n65 ~]#
From a Hadoop compute node, create a database with its data location on the ECS bucket, and create an external table for the flights data uploaded to the ECS bucket location:
DROP DATABASE IF EXISTS diff_file_format_db CASCADE;

CREATE DATABASE diff_file_format_db
COMMENT 'Holds all the tables data on ECS bucket'
LOCATION 's3a://hive.ecs.bucket/diff_file_format_db';

USE diff_file_format_db;

CREATE EXTERNAL TABLE flight_arrivals_txt_bz2 (
  year int,
  month int,
  DayofMonth int,
  DayOfWeek int,
  DepTime int,
  CRSDepTime int,
  ArrTime int,
  CRSArrTime int,
  UniqueCarrier string,
  FlightNum int,
  TailNum string,
  ActualElapsedTime int,
  CRSElapsedTime int,
  AirTime int,
  ArrDelay int,
  DepDelay int,
  Origin string,
  Dest string,
  Distance int,
  TaxiIn int,
  TaxiOut int,
  Cancelled int,
  CancellationCode int,
  Diverted int,
  CarrierDelay string,
  WeatherDelay string,
  NASDelay string,
  SecurityDelay string,
  LateAircraftDelay string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://hive.ecs.bucket/diff_file_format_db/bz2/';
The total number of records in this master table is
SELECT COUNT(*) FROM flight_arrivals_txt_bz2;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
Similarly, create tables in the different file formats using the master table. Different file format tables can be created by simply specifying the 'STORED AS FileFormatName' option at the end of a CREATE TABLE command:
CREATE EXTERNAL TABLE flight_arrivals_external_orc STORED AS ORC AS SELECT * FROM flight_arrivals_txt_bz2;
CREATE EXTERNAL TABLE flight_arrivals_external_parquet STORED AS PARQUET AS SELECT * FROM flight_arrivals_txt_bz2;
CREATE EXTERNAL TABLE flight_arrivals_external_textfile STORED AS TEXTFILE AS SELECT * FROM flight_arrivals_txt_bz2;
CREATE EXTERNAL TABLE flight_arrivals_external_sequencefile STORED AS SEQUENCEFILE AS SELECT * FROM flight_arrivals_txt_bz2;
CREATE EXTERNAL TABLE flight_arrivals_external_rcfile STORED AS RCFILE AS SELECT * FROM flight_arrivals_txt_bz2;
CREATE EXTERNAL TABLE flight_arrivals_external_avro STORED AS AVRO AS SELECT * FROM flight_arrivals_txt_bz2;
Disk space utilization of the tables
Now, let us compare the disk usage on ECS of all the tables, as seen from a Hadoop compute node:
[root@hop-kiran-n65 ~]# hadoop fs -du -h s3a://hive.ecs.bucket/diff_file_format_db/ | grep flight_arrivals
597.7 M  597.7 M  s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M   93.5 M   s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M  146.6 M  s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
403.1 M  403.1 M  s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_rcfile
751.1 M  751.1 M  s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_sequencefile
670.7 M  670.7 M  s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_textfile
[root@hop-kiran-n65 ~]#
Summary
From the table below, we can conclude that Dell EMC ECS as S3 storage supports all of the Hadoop file formats and provides the same disk utilization as traditional HDFS storage.
For the Compressed% column, lower is better; for the Compression Ratio column, higher is better.
Format     | Size    | Compressed% | Compression Ratio
---------- | ------- | ----------- | -----------------
CSV (Text) | 670.7 M | –           | –
BZ2        | 108.5 M | 16.18%      | 83.82%
ORC        | 93.5 M  | 13.94%      | 86.06%
Parquet    | 146.6 M | 21.85%      | 78.15%
RC File    | 403.1 M | 60.10%      | 39.90%
AVRO       | 597.7 M | 89.12%      | 10.88%
Sequence   | 751.1 M | 111.97%     | -11.97%
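As a worked example of how these columns are computed, Compressed% is the table size as a fraction of the uncompressed CSV size, and Compression Ratio is the space saved. For ORC: 93.5 M / 670.7 M ≈ 13.94% (Compressed%), and 100% - 13.94% = 86.06% (Compression Ratio).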
Here the default settings and values were used to create all of the different file format tables; no other optimizations were done for this testing. Each file format ships with many options and optimizations for compressing the data, but only the defaults that ship with CDP Private Cloud Base 7.1.6 were used.