One of the most important aspects of architecting a Big Data solution is choosing the proper data storage options in Hadoop/Spark. Hadoop does not mandate a standard data storage format; as a standard file system, it allows storage of data in any format, whether text, binary, image or other. Users of Hadoop therefore have complete control and numerous options for how data is stored in HDFS. These options apply not only to the raw data being processed, but also to the intermediate data generated during processing and to the derived data produced as its result.
This blog summarizes the different data/file formats used across the Big Data stack, including Spark, Hadoop, Hive and others.
Standard File Formats
Some standard file formats in Hadoop are text files (CSV, XML) and binary file types (images).
(a) Text Data: One of the most common uses of Hadoop is the storage and analysis of logs, such as web logs and server logs. This data usually comes in the form of CSV files or unstructured data such as emails. CSV files are still quite common and often used for exchanging data between Hadoop and external systems. They are readable and ubiquitously parsable, and they come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database. However, CSV files do not support block compression, so compressing a CSV file in Hadoop often comes at a significant read performance cost.
When working with Text/CSV files in Hadoop, never include header or footer lines. Each line of the file should contain a record. This, of course, means that there is no metadata stored with the CSV file. You must know how the file was written in order to make use of it. Also, since the file structure is dependent on field order, new fields can only be appended at the end of records while existing fields can never be deleted. As such, CSV files have limited support for schema evolution.
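The "no header, known field order" convention above can be sketched with Python's stdlib csv module. The field names and sample rows here are hypothetical; the point is that the schema lives in the consuming code, not in the file:

```python
import csv
import io

# Hypothetical field order -- with no header line, the reader must
# already know how the file was written.
FIELDS = ["host", "timestamp", "status", "bytes"]

# Two headerless records, one per line, as they would sit in HDFS.
raw = (
    "10.0.0.1,2021-01-01T00:00:00,200,512\n"
    "10.0.0.2,2021-01-01T00:00:05,404,0\n"
)

# Zip the externally known field order onto every row.
records = [dict(zip(FIELDS, row)) for row in csv.reader(io.StringIO(raw))]

# The meaning of column 3 lives in our code, not in the file itself.
status_codes = [r["status"] for r in records]
```

Appending a new field at the end of each record keeps old readers working, which is exactly why appending is the only schema change CSV tolerates.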
Text files are ordinary files, either structured flat files such as Comma-Separated Values (CSV) and Tab-Separated Values (TSV), or unstructured text such as emails, with one record per line and fields separated by delimiters. CSV and TSV are common flat-file formats that can be used with different data processing platforms such as Hadoop, Spark, Pig and Hive.
(b) Structured Text Data: This is a more specialized form of text file, such as XML or JSON. These formats bring a special challenge in Hadoop, as splitting XML and JSON files is tricky and Hadoop does not provide a built-in InputFormat for either. Processing JSON is even more challenging than XML, since JSON has no tokens to mark the beginning or end of a record.
JSON records are different from JSON files in that each line is its own JSON datum, making the files splittable. Unlike CSV files, JSON stores metadata with the data, fully enabling schema evolution. However, like CSV files, JSON files do not support block compression. Additionally, JSON support was a relative latecomer to the Hadoop toolset and many of the native serdes contain significant bugs. Fortunately, third-party serdes are frequently available and often solve these challenges.
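A minimal illustration of the JSON-records layout, using Python's stdlib json module (the record contents are made up). Because every line is a complete JSON datum, a framework can split the file at any newline boundary:

```python
import json

# One self-contained JSON datum per line ("JSON records"): any line is
# parseable on its own, so the file can be split at newlines.
jsonl = '{"user": "a", "clicks": 3}\n{"user": "b", "clicks": 7}\n'

records = [json.loads(line) for line in jsonl.splitlines()]

# Unlike CSV, the field names travel with every record, so a record
# written with extra fields does not break older readers.
total_clicks = sum(r["clicks"] for r in records)
```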
(c) Sequence Files: Sequence files store data in a binary format with a structure similar to CSV. Like CSV, sequence files do not store metadata with the data, so the only schema evolution option is appending new fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are often used only for in-flight data, such as intermediate storage within a sequence of MapReduce jobs.
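The key/value layout of a sequence file can be illustrated with a toy length-prefixed binary container. To be clear, this is not the real SequenceFile wire format (which has its own header, sync markers and compression blocks); it only sketches the "binary key/value pairs, no self-describing metadata" idea:

```python
import struct

def write_records(pairs):
    """Toy binary key/value container in the spirit of a sequence file.
    Each record is length-prefixed: 4-byte key length, key bytes,
    4-byte value length, value bytes. NOT the real SequenceFile format.
    """
    out = bytearray()
    for key, value in pairs:
        k, v = key.encode(), value.encode()
        out += struct.pack(">I", len(k)) + k
        out += struct.pack(">I", len(v)) + v
    return bytes(out)

def read_records(buf):
    """Walk the byte stream record by record; nothing in the file says
    what the keys or values mean -- the reader has to know."""
    pairs, pos = [], 0
    while pos < len(buf):
        (klen,) = struct.unpack_from(">I", buf, pos); pos += 4
        key = buf[pos:pos + klen].decode(); pos += klen
        (vlen,) = struct.unpack_from(">I", buf, pos); pos += 4
        value = buf[pos:pos + vlen].decode(); pos += vlen
        pairs.append((key, value))
    return pairs

blob = write_records([("0", "first record"), ("1", "second record")])
```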
Text files are the most common source data format stored in HDFS, but Hadoop can also be used to process binary files such as images. To store and process binary files, a container format such as SequenceFile is preferred.
Serialization Formats
Serialization refers to the process of turning data structures into byte streams, either for storage or for transmission over a network. Conversely, deserialization is the process of converting a byte stream back into data structures. Serialization allows data to be converted into a format that can be efficiently stored as well as transferred across a network connection.
Hadoop uses Writables as its main serialization format. Writables are compact and fast, but difficult to extend or use from languages other than Java. However, there are other serialization frameworks in use within the Hadoop ecosystem, as described below.
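As a rough analogy, Python's pickle plays a role for Python similar to the one Writables play for Java: a compact, fast serialization format that other languages cannot easily read back:

```python
import pickle

record = {"host": "10.0.0.1", "status": 200}

# Serialization: data structure -> byte stream (for storage or the wire).
payload = pickle.dumps(record)

# Deserialization: byte stream -> data structure.
restored = pickle.loads(payload)

# Like Hadoop Writables, the byte stream is compact but language-bound:
# only Python can reasonably deserialize it, which is exactly the
# portability problem the cross-language frameworks below address.
```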
(a) Thrift: Thrift was developed at Facebook as a framework for implementing cross-language interfaces to services. It uses an Interface Definition Language (IDL) to define interfaces, and the IDL file is used to generate stub code for implementing RPC clients and servers across languages. Although Thrift can be used for data serialization with Hadoop, it has several drawbacks: its files are neither splittable nor compressible, and it lacks native support in Hadoop.
(b) Protocol Buffers (protobuf): Protocol Buffers was developed at Google to facilitate data exchange between services written in different languages. Like Thrift, it is defined via an IDL, which is used to generate stub code for multiple languages. Also like Thrift, its files are neither compressible nor splittable, and it has no native MapReduce support.
(c) Avro Files: Avro is a language-neutral data serialization system designed to address the major downside of Hadoop Writables: lack of language portability. Like Thrift and Protocol Buffers, it is described through a language-independent schema. Avro stores the schema in the header of each file, which means it stores the metadata with the data. If you use one language to write an Avro file, you can use any other language to read the file later.
In addition to better native support for MapReduce, Avro data files are splittable and block-compressible. An important point that makes Avro better suited than sequence files for Hadoop-based applications is its support for schema evolution: the schema used to read an Avro file does not need to match the schema used to write it. This makes Avro the epitome of schema evolution support, since you can rename, add, delete and change the data types of fields by defining a new independent schema, which makes it possible to add new fields as requirements change. Avro schemas are usually written in JSON, but may also be written in Avro IDL, which is a C-like language. As just noted, the schema is stored as part of the file metadata in the file header.
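The reader-schema/writer-schema resolution described above can be sketched in plain Python. This is a conceptual toy, not the Avro library: the reader's schema supplies defaults for fields the writer never wrote, and the field names and defaults here are hypothetical:

```python
# A record written under the old (writer's) schema.
WRITER_RECORD = {"name": "alice", "age": 30}

# A hypothetical newer reader schema: field -> default value.
# A default of None marks the field as required.
READER_SCHEMA = {
    "name": None,
    "age": None,
    "country": "unknown",  # field added later, readable against old data
}

def resolve(record, reader_schema):
    """Avro-style schema resolution: fill missing fields from the
    reader schema's defaults, fail on missing required fields."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not None:
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return resolved

evolved = resolve(WRITER_RECORD, READER_SCHEMA)
```

Real Avro performs this resolution using the writer schema embedded in the file header plus the reader-supplied schema, and also handles renames via aliases and type promotions.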
Columnar Formats
In modern Big Data applications, numerous (NoSQL) databases have introduced columnar storage, which provides several benefits over traditional row-oriented databases:
- I/O and decompression can be skipped for columns that are not part of the query. This pays off when only a small subset of columns is accessed; if a query touches many columns at once, a row-oriented layout is preferable.
- Because values within the same column are more similar to each other than the values within a block of rows, columnar storage is very efficient in terms of compression.
- It is also useful for ETL or data warehousing use cases where the client wants to aggregate column values over a large collection of records.
Many Hadoop vendors such as Cloudera, Hortonworks and MapR utilize columnar file formats in their own Hadoop products. Some of these file formats are explained below.
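The row-versus-column trade-off can be shown with two in-memory layouts of the same (made-up) table. In the columnar layout, aggregating one column never touches the bytes of the others:

```python
# Row-oriented layout: every query touches whole records.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 5.5},
    {"id": 3, "name": "c", "amount": 2.5},
]

# Column-oriented layout of the same data: each column's values are
# stored contiguously, so a query over one column reads only that column.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 5.5, 2.5],
}

# Aggregation over a single column -- the warehousing case above.
total = sum(columns["amount"])

# Values within one column are homogeneous (all floats here), which is
# what makes per-column compression so effective on disk.
```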
(a) RC Files
RC Files, or Record Columnar Files, were the first columnar file format adopted in Hadoop. Like columnar databases, the RC file enjoys significant compression and query performance benefits. The RCFile format was developed specifically to provide efficient processing for MapReduce applications: fast data loading, fast query processing and highly efficient storage space utilization, although in practice it has only seen use as a Hive storage format. The RCFile format breaks files into row splits, then within each split uses column-oriented storage.
However, the current serdes for RC files in Hive and other tools do not support schema evolution. In order to add a column to your data you must rewrite every pre-existing RC file. Also, although RC files are good for query, writing an RC file requires more memory and computation than non-columnar file formats. They are generally slower to write.
RCFile is a data placement structure designed to minimize the space required for relational data in HDFS (Hadoop Distributed File System), rewriting data into its layout via the MapReduce framework. RCFiles are flat files consisting of binary key/value pairs, sharing much similarity with SequenceFile, but they store the columns of a table in a record-columnar way: rows are first partitioned horizontally into row splits, and then each row split is partitioned vertically in a columnar way.
(b) ORC Files
ORC Files, or Optimized RC Files, were invented to optimize performance in Hive and are primarily backed by Hortonworks. ORC files enjoy the same benefits and limitations as RC files, just done better for Hadoop: ORC files compress better than RC files, enabling faster queries. However, they still don't support schema evolution. Some benchmarks indicate that ORC files compress to be the smallest of all file formats in Hadoop. ORC is also a splittable storage format that supports the Hive type model, including new primitives such as decimal and the complex types.
A drawback of ORC as of this writing is that it was designed specifically for Hive, so it is not a general-purpose storage format usable with non-Hive MapReduce interfaces such as Pig or Java, or with other query engines such as Impala; notably, at the time of this writing, Cloudera Impala does not support ORC files.
An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer. The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS. The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
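A sketch of why those per-stripe column aggregates matter: a reader can consult min/max statistics and skip entire stripes that cannot satisfy a predicate. The stripe size and data here are tiny and made up for illustration (real ORC defaults to 250 MB stripes):

```python
# Toy stripe size; real ORC stripes default to 250 MB of row data.
STRIPE_ROWS = 3

# One column's values across the whole file.
values = [4, 7, 1, 12, 15, 11, 30, 28, 25]

# Break the column into stripes and record footer-style statistics.
stripes = [values[i:i + STRIPE_ROWS] for i in range(0, len(values), STRIPE_ROWS)]
stats = [{"count": len(s), "min": min(s), "max": max(s), "sum": sum(s)}
         for s in stripes]

def stripes_to_read(predicate_min):
    """Indexes of stripes that MIGHT contain values >= predicate_min.
    Stripes whose max is below the predicate are skipped entirely."""
    return [i for i, st in enumerate(stats) if st["max"] >= predicate_min]

# For "value >= 20", only the last stripe can possibly match.
hits = stripes_to_read(20)
```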
(c) Parquet Files
Parquet is a columnar data format suitable for different MapReduce interfaces such as Java, Hive and Pig, and also for other processing engines such as Impala and Spark.
Parquet's performance is about as good as RC and ORC, though it is generally slower to write than other columnar formats. Unlike RC and ORC files, Parquet serdes support schema evolution, although any new column has to be added at the end of the structure.
At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such as Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala.
Modern data and solution architects of Big Data applications are always concerned with reducing storage requirements and improving data processing performance. Compression is another important consideration for storing data in Hadoop that helps address both problems. Please refer to my post to learn more about the compression techniques in Hadoop.
Choosing a File Format in Hadoop
There are mainly three types of performance to consider when choosing a file format in Hadoop framework.
Write performance – how fast can the data be written.
Partial read performance – how fast can you read individual columns within a file.
Full read performance – how fast can you read every data element in a file.
A columnar, compressed file format like Parquet or ORC may optimize partial and full read performance, but they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write but due to the lack of compression and column-orientation are slow for reads. You may end up with multiple copies of your data each formatted for a different performance profile.
Environmental factors for choosing file format
As discussed, each file format is optimized by purpose. Your choice of format is driven by your use case and environment. Here are the key factors to consider:
Hadoop Distribution- Cloudera and Hortonworks support/favor different formats
Schema Evolution- Will the structure of your data evolve?
Processing Requirements- Will you be crunching the data and with what tools?
Read/Query Requirements- Will you be using SQL on Hadoop? Which engine?
Extract Requirements- Will you be extracting the data from Hadoop for import into an external database engine or other platform?
Storage Requirements- Is data volume a significant factor? Will you get significantly more bang for your storage buck through compression?
So, with all the options and considerations, are there any obvious choices? If you are storing intermediate data between MapReduce jobs, then sequence files are preferred. If query performance against the data is most important, ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala) are optimal — but these files will take longer to write. (We have also seen order-of-magnitude query performance improvements when using Parquet with Spark SQL.) Avro is great if your schema is going to change over time, but query performance will be slower than with ORC or Parquet. CSV files are excellent if you are going to extract data from Hadoop to bulk load into a database.
RC vs ORC File format
ORC file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
Compared with the RCFile format, for example, the ORC file format has many advantages, such as:
- a single file as the output of each task, which reduces the NameNode’s load
- Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
- light-weight indexes stored within the file
- block-mode compression based on data type
- concurrent reads of the same file using separate RecordReaders
- ability to split files without scanning for markers
- bound the amount of memory needed for reading or writing
- metadata stored using Protocol Buffers, which allows addition and removal of fields