Testing Hive fileformat storage formats: how each format affects the overall size of the data (not to be confused with Compression)


First, a note: the following only walks through the common storage formats. In practice we usually store data directly in ORC or PARQUET.

一、Common Hive storage formats, and several ways to create the default textfile table

1、Common Hive file storage formats (fileformat)

  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)

2、Check the current fileformat

Running set in Hive without assigning a value acts as a get and prints the current setting. The default fileformat is TextFile:
hive (default)> set hive.default.fileformat;
hive.default.fileformat=TextFile
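
If you want plain create table statements to default to another format, the same property can be assigned (a session-level sketch; ORC here is only an example value, and the property can also be set permanently in hive-site.xml):

hive (default)> set hive.default.fileformat=ORC;
hive (default)> set hive.default.fileformat;
hive.default.fileformat=ORC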

3、Check the format of a specific table

hive (default)> create table data (id int,url string);
hive (default)> desc formatted data;
OK
# col_name              data_type               comment             

id                      int                                         
url                     string                                      

# Detailed Table Information             
Database:               default                  
Owner:                  hadoop                   
CreateTime:             Mon Dec 10 22:39:47 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://mycluster/user/hive/warehouse/data        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1544452787          

# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   

4、Create a table by specifying org.apache.hadoop.mapred.TextInputFormat explicitly

create table t_1(id int) 
STORED AS 
inputformat 'org.apache.hadoop.mapred.TextInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive (default)> desc formatted t_1;
OK
# col_name              data_type               comment             

id                      int                                         

# Detailed Table Information             
Database:               default                  
Owner:                  hadoop                   
CreateTime:             Mon Dec 10 22:49:11 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://mycluster/user/hive/warehouse/t_1         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1544453351          

# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1 

5、Create a table with stored as textfile

hive (default)> create table t_2 (id int) stored as textfile;
OK
Time taken: 0.21 seconds
hive (default)> desc formatted t_2;
OK
# col_name              data_type               comment             

id                      int                                         

# Detailed Table Information             
Database:               default                  
Owner:                  hadoop                   
CreateTime:             Mon Dec 10 22:57:04 CST 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://mycluster/user/hive/warehouse/t_2         
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1544453824          

# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1     

二、Row-oriented vs. column-oriented storage

1、Row-oriented storage

Data is laid out in the file row by row.

2、Column-oriented storage

One or more columns are stored together in a column group.

3、Pros and cons

Row-oriented storage:

Advantage:

1、Every row is guaranteed to sit in a single block. If a row has hundreds of columns and you run select *, a column store would have to read from hundreds of blocks to reassemble it.

Column-oriented storage:

Advantages (in most cases column-oriented storage is the one to use):

1、Suppose column C is int and column D is boolean. A row store can only store, and therefore compress, the whole row as a single string; a column store can compress the int column as ints, the boolean column as booleans, and so on, which gives a much higher compression ratio.
2、For a query like select C, D that only needs a few fields, a column store reads just the requested columns and never touches the others (as sketched below).
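
A minimal sketch of that second point (the table t_wide and its columns A/B/C/D are made up purely for illustration):

create table t_wide (A string, B string, C int, D boolean) stored as orc;

-- only the C and D column streams have to be read for this query;
-- a row-oriented layout would scan every complete row
select C, D from t_wide where D = true;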

三、Notes on each storage format

Note: the storage format (fileformat) and compression (Compression) are two completely different things, and they can be used together; different storage formats combined with different compression codecs produce different results.
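
For example, any of the insert-based loads below could be combined with an output compression codec, roughly like this (a sketch only; the GzipCodec choice and the table name page_views_text_gz are illustrative, and page_views_text is the textfile table created further down):

hive (default)> set hive.exec.compress.output=true;
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive (default)> create table page_views_text_gz like page_views_text;
hive (default)> insert overwrite table page_views_text_gz select * from page_views_text;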

The test data file, page_views.dat, is 19 M:
[hadoop@hadoop000 data]$ ll -h
total 27M
drwxrwxr-x. 2 hadoop hadoop 4.0K Dec  8 13:19 emp_page_views2
-rw-rw-r--. 1 hadoop hadoop  13K Dec  6 21:02 load
-rw-r--r--. 1 hadoop hadoop  19M Dec  2 09:29 page_views.dat
-rw-r--r--. 1 hadoop hadoop 8.7M Dec  2 09:29 page_views.dat.lzo

1、SEQUENCEFILE (row-oriented)

1) Format description

Uncompressed type: if compression is not enabled (the default), each record consists of its record length (in bytes), the key length, the key, and the value. The length fields are 4 bytes each.

create table page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as SEQUENCEFILE;

hive (default)> load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views_seq; 
Loading data to table default.page_views_seq
Failed with exception Wrong file format. Please check the file's format.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Because the table is in SEQUENCEFILE format, a plain text file cannot be loaded into it directly; the text file first has to be loaded into a textfile table.

Create the textfile table

create table page_views_text(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
stored as textfile;

load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table page_views_text; 

Insert the data from the textfile table into page_views_seq

hive (default)> insert into table page_views_seq select * from page_views_text;

Compare file sizes

page_views_text 18.1 M (the second column is the size across 3 replicas)
[hadoop@hadoop000 data]$ hdfs dfs -du -h /user/hive/warehouse/page_views_text/
18.1 M  54.4 M  /user/hive/warehouse/page_views_text/page_views.dat
page_views_seq 19.6 M (slightly larger than the text table: an uncompressed SequenceFile adds per-record length headers, keys, and sync markers on top of the raw data)
[hadoop@hadoop000 data]$ hdfs dfs -du -h /user/hive/warehouse/page_views_seq/                          
19.6 M  58.7 M  /user/hive/warehouse/page_views_seq/000000_1

2、RCFile

RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems.

A hybrid row/column layout: rows are first partitioned into row groups, and each row group is then stored column by column.

hive (default)> create table page_views_rc(
              > track_time string,
              > url string,
              > session_id string,
              > referer string,
              > ip string,
              > end_user_id string,
              > city_id string
              > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
              > stored as rcfile;
hive (default)> insert into page_views_rc select * from page_views_text;

Compare file sizes

page_views_rc 17.9 M (vs. 18.1 M for the text table)
[hadoop@hadoop000 ~]$ hdfs dfs -du -h /user/hive/warehouse/page_views_rc
17.9 M  53.8 M  /user/hive/warehouse/page_views_rc/000000_0

3、PARQUET

hive (default)> create table page_views_parq(
              > track_time string,
              > url string,
              > session_id string,
              > referer string,
              > ip string,
              > end_user_id string,
              > city_id string
              > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
              > stored as PARQUET;
hive (default)> insert overwrite table page_views_parq select * from page_views_text;

Compare sizes

[hadoop@hadoop000 ~]$ hdfs dfs -du -h /user/hive/warehouse/page_views_parq
13.1 M  39.3 M  /user/hive/warehouse/page_views_parq/000000_3

Compress the parquet output (gzip)

hive (default)> set parquet.compression=gzip;
hive (default)> create table page_views_parq_gzip(
              > track_time string,
              > url string,
              > session_id string,
              > referer string,
              > ip string,
              > end_user_id string,
              > city_id string
              > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
              > stored as parquet;
hive (default)> insert overwrite table page_views_parq_gzip select * from page_views_text;

Compare sizes

[hadoop@hadoop000 ~]$ hdfs dfs -du -h /user/hive/warehouse/page_views_parq_gzip
3.9 M  11.6 M  /user/hive/warehouse/page_views_parq_gzip/000000_1
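
Depending on the Hive version, the Parquet codec can also be declared on the table itself rather than through a session setting (a sketch; page_views_parq_snappy is a hypothetical table name, and you should verify that the parquet.compression table property is honored on your release):

create table page_views_parq_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) stored as parquet
tblproperties ("parquet.compression"="SNAPPY");

insert overwrite table page_views_parq_snappy select * from page_views_text;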

4、ORC

hive (default)> create table page_views_orc(
              > track_time string,
              > url string,
              > session_id string,
              > referer string,
              > ip string,
              > end_user_id string,
              > city_id string
              > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
              > stored as ORC;
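
The insert step is not shown in the log above; as with the previous formats, the table would have been populated from the textfile staging table:

hive (default)> insert overwrite table page_views_orc select * from page_views_text;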

Compare sizes

[hadoop@hadoop000 ~]$ hdfs dfs -du -h /user/hive/warehouse/page_views_orc
2.8 M  8.3 M  /user/hive/warehouse/page_views_orc/000000_3

ORC default table properties

Key                       Default       Notes
orc.compress              ZLIB          high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144       number of bytes in each compression chunk
orc.stripe.size           67,108,864    number of bytes in each stripe
orc.row.index.stride      10,000        number of rows between index entries (must be >= 1000)
orc.create.index          true          whether to create row indexes
orc.bloom.filter.columns  ""            comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp      0.05          false positive probability for bloom filter (must be > 0.0 and < 1.0)
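
Any of these defaults can be overridden per table with TBLPROPERTIES; for example, switching the ORC codec from the default ZLIB to SNAPPY (a sketch, with page_views_orc_snappy as a hypothetical table name):

create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) stored as ORC
tblproperties ("orc.compress"="SNAPPY");

insert overwrite table page_views_orc_snappy select * from page_views_text;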

四、Data size and query speed (count) comparison for each format

My machine has limited resources, so the tests below are for reference only. If you plan to use any of this in production, run the tests on your own cluster; only numbers you have measured yourself really fit your workload. These days ORC and PARQUET are what almost everyone uses.

0、page_views_seq

HDFS Read: 20569617
Time taken: 20.363 seconds

hive (default)> select count(*) from page_views_seq;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0019, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0019/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:29:58,286 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:30:04,499 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.14 sec
2018-12-11 22:30:09,728 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.4 sec
MapReduce Total cumulative CPU time: 4 seconds 400 msec
Ended Job = job_1544522872929_0019
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.4 sec   HDFS Read: 20569617 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 400 msec
OK
100000
Time taken: 20.363 seconds, Fetched: 1 row(s)

1、page_views_text

HDFS Read: 19097125
Time taken: 20.572 seconds

hive (default)> select count(*) from page_views_text;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0014, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0014/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:16:25,360 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:16:30,717 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec
2018-12-11 22:16:35,997 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.27 sec
MapReduce Total cumulative CPU time: 3 seconds 270 msec
Ended Job = job_1544522872929_0014
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.27 sec   HDFS Read: 19097125 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 270 msec
OK
100000
Time taken: 20.572 seconds, Fetched: 1 row(s)

2、page_views_rc

HDFS Read: 17749891
Time taken: 21.704 seconds

hive (default)> select count(*) from page_views_rc;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0015, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0015/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:19:42,332 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:19:47,538 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.81 sec
2018-12-11 22:19:54,841 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.05 sec
MapReduce Total cumulative CPU time: 3 seconds 50 msec
Ended Job = job_1544522872929_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.05 sec   HDFS Read: 17749891 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
100000
Time taken: 21.704 seconds, Fetched: 1 row(s)

3、page_views_orc

HDFS Read: 23489
Time taken: 25.52 seconds
(The HDFS Read figure is tiny because ORC keeps row counts in its file and stripe metadata, so count(*) only has to read the footers rather than the actual data; the same applies to the Parquet tables below.)

hive (default)> select count(*) from page_views_orc;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0016, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0016/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:21:47,927 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:21:56,325 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.14 sec
2018-12-11 22:22:04,608 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.52 sec
MapReduce Total cumulative CPU time: 4 seconds 520 msec
Ended Job = job_1544522872929_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.52 sec   HDFS Read: 23489 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 520 msec
OK
100000
Time taken: 25.52 seconds, Fetched: 1 row(s)

4、page_views_parq

HDFS Read: 8994
Time taken: 20.349 seconds

hive (default)> select count(*) from page_views_parq;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0017, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0017/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:24:34,635 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:24:42,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
2018-12-11 22:24:49,100 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.09 sec
MapReduce Total cumulative CPU time: 3 seconds 90 msec
Ended Job = job_1544522872929_0017
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.09 sec   HDFS Read: 8994 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 90 msec
OK
100000
Time taken: 20.349 seconds, Fetched: 1 row(s)

5、page_views_parq_gzip

HDFS Read: 9025
Time taken: 25.399 seconds

hive (default)> select count(*) from page_views_parq_gzip;
Query ID = hadoop_20181211221313_1966ce78-83a9-4b4c-bc93-bf7c813575e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1544522872929_0018, Tracking URL = http://hadoop001:8088/proxy/application_1544522872929_0018/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1544522872929_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-12-11 22:26:11,779 Stage-1 map = 0%,  reduce = 0%
2018-12-11 22:26:19,130 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.53 sec
2018-12-11 22:26:27,419 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.79 sec
MapReduce Total cumulative CPU time: 3 seconds 790 msec
Ended Job = job_1544522872929_0018
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.79 sec   HDFS Read: 9025 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 790 msec
OK
100000
Time taken: 25.399 seconds, Fetched: 1 row(s)

五、The evolution of ORC

Source:
https://zh.hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/


