1. How Hive reads files

1. Hive reads the file through an InputFormat object. The default is <org.apache.hadoop.mapred.TextInputFormat>, which returns the data one line at a time.

2. Hive then uses a SerDe class, by default <org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe>, to split each line into fields that map to the columns of the table.
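The two stages above can be imitated in plain Java. This is a minimal sketch and not Hive's implementation: `ReadPipelineSketch`, `readLines`, and `splitFields` are hypothetical names, and `'|'` is used as the delimiter only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReadPipelineSketch {
    // Stage 1 stand-in: hand back the content one line at a time,
    // the way a line-oriented InputFormat does.
    static List<String> readLines(String fileContent) {
        return Arrays.asList(fileContent.split("\n"));
    }

    // Stage 2 stand-in: cut one line into fields on a single-character delimiter.
    static List<String> splitFields(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        for (String line : readLines("01|zhangsan\n02|lisi\n")) {
            System.out.println(splitFields(line, '|'));
        }
    }
}
```

Each list printed by `main` corresponds to one row, and each element to one column value.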

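2. Problem: by default the SerDe only supports splitting on a single-character delimiter. If the delimiter consists of multiple characters, it can be handled in one of the following ways.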
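To see why a single-character split is not enough, consider a line delimited by `||`. The sketch below assumes that only a single `|` is treated as the delimiter; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SingleCharSplitProblem {
    // Splits on a single '|' only, keeping empty fields (limit -1).
    static List<String> splitOnPipe(String line) {
        return Arrays.asList(line.split("\\|", -1));
    }

    public static void main(String[] args) {
        // The two-character delimiter is seen as two separate delimiters,
        // so an empty field appears between them.
        System.out.println(splitOnPipe("01||zhangsan")); // prints [01, , zhangsan]
    }
}
```

With a two-column table, the second column would receive the empty field instead of the name.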

1. Use RegexSerDe to extract the fields with a regular expression

```
create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
	'input.regex'='(.*)\\|\\|(.*)',
	'output.format.string'='%1$s %2$s'
)
stored as textfile;

hive>load data local inpath '/root/hivedata/bi.dat' into table t_bi_reg;
hive>select * from t_bi_reg;
```
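The effect of the regular expression can be checked outside Hive with `java.util.regex`. This is a sketch only: `RegexExtractSketch` and `extract` are hypothetical names, and returning `null` for a non-matching line is a choice made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractSketch {
    // The expression (.*)\|\|(.*) written as a Java string literal.
    static final Pattern LINE = Pattern.compile("(.*)\\|\\|(.*)");

    // Returns {group 1, group 2}, or null when the whole line does not match.
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] row = extract("01||zhangsan");
        System.out.println(row[0] + " " + row[1]); // prints 01 zhangsan
    }
}
```

Because the first `(.*)` is greedy, a line that contains `||` more than once is cut at the last occurrence: `a||b||c` yields `a||b` and `c`.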

2. Handle it with a custom InputFormat class.

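How it works: as the InputFormat reads the data, it converts each multi-character delimiter in the line into a single character, so the line can then be split on that single character.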
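The idea can be tried on plain strings before involving Hadoop. This is a minimal sketch with hypothetical names (`DelimiterNormalizeSketch`, `normalize`, `fields`); it assumes `||` is the multi-character delimiter and `|` the single-character replacement.

```java
class DelimiterNormalizeSketch {
    // Step 1: rewrite the multi-character delimiter as a single character.
    static String normalize(String line) {
        return line.replace("||", "|");
    }

    // Step 2: an ordinary single-character split now yields the fields.
    static String[] fields(String line) {
        return normalize(line).split("\\|", -1);
    }

    public static void main(String[] args) {
        String[] row = fields("01||zhangsan");
        System.out.println(row[0] + " " + row[1]); // prints 01 zhangsan
    }
}
```

One caveat of this approach: if a field value itself contains a single `|`, it is split as well, so the replacement character should be one that never occurs in the data.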

Custom class:

```java
package cn.itcast.bigdata.hive.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {

	@Override
	public RecordReader<LongWritable, Text> getRecordReader(
			InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
		reporter.setStatus(genericSplit.toString());
		// Wrap the standard line reader so every line can be rewritten before Hive sees it.
		MyDemoRecordReader reader = new MyDemoRecordReader(
				new LineRecordReader(job, (FileSplit) genericSplit));
		return reader;
	}

	public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {
		LineRecordReader reader;
		Text text;

		public MyDemoRecordReader(LineRecordReader reader) {
			this.reader = reader;
			text = reader.createValue();
		}

		@Override
		public void close() throws IOException {
			reader.close();
		}

		@Override
		public LongWritable createKey() {
			return reader.createKey();
		}

		@Override
		public Text createValue() {
			return new Text();
		}

		@Override
		public long getPos() throws IOException {
			return reader.getPos();
		}

		@Override
		public float getProgress() throws IOException {
			return reader.getProgress();
		}

		@Override
		public boolean next(LongWritable key, Text value) throws IOException {
			if (reader.next(key, text)) {
				// Essentially the TextInputFormat logic plus one extra replacement step:
				// turn the two-character delimiter "||" into the single character "|".
				String strReplace = text.toString().replaceAll("\\|\\|", "|");
				value.set(strReplace);
				return true;
			}
			return false;
		}
	}
}
```

3. Package this class into a jar, put it in the lib folder under the Hive installation directory, and add it in the Hive session:
```
hive>add jar /root/apps/hive/lib/myinput.jar;
```

4. Usage:

Create the table with the following statement:

```
hive> create table t_bi(id string,name string)
    > row format delimited
    > fields terminated by '|'
    > stored as inputformat 'cn.itcast.bigdata.hive.inputformat.BiDelimiterInputFormat'
    > outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> load data local inpath '/root/hivedata/bi.dat' into table t_bi;
hive> select * from t_bi;
OK
01 zhangsan
02 lisi
```

Reprinted from: https://my.oschina.net/liufukin/blog/798534
