Hive Learning Path: Hive DQL

VII. Hive DQL

1. Syntax

select [distinct|all] expr, expr, ...
from tbName
[where whereCondition]
[group by colList]
[having havingCondition]
[order by colList]
[cluster by colList | [distribute by colList] [sort by colList]]
[limit num]

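2. The where clause

is null
is not null
between ... and ...
in (v1, v2, v3, ...) | not in (v1, v2, v3, ...)
and
or
like | rlike (rlike matches against a Java regular expression)
%  wildcard in like patterns: matches any sequence of zero or more characters
_  placeholder in like patterns: matches exactly one character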
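The operators above can be sketched against the order_tb table (oid, num, pname, uid) that section 3.1 creates. The literal values (user ids, product-name patterns) are made up for illustration and are not taken from any real data set.

```sql
-- Assumes the order_tb table (oid, num, pname, uid) from section 3.1.
-- All literal values below are illustrative.

-- rows whose product name is missing
select * from order_tb where pname is null;

-- num in an inclusive range, restricted to a few users
select * from order_tb where num between 1 and 5 and uid in (10001, 10002);

-- like: % matches any run of characters, _ matches exactly one character
select * from order_tb where pname like 'app%';
select * from order_tb where pname like '_pple';

-- rlike: Java regular expression, here "starts with a or b"
select * from order_tb where pname rlike '^[ab]';
```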

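3. The group by clause

Overview:

Groups rows by the values of the given fields; rows with equal values are placed in the same group.

Usually used together with aggregate functions.

SQL example:

select c1, c2

from tbName

where condition  --> runs on the map side

group by c1, c2  --> runs on the reduce side

having condition  --> runs on the reduce side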

3.1 Example:

1. Create the tables
create table order_tb(oid int, num int, pname string, uid int)
row format delimited fields terminated by ',';

Load data:
load data local inpath '/home/zhangsan/order.txt' into table order_tb;

create table user_tb(uid int, name string, addr string)
row format delimited fields terminated by ',';

Load data:
load data local inpath '/home/zhangsan/user.txt' into table user_tb;

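2. Notes
Every non-aggregated column in the select list must also appear in group by.
Tables and columns can be given aliases with as.

Differences between having and where:
Position in the statement, the fields the condition can filter on, and the unit of data filtered (where: one row at a time; having: one group at a time).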

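3. Query the order table grouped by user id
select o.uid from order_tb as o group by o.uid;

Using an aggregate (grouping) function: count each user's purchases (count(o.num) counts the order rows whose num is not null)
select o.uid, count(o.num) from order_tb as o group by o.uid;

Using an aggregate (grouping) function: count each user's purchases and show only the counts greater than 2
select o.uid, count(o.num) from order_tb as o group by o.uid having count(o.num) > 2;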
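To make the where/having distinction concrete, here is a small sketch that uses both in one query. It assumes the order_tb table from step 1; the thresholds and the order_cnt alias are arbitrary choices for illustration.

```sql
-- Assumes order_tb from step 1; thresholds and alias are illustrative.
select o.uid, count(o.oid) as order_cnt
from order_tb as o
where o.num > 1            -- row-level filter, applied before grouping
group by o.uid
having count(o.oid) > 2;   -- group-level filter, applied after aggregation
```

The where condition can only see columns of a single row, while the having condition is evaluated once per group and can therefore use aggregates.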

4. Common aggregate functions
count()
max()
min()
sum()
avg()
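A minimal sketch that applies all five functions per user, assuming the order_tb table from step 1. The column aliases are hypothetical names chosen for readability.

```sql
-- Assumes order_tb (oid, num, pname, uid); aliases are illustrative.
select o.uid,
       count(*)   as order_cnt,  -- number of order rows in the group
       sum(o.num) as total_num,  -- total of num across the group
       max(o.num) as max_num,
       min(o.num) as min_num,
       avg(o.num) as avg_num
from order_tb as o
group by o.uid;
```

Note that count(*) counts every row in the group, whereas count(col) skips rows where col is null.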

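4. The order by clause

Overview

Sorts the result by the given fields.

Behavior: order by can sort on multiple columns and produces a global ordering; the sort runs in the reduce phase, with only one reducer.

Problem: when the data volume is large, that single reducer has to sort everything.

Global sort: order by

Direction: asc|desc (asc is the default)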

Example

select * from order_tb order by num desc;
select * from order_tb order by num, oid asc; -- sort on multiple fields
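Because a global sort funnels all rows through one reducer, order by is often paired with limit so that only the top rows are returned; when Hive runs in strict mode it rejects an order by that has no limit. A sketch, assuming order_tb from section 3.1 and an arbitrary limit of 10:

```sql
-- Each sort key takes its own direction; the limit of 10 is illustrative.
select * from order_tb order by num desc, oid asc limit 10;
```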

5. sort by (sorting)

Sorts inside MapReduce, within each reducer; it is not a global sort.

1. Set the number of reducers
Otherwise the default is used
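As a sketch of how the reducer count interacts with sort by: the property name below is the Hadoop 2+ one (older releases used mapred.reduce.tasks), and the value 3 is an arbitrary assumption, not a recommendation.

```sql
-- 3 is an arbitrary value chosen for illustration.
set mapreduce.job.reduces=3;

-- Each reducer sorts only the rows it receives by num,
-- so the combined output is not globally ordered.
select * from order_tb sort by num desc;
```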

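6. distribute by (partitioning)

The partition step of MapReduce: distribute by decides which reducer each row goes to, and is used together with sort by.

The distribute by clause must be written before sort by (partition first, then sort).

Example: when the SQL is translated to MapReduce, the oid field is used for partitioning and the num field for sorting
insert overwrite local directory '/home/zhangsan/distributebyResult'
select * from order_tb distribute by oid sort by num desc;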

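7. cluster by (partition and sort)

insert overwrite local directory '/home/zhangsan/clusterbyResult'
select * from order_tb cluster by num;

Note: cluster by applies when the partitioning field and the sorting field are the same.
The sort direction cannot be specified.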
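Put differently, cluster by is shorthand for distribute by plus sort by on the same field. A short sketch, assuming the order_tb table from section 3.1:

```sql
-- These two queries partition and sort rows the same way.
select * from order_tb cluster by num;
select * from order_tb distribute by num sort by num;

-- For a descending sort within each reducer, spell the clauses out.
select * from order_tb distribute by num sort by num desc;
```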

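8. The join clause (multi-table join queries) and subqueries

Overview

Two tables m and n can be joined on the on condition: a record from m and a record from n are combined into a new record.

join, equi-join (inner join): rows are joined only when the value exists in both m and n.

left outer join: every row of the left table is output whether or not it has a match in the right table; rows of the right table are output only when they match a row in the left table.

right outer join: the reverse.

mapjoin (join completed on the map side): the join is done in memory on the map side (an optimization).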

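Example

Equi-join:

select * from user_tb u join order_tb o; -- Cartesian product

select u.name, o.num, o.pname from user_tb u join order_tb o on u.uid = o.uid; -- the on condition avoids the Cartesian product

Left outer join (the counterpart of the right outer join):

insert into user_tb values(10004, 'xiaoliu', 'beijing');

select u.name, o.num, o.pname from user_tb u
left join order_tb o on u.uid = o.uid;

Note: each join between two tables corresponds to one MapReduce job.

If a SQL statement joins several tables, several jobs are created, and they run in the order of the joins.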
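The overview also mentions right outer join and mapjoin, which the examples above do not show. A hedged sketch of both, assuming the user_tb and order_tb tables from section 3.1; whether the mapjoin hint is honored depends on the Hive version and configuration (for example, hive.auto.convert.join can make Hive choose a map-side join on its own).

```sql
-- Right outer join: every order is kept; u.name is null when no user matches.
select u.name, o.num, o.pname
from user_tb u
right outer join order_tb o on u.uid = o.uid;

-- Map-side join: the hint asks Hive to hold user_tb (alias u) in memory.
select /*+ mapjoin(u) */ u.name, o.num, o.pname
from user_tb u
join order_tb o on u.uid = o.uid;
```

A map-side join only makes sense when the hinted table is small enough to fit in memory on each mapper.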

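Subqueries

Nested SQL queries

select * from (select * from …) t
select * from tbName where col = (select * from …)

In a nested SQL statement, the result of the inner query can serve as the data source of the outer query or as its condition.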

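Compute the distribution of car types (car_type below stands for the car-type column of cart):

commercial  n1/count

personal  n2/count

hive:

select car_type, count(*) from cart group by car_type;

select count(*) from cart;

select t.car_type, t.type_count / (select count(*) from cart) from
(select car_type, count(*) as type_count from cart group by car_type) t;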
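Not every Hive version accepts a scalar subquery in the select list, so the same ratio can also be sketched by joining the per-type counts with a one-row total. The names car_type, type_count, total and ratio are assumptions made for illustration; the source does not give the real column names of cart.

```sql
-- car_type, type_count, total and ratio are hypothetical names.
select t.car_type,
       t.type_count / c.total as ratio   -- / yields a fractional result in Hive
from (select car_type, count(*) as type_count
      from cart
      group by car_type) t
cross join (select count(*) as total from cart) c;
```

The cross join is harmless here because the right side has exactly one row, although a strict-mode configuration may still need to be relaxed to allow it.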
