SQL是结构化查询语言(Structured Query Language)的简称,是一种数据库查询和程序设计语言,用来访问和操作数据库系统
数据分析中SQL是必不可少的一项技能,因为需要从数据库中通过SQL来获取需要的数据进行分析,现在数据分析软件如Python,R,SAS等都支持SQL命令 之前对数据表格处理工作中常用R的dplyr包,由最受欢迎的R程序员Hadley Wickham编写;当开始学习SQL后发现,dplyr中的很多函数跟SQL中的操作符极为相似,这是因为:
dplyr 融合了很多在 SQL 数据库中对数据操作的思想, 使得对于数据表 (tibble) 的操作有逻辑且一致. 这样的好处在于, 如果是取用数据库中的分析, 那么所采取的操作步骤并不会有太大的改变, 使用 dbplyr 包就可以方便的用同样的函数去处理数据库中的表格
结合网上教程以及datacamp实际操作教程Intro to SQL for Data Science整理下常用SQL语法
Selecting columns
SELECT 用于从表中选取数据,*
代表所有列
SELECT * FROM people
SELECT name, birthdate FROM people
DISTINCT 用于返回唯一不同的值,可多列
SELECT DISTINCT language FROM films
COUNT 用于返回指定列或多列的值的数目
SELECT COUNT(*) FROM people
SELECT COUNT(DISTINCT language) FROM films
Filtering rows
WHERE 操作符常见的有:
=
equal<>
not equal<
less than>
greater than<=
less than or equal to>=
greater than or equal to
作为过滤的方式:
select count(*) from films where release_year < 2000
select name,birthdate from people where birthdate = '1974-11-11'
SELECT title FROM films WHERE (release_year = 1994 OR release_year = 1995) AND (certification = 'PG' OR certification = 'R')
BETWEEN 操作符在 WHERE 子句中使用,作用是选取介于两个值之间的数据范围
select title,release_year from films where release_year between '1990' and '2000'
IN 操作符在WHERE中使用可以代替多个OR语句
select title,release_year from films where release_year in (1990,2000) and duration > 120
NULL 操作符可以判断是否为空,NOT NULL
则反之
select name from people where deathdate is null
LIKE 操作符用于搜索列中的指定模式,NOT LIKE
则反之
select name from people where name like 'B%'
select name from people where name like '_r%'
select name from people where name not like 'A%'
Aggregate functions
aggregate 函数,比如AVG 函数返回数值列的平均值,MAX 函数返回一列中的最大值,SUM 函数返回数值列的总和
select SUM(gross) from films where release_year >= 2000
select AVG(gross) from films where title like 'A%'
select MIN(gross) from films where release_year=1994
select MAX(gross) from films where release_year between 2000 and 2012
除了聚合函数外,还可以使用基本的算术函数,如:
select (10 / 3)
# 3
使用AS来进行重命名,即指定别名,可以对列名,也可对表名
select title, gross - budget as net_profit from films
select AVG(duration) / 60.0 as avg_duration_hours from films
对于除法等运算,如果保证结果是有小数点的话,则需先保证参与运算的数字也有小数点,可以看下面两个例子的区别:
select 45 / 10 * 100.0;
select 45 * 100.0 / 10
select count(deathdate) * 100.0 / count(*) as percentage_dead from people
Sorting grouping and joins
ORDER BY 主要用于一列或者多列数据进行排序,多列的话用,
分割
select title from films where release_year in (2000, 2012) order by release_year
select title,gross from films where title like 'M%' order by title,gross
默认升序,降序则需要加DESC
select imdb_score,film_id from reviews order by imdb_score desc
GROUP BY 主要用于group一列或者多列数据以便后续的统计合计
select release_year,count(*) from films group by release_year
select release_year,country,MAX(budget) from films group by release_year,country order by release_year,country
由于WHERE语句中无法使用aggregate函数,所以需要用HAVING语句代替
select count(*) from (
select release_year from films
group by release_year
having count(title) > 200
) as data
select release_year,AVG(budget) as avg_budget,AVG(gross) as avg_gross from films
where release_year > 1990
group by release_year
having AVG(budget)>60000000
order by avg_gross desc
LIMIT 可以限制返回的行数
select country, AVG(budget) as avg_budget, AVG(gross) as avg_gross
from films
group by country
having count(title)>10
order by country
limit 5
如果从多个表中获得结果,那么需要用到join函数,不同表通过主键(Primary Key)联系在一起,连接方法有以下几种(join是inner join的缩写):
select title,imdb_score
from films join reviews on films.id = reviews.film_id
where title = 'To Kill a Mockingbird'
- inner join(内连接): 返回key匹配上的行
- left join: 除非返回key匹配的行外,没有左表中没有匹配的行也返回
- right join: 类似于上述left join,就是把右表的数据全部返回,不管左表是否匹配上
- full join: 把左右表的数据全部返回,不管是否匹配
上述只是简单罗列了初步接触SQL中常用的语句,详细的介绍可结合《SQL基础教程》书籍
本文出自于http://www.bioinfo-scrounger.com转载请注明出处