通過八爪魚軟件搬男,爬取了前程無憂網(wǎng)和獵聘網(wǎng)關(guān)鍵詞“數(shù)據(jù)分析”的全國范圍內(nèi)1月內(nèi)發(fā)布的企業(yè)職位撒穷。
本次分析源數(shù)據(jù)的職位發(fā)布日期:2019年7月20日-2019年7月30日
清洗完成后,用于分析的數(shù)據(jù)2875條
數(shù)據(jù)清洗
數(shù)據(jù)結(jié)構(gòu)
爬取到的原始數(shù)據(jù)有X條邢羔,使用MySQL對(duì)爬取的原始數(shù)據(jù)進(jìn)行清洗。
原始數(shù)據(jù)字段:
id int(255):序號(hào)
jobtitle VARCHAR(255) :職位名稱
company VARCHAR(255):公司名稱
salary VARCHAR(255):薪酬
position VARCHAR(255):工作地
qualification VARCHAR(255):職位要求,包括學(xué)歷囚似、工作經(jīng)驗(yàn)、語言线得、年齡
tag VARCHAR(255):職位標(biāo)簽饶唤,包括五險(xiǎn)一金、年底雙薪贯钩、節(jié)日福利等
industry VARCHAR(255):行業(yè)
size VARCHAR(255):公司規(guī)模募狂,人員數(shù)
職位名稱
刪除空值:
DELETE
FROM
liepin
WHERE
jobtitle IS NULL;
刪除與數(shù)據(jù)分析無關(guān)的職位:
DELETE
FROM
51job
WHERE
jobtitle NOT LIKE '%數(shù)據(jù)%分析%'
AND jobtitle NOT LIKE '%大數(shù)據(jù)%'
AND jobtitle NOT LIKE '%數(shù)據(jù)%運(yùn)營%'
AND jobtitle NOT LIKE '%DBA%';
薪酬數(shù)據(jù)
將薪酬算為平均年薪:
--提取 salary,獲得 min - max - average salary
--新建列
ALTER TABLE liepin ADD ( min_salary FLOAT ( 10 ), max_salary FLOAT ( 10 ), average_salary FLOAT ( 10 ) );
--when `salary` LIKE '%面議%',min / max / average = 0
UPDATE liepin
SET min_salary = 0,
max_salary = 0,
average_salary = 0
WHERE
salary LIKE '%面議%';
--salary不是面談時(shí)角雷,salary格式 'min-max萬' --min_salary = min
UPDATE liepin
SET min_salary = SUBSTRING_INDEX( salary, '-', 1 )
WHERE
salary LIKE '%-%萬%';
--max_Salary = max
UPDATE liepin
SET max_salary = (
SUBSTRING( SUBSTRING_INDEX( salary, '萬', 1 ), LOCATE( '-', SUBSTRING_INDEX( salary, '萬', 1 ) ) + 1 )
)
WHERE
salary LIKE '%-%萬%';
--average_salary = ( min + max ) / 2
UPDATE liepin
SET average_salary = ( min_salary + max_salary ) / 2
WHERE
salary LIKE '%-%萬%';
工作地域
將工作地域分成兩個(gè)字段祸穷,分成城市和區(qū)域,例如:字段1:北京? 字段2:朝陽區(qū)
ALTER TABLE liepin ADD (
position1 VARCHAR(255),
position2 VARCHAR(255)
);
UPDATE liepin SET position1 = position
WHERE position NOT LIKE '%-%';
UPDATE liepin SET position2 = 'null'
WHERE position NOT LIKE '%-%';
UPDATE liepin SET position1 = SUBSTRING_INDEX(`position`,'-',1)
WHERE position LIKE '%-%';
UPDATE liepin SET position2 = SUBSTRING(`position`,LOCATE('-',`position`)+1)
WHERE position LIKE '%-%';
職位要求
把職位要求分成學(xué)歷勺三、經(jīng)驗(yàn)雷滚、年齡和語言:
ALTER TABLE liepin ADD ( education VARCHAR ( 255 ), experience VARCHAR ( 255 ), age VARCHAR ( 255 ), LANGUAGE VARCHAR ( 255 ) );
--學(xué)歷
UPDATE liepin
SET education = substring_index( qualification, ' ', 1 );
--工作經(jīng)驗(yàn)
UPDATE liepin
SET experience = SUBSTRING( SUBSTRING_INDEX( qualification, ' ', 37 ), - 5 );
--年齡要求
UPDATE liepin
SET age = substring_index( qualification, ' ',- 1 );
--工作語言
UPDATE liepin
SET LANGUAGE = SUBSTRING( SUBSTRING_INDEX( qualification, ' ', 73 ), - 5 );
職位標(biāo)簽
將職位標(biāo)簽分開存入不同字段:
ALTER TABLE liepin ADD ( label1 VARCHAR ( 255 ), label2 VARCHAR ( 255 ) );
UPDATE liepin
SET label1 = SUBSTRING_INDEX( tag, ' ', 1 );
UPDATE liepin
SET label2 = SUBSTRING_INDEX( tag, ' ', - 1 );
行業(yè)
將行業(yè)簡化劃分:
ALTER TABLE liepin ADD ( industry1 VARCHAR ( 255 ) );
UPDATE liepin
SET industry1 = '其他'
WHERE
industry IS NOT NULL;
UPDATE liepin
SET industry1 = '互聯(lián)網(wǎng)/電子商務(wù)'
WHERE
industry LIKE '%互聯(lián)網(wǎng)%';
UPDATE liepin
SET industry1 = '投資'
WHERE
industry LIKE '%投資%';
UPDATE liepin
SET industry1 = '計(jì)算機(jī)軟件'
WHERE
industry LIKE '%計(jì)算機(jī)%';
UPDATE liepin
SET industry1 = 'IT服務(wù)'
WHERE
industry LIKE '%IT%';
UPDATE liepin
SET industry1 = '咨詢'
WHERE
industry LIKE '%咨詢%';
UPDATE liepin
SET industry1 = '保險(xiǎn)'
WHERE
industry LIKE '%保險(xiǎn)%';
UPDATE liepin
SET industry1 = '銀行'
WHERE
industry LIKE '%銀行%';
UPDATE liepin
SET industry1 = '服裝'
WHERE
industry LIKE '%服裝%';
UPDATE liepin
SET industry1 = '通信'
WHERE
industry LIKE '%通信%';
UPDATE liepin
SET industry1 = '食品'
WHERE
industry LIKE '%食品%';
UPDATE liepin
SET industry1 = '百貨'
WHERE
industry LIKE '%百貨%';
UPDATE liepin
SET industry1 = '游戲'
WHERE
industry LIKE '%游戲%';
UPDATE liepin
SET industry1 = '房地產(chǎn)'
WHERE
industry LIKE '%房地產(chǎn)%';
公司規(guī)模
將公司規(guī)模統(tǒng)一為:1-49人、50-99人吗坚、...祈远、10000人以上
ALTER TABLE liepin ADD ( size1 VARCHAR ( 255 ) );
UPDATE liepin
SET size1 = SUBSTRING_INDEX( size, ':',- 1 )
WHERE
size LIKE '%人%';
存入新表
將清洗過的,需要用來分析的數(shù)據(jù)存入新表
CREATE TABLE data (
id INT ( 255 ) auto_increment PRIMARY KEY,
jobtitle VARCHAR ( 255 ),
company VARCHAR ( 255 ),
average_salary FLOAT ( 10 ),
position1 VARCHAR ( 255 ),
position2 VARCHAR ( 255 ),
education VARCHAR ( 255 ),
experience VARCHAR ( 255 ),
age VARCHAR ( 255 ),
language VARCHAR ( 255 ),
industry1 VARCHAR ( 255 ),
size1 VARCHAR ( 255 ) ,
label1 VARCHAR ( 255 ),
label2 VARCHAR ( 255 )
);
ALTER TABLE data auto_increment = 1;
INSERT INTO data ( jobtitle, company, average_salary, position1, position2, education, experience, age, LANGUAGE, industry1, size1, label1, label2 ) SELECT DISTINCT
jobtitle,
company,
average_salary,
position1,
position2,
education,
experience,
age,
language,
industry1,
size1,
label1,
label2
FROM
liepin;
數(shù)據(jù)清洗部分結(jié)束商源。