Over several decades of development, rock music has branched into many styles and genres: from blues to the British Invasion, then punk, disco, indie rock, and so on. The evolution looks roughly like this:
Rock listeners know the joy of discovering buried treasure: you stumble upon a song, band, or style you have never heard before, it sounds fantastic, and you wonder how you missed it all this time. For the next while you immerse yourself in it, listening every day to the major bands and albums of that style. A user's listening over a stretch of time often has a "theme" — geographic (Russian rock bands), temporal (great albums after 2000), a particular genre, or even songs that all appeared as BGM in the same film or show. I had rarely listened to Chinese rock, but last year, after hearing many works by domestic bands such as Hedgehog, P.K.14, Re-TROS, New Pants, and Hiperson, I learned there is plenty of excellent Chinese rock beyond Cui Jian, Dou Wei, Omnipotent Youth Society, and Xie Tianxiao.
This kind of co-occurrence song list — "songs a user plays together during some period" — takes the form of playlists or radio in music apps, curated by editors or users. The "album" is another very strong association, especially for albums like The Dark Side of the Moon. However, in the previous articles in this series, the core data structure was the user-item matrix, which carries no notion of "a period of time". Such a period can be called a session; in other domains, a session might be a research paper on graphene, or the clicks an Airbnb user makes within 30 minutes one day while hunting for Hawaii rentals. Taking co-occurrence within sessions into account lets us make recommendations that better fit the user's current context.
This article uses Word2vec on the Last.fm 1K dataset to build exactly this kind of session-aware song co-occurrence model.
Word2vec and music recommendation
Word2vec was originally proposed in natural language processing (NLP) to represent a word with a low-dimensional dense vector (called an embedding), and then to study relationships between words via those embeddings. It trains a neural network with a single hidden layer on data split into sentences, learning word co-occurrence relationships. Training comes in two flavors, CBOW (Continuous Bag-of-Words) and Skip-gram; here is a quick sketch of how Skip-gram produces embeddings.
Suppose we have some sentences as the dataset and need to generate training samples for the network. Define a window size, say 2. For the sentence "shine on you crazy diamond", slide the window from left to right, producing a series of word pairs as shown below. Each word pair is one training sample: the first word is the input, the second is the label.
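As a sketch of that sliding-window step (a hypothetical helper, not part of the article's pipeline):

```python
def skipgram_pairs(sentence, window=2):
    # Slide a window over the sentence and pair each center word with every
    # neighbor inside the window; each (input, label) pair is one training sample.
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

pairs = skipgram_pairs('shine on you crazy diamond', window=2)
# 'shine' pairs with 'on' and 'you', but not with 'crazy' (distance 3 > window).
```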
Suppose the corpus contains 10000 distinct words. A word enters the network as a one-hot vector (10000-dimensional); the output is also a 10000-dimensional vector, where the number in each dimension represents how likely it is that the word whose one-hot vector has a 1 in that position appears near the input word:
The input and output layers have one node per word in the corpus, while the number of hidden-layer nodes equals the dimensionality of each word vector. Every input node connects to every hidden node, each connection carrying a weight; for a given input node, the vector formed by all of its weights into the hidden layer is the embedding of the word whose one-hot vector has a 1 at that node.
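This "weights as embedding" view is just a row lookup: multiplying a one-hot vector by the input-to-hidden weight matrix selects one row. A tiny numpy check, with toy sizes standing in for 10000 × 50:

```python
import numpy as np

vocab_size, embedding_dim = 10, 4          # toy stand-ins for 10000 and 50
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # input-to-hidden weights

word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# One-hot times the weight matrix picks out exactly one row of W,
# which is that word's embedding vector.
embedding = one_hot @ W
```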
The model knows nothing about semantics, though — all it has is statistics. And since a one-hot vector can identify a word, it can just as well identify a song, a band, a product, a rental listing, or anything else recommendable. Feed such data to the model, train in exactly the same way, and you obtain embeddings for all sorts of items whose relationships you can then study. This is Item2vec: everything can be embedded.
So the connection between Word2vec and music recommendation is this: treat a playlist, or the songs a user plays in a row during one afternoon, as a sentence (a session); treat each song as a word; then hand this data to the model to obtain an embedding vector for every song. The abstraction from playlist to sentence is exactly how the "period of time" from the introduction gets incorporated.
Loading the data
The dataset records the listening activity of 1K users over 960K songs: 19.15 million lines, 2.4 GB, each line stating that some user played some song at some time. As before, pandas loads the data; this time we also need the timestamp.
import arrow
from tqdm import tqdm
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix, diags
df = pd.read_csv('~/music-recommend/dataset/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv',
                 sep='\t',
                 header=None,
                 names=['user_id', 'timestamp', 'artist_id', 'artist_name', 'track_id', 'track_name'],
                 usecols=['user_id', 'timestamp', 'track_id', 'artist_name', 'track_name'],
                 )
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16936136 entries, 10 to 19098861
Data columns (total 5 columns):
user_id object
timestamp object
artist_name object
track_id object
track_name object
dtypes: object(5)
memory usage: 775.3+ MB
Next, build some helper data: assign every user and every track an index of its own, and create bidirectional lookups from index to id and from id to index.
df['user_id'] = df['user_id'].astype('category')
df['track_id'] = df['track_id'].astype('category')

# .cat.categories is a pandas Index, so positional access already works like an index -> id dict.
user_index_to_user_id_dict = df['user_id'].cat.categories
user_id_to_user_index_dict = {user_id: index for index, user_id in enumerate(df['user_id'].cat.categories)}

track_index_to_track_id_dict = df['track_id'].cat.categories
track_id_to_track_index_dict = {track_id: index for index, track_id in enumerate(df['track_id'].cat.categories)}

song_info_df = df[['artist_name', 'track_name', 'track_id']].drop_duplicates()
Because of covers, identical titles, and album re-releases, track_id has to serve as a song's unique identifier. When a song needs to be located by artist_name and track_name instead, the function below is used; its strategy is to pick the version that has been played the most.
def get_hot_track_id_by_artist_name_and_track_name(artist_name, track_name):
    # Several track_ids can share one (artist_name, track_name); return the most-played one.
    track = song_info_df[(song_info_df['artist_name'] == artist_name) & (song_info_df['track_name'] == track_name)]
    max_listened = 0
    hottest_row_index = 0
    for i in range(track.shape[0]):
        row = track.iloc[i]
        track_id = row['track_id']
        listened_count = df[df['track_id'] == track_id].shape[0]
        if listened_count > max_listened:
            max_listened = listened_count
            hottest_row_index = i
    return track.iloc[hottest_row_index]['track_id']
print ('wish you were here tracks:')
print (song_info_df[(song_info_df['artist_name'] == 'Pink Floyd') & (song_info_df['track_name'] == 'Wish You Were Here')][['track_id']])
print ('--------')
print ('hottest one:')
print (get_hot_track_id_by_artist_name_and_track_name('Pink Floyd', 'Wish You Were Here'))
wish you were here tracks:
track_id
60969 feecff58-8ee2-4a7f-ac23-dc8ce7925286
4401932 f479e316-56b4-4221-acd9-eed1a0711861
17332322 2210ba38-79af-4881-97ae-4ce8f32322c3
--------
hottest one:
feecff58-8ee2-4a7f-ac23-dc8ce7925286
Generating the sentences file
With the data loaded, the next step is to generate the "sentences" described above. Being too lazy to crawl NetEase Cloud Music playlists, I crudely treat all songs a user listened to within a single day as one session, identify each song by the track_index generated above, and write the resulting sentences to disk.
def generate_sentence_file(df):
    with open('sentences.txt', 'w') as sentences:
        for user_index in tqdm(range(len(user_index_to_user_id_dict))):
            user_id = user_index_to_user_id_dict[user_index]
            user_df = df[df['user_id'] == user_id].sort_values('timestamp')
            session = list()
            last_time = None
            for index, row in user_df.iterrows():
                this_time = row['timestamp']
                track_index = track_id_to_track_index_dict[row['track_id']]
                # A new day starts a new session; check last_time first so we
                # never call arrow.get(None) on the user's first row.
                if last_time is not None and arrow.get(this_time).date() != arrow.get(last_time).date():
                    sentences.write(' '.join([str(_id) for _id in session]) + '\n')
                    session = list()
                session.append(track_index)
                last_time = this_time
            # Flush the user's final session, which the loop above never writes.
            if session:
                sentences.write(' '.join([str(_id) for _id in session]) + '\n')

generate_sentence_file(df)
100%|██████████| 992/992 [1:22:23<00:00, 5.62s/it]
The generated file looks like this:
Training the model to generate embeddings
There are many ways to get a Word2vec implementation: write one on top of a neural network with TensorFlow or Keras, use the implementation Google published on Google Code, or grab the excellent gensim library from GitHub and use its ready-made implementation.
The code below uses smart_open to read the previously generated sentences.txt line by line, which is very memory-friendly. Each song is represented by a 50-dimensional vector, songs played fewer than 20 times in total are filtered out, and the window size is set to 10.
from smart_open import smart_open
from gensim.models import Word2Vec
import logging

logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)

class LastfmSentences(object):
    """Streams sentences from disk so the corpus never has to fit in memory."""
    def __init__(self, file_location):
        self.file_location = file_location

    def __iter__(self):
        for line in smart_open(self.file_location, 'r'):
            yield line.split()

lastfm_sentences = LastfmSentences('./sentences.txt')
# Note: gensim >= 4.0 renamed the `size` parameter to `vector_size`.
model = Word2Vec(lastfm_sentences, size=50, min_count=20, window=10,
                 hs=0, negative=20, workers=4, sg=1, sample=1e-5)
If the training data were playlists, with each playlist as one sentence, then since co-occurring in the same playlist signals some commonality among its songs, you would want every pairwise relationship between the items taken into account; setting the window size to (max playlist length − 1) / 2 would give better results. Here, since sessions are split by user and by day, I simply picked 10 off the top of my head.
sample controls the downsampling of very frequent words, limiting the influence of overly popular ones on the model — Radiohead's Creep, for instance. There is a formula behind it that I won't detail here.
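For the curious, a sketch of that downsampling rule, following the formula used by the original word2vec C implementation (gensim follows the same scheme):

```python
import math

def keep_probability(freq, sample=1e-5):
    # Keep probability from the word2vec C code:
    #   p = (sqrt(f / sample) + 1) * sample / f
    # where f is the word's fraction of all tokens in the corpus; capped at 1.
    return min(1.0, (math.sqrt(freq / sample) + 1) * sample / freq)
```

With sample=1e-5, a track accounting for 1% of all plays keeps only about 3% of its occurrences, while a track at or below the 1e-5 threshold is never dropped.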
sg set to 0 or 1 selects the CBOW or Skip-gram algorithm respectively, while hs=1 selects hierarchical softmax and hs=0 (together with negative > 0) selects negative sampling.
Negative sampling deserves a few more words. During training, gradient descent has to adjust the weights between nodes, but the number of weights is huge — 2 × 50 × 960000 in this example — which makes training very inefficient. Negative sampling addresses this: take only the current sample's label as the positive example, randomly pick 5 to 20 (an empirical range) other words as negatives, and adjust only the weights connected to those words. Efficiency improves markedly while quality stays good. The negatives are drawn according to word frequency: the more often a word appears, the more likely it is to be chosen as a negative example.
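That frequency-based draw uses the unigram distribution raised to the 3/4 power, an exponent word2vec uses to dampen (but not remove) the advantage of frequent items. A minimal sketch with hypothetical play counts:

```python
import numpy as np

# Hypothetical play counts for three songs, most popular first.
counts = np.array([1000.0, 100.0, 10.0])

# Negative-sampling distribution: unigram frequencies to the 3/4 power, renormalized.
probs = counts ** 0.75
probs /= probs.sum()
# The most-played song is still the likeliest negative, but less dominant
# than under the raw frequency distribution.
```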
Using the embeddings
Each track now has its own low-dimensional vector learned from a large amount of data. Take Wish You Were Here: its embedding can serve as the song's representation in other machine-learning tasks such as learning to rank:
model.wv[str(track_id_to_track_index_dict[
    get_hot_track_id_by_artist_name_and_track_name(
        'Pink Floyd', 'Wish You Were Here')])]
array([-0.39100856, 0.28636533, 0.11853614, -0.41582254, 0.09754885,
0.59501815, -0.07997745, -0.28060785, -0.0384276 , -0.84899545,
0.03777567, -0.00727402, 0.6960302 , 0.44756493, -0.13245133,
-0.38473454, -0.07809031, 0.34377965, -0.19210865, -0.33457756,
-0.36364776, -0.06028108, 0.17379969, 0.46617758, -0.04116876,
0.07322323, 0.11769405, 0.42464802, 0.25167897, -0.35790011,
0.01991512, -0.10950506, 0.26131895, -0.76148427, 0.48405901,
0.61935854, -0.59583783, 0.28353232, -0.14503367, 0.3232002 ,
1.00872386, -0.10348291, -0.0485305 , 0.21677236, -1.33224928,
0.57913464, -0.06729769, -0.32185984, -0.02978219, -0.43034038], dtype=float32)
The similarity between these embedding vectors expresses how likely two songs are to appear in the same session:
shine_on_part_1 = str(track_id_to_track_index_dict[
    get_hot_track_id_by_artist_name_and_track_name('Pink Floyd', 'Shine On You Crazy Diamond (Parts I-V)')])
shine_on_part_2 = str(track_id_to_track_index_dict[
    get_hot_track_id_by_artist_name_and_track_name('Pink Floyd', 'Shine On You Crazy Diamond (Parts Vi-Ix)')])
good_times = str(track_id_to_track_index_dict[
    get_hot_track_id_by_artist_name_and_track_name('Chic', 'Good Times')])
print ('similarity between shine on part 1, 2:', model.wv.similarity(shine_on_part_1, shine_on_part_2))
print ('similarity between shine on part 1, good times:', model.wv.similarity(shine_on_part_1, good_times))
similarity between shine on part 1, 2: 0.927217
similarity between shine on part 1, good times: 0.425195
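Cosine similarity itself is a one-liner; as a plain-numpy sanity check (not part of the article's pipeline), this helper should reproduce what model.wv.similarity returns for the same two vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```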
A glance at the source shows that gensim's similarity function also uses cosine similarity. The same measure can generate recommendation lists — not by brute-force traversal, of course: internally gensim can also use Annoy, mentioned in the previous article, to build an index for fast nearest-neighbor lookup. For convenience, here are two wrapper functions.
def recommend_with_playlist(playlist, topn=25):
    if not isinstance(playlist, list):
        playlist = [playlist]
    playlist_indexes = [str(track_id_to_track_index_dict[track_id]) for track_id in playlist]
    similar_song_indexes = model.wv.most_similar(positive=playlist_indexes, topn=topn)
    return [track_index_to_track_id_dict[int(track[0])] for track in similar_song_indexes]

def display_track_info(track_ids):
    track_info = {
        'track_name': [],
        'artist_name': [],
    }
    for track_id in track_ids:
        track = song_info_df[song_info_df['track_id'] == track_id].iloc[0]
        track_info['track_name'].append(track['track_name'])
        track_info['artist_name'].append(track['artist_name'])
    print (pd.DataFrame(track_info))
Now let's pretend we're in a post-punk phase: supply a few songs and see what the model recommends:
# post punk.
guerbai_playlist = [
    ('Joy Division', 'Disorder'),
    ('Echo & The Bunnymen', 'The Killing Moon'),
    ('The Names', 'Discovery'),
    ('The Cure', 'Lullaby'),
]
display_track_info(recommend_with_playlist([
    get_hot_track_id_by_artist_name_and_track_name(track[0], track[1])
    for track in guerbai_playlist], 20))
track_name artist_name
0 Miss The Girl The Creatures
1 Splintered In Her Head The Cure
2 Return Of The Roughnecks The Chameleons
3 P.S. Goodbye The Chameleons
4 Chelsea Girl Simple Minds
5 23 Minutes Over Brussels Suicide
6 Not Even Sometimes The Prids
7 Windows A Flock Of Seagulls
8 Ride The Friendly Skies Lightning Bolt
9 Inmost Light Double Leopards
10 Thin Radiance Sunroof!
11 You As The Colorant The Prids
12 Love Will Tear Us Apart Boy Division
13 Slip Away Ultravox
14 Street Dude Black Dice
15 Touch Defiles Death In June
16 All My Colours (Zimbo) Echo & The Bunnymen
17 Summernight The Cold
18 Pornography (Live) The Cure
19 Me, I Disconnect From You Gary Numan
I had never seen many of these bands; a quick wiki check confirms that most of the songs are indeed post-punk and new wave. The funny part is that Love Will Tear Us Apart is credited to "Boy Division" — this dataset is cursed.
Half a year later, I'm immersed in long progressive rock epics:
# long progressive
guerbai_playlist = [
    ('Rush', '2112: Ii. The Temples Of Syrinx'),
    ('Yes', 'Roundabout'),
    ('Emerson, Lake & Palmer', 'Take A Pebble'),
    ('Jethro Tull', 'Aqualung'),
]
display_track_info(recommend_with_playlist([
    get_hot_track_id_by_artist_name_and_track_name(track[0], track[1])
    for track in guerbai_playlist]))
track_name artist_name
0 Nutrocker Emerson, Lake & Palmer
1 Brain Salad Surgery Emerson, Lake & Palmer
2 Black Moon Emerson, Lake & Palmer
3 Parallels Yes
4 Working All Day Gentle Giant
5 Musicatto Kansas
6 Farewell To Kings Rush
7 My Sunday Feeling Jethro Tull
8 Thick As A Brick, Part 1 Jethro Tull
9 South Side Of The Sky Yes
10 Living In The Past Jethro Tull
11 The Fish (Schindleria Praematurus) Yes
12 Starship Trooper Yes
13 Tank Emerson, Lake & Palmer
14 I Think I'M Going Bald Rush
15 Here Again Rush
16 Lucky Man Emerson, Lake & Palmer
17 Cinderella Man Rush
18 Stick It Out Rush
19 The Speed Of Love Rush
20 New State Of Mind Yes
21 Karn Evil 9: 2Nd Impression Emerson, Lake & Palmer
22 A Venture Yes
23 Cygnus X-1 Rush
24 Sweet Dream Jethro Tull
People change: she loves post-punk today and may love something else tomorrow. But with mathematics and collective intelligence on our side, what does that matter?
References
Using Word2vec for Music Recommendations
Word2Vec Tutorial - The Skip-Gram Model