Recycling Deep Learning Models with Transfer Learning

Deep learning exploits gigantic datasets to produce powerful models. But what can we do when our datasets are comparatively small? Transfer learning by fine-tuning deep nets offers a way to leverage existing datasets to perform well on new tasks.

By Zachary Chase Lipton

Deep learning models are indisputably the state of the art for many problems in machine perception. Using neural networks with many hidden layers of artificial neurons and millions of parameters, deep learning algorithms exploit both hardware capabilities and the abundance of gigantic datasets. In comparison, linear models saturate quickly and under-fit. But what if you don't have big data?

Say you have a novel research problem, such as identifying cancerous moles from photographs. Assume further that you have generated a dataset of 10,000 labeled images. While this might seem like big data, it's a pittance compared to the authoritative datasets on which deep learning has achieved its greatest successes. It might seem that all is lost. Fortunately, there's hope.

Consider the famous ImageNet database created by Stanford Professor Fei-Fei Li. The dataset contains millions of images, each belonging to one of 1,000 distinct, hierarchically organized object categories. As Dr. Li relates in her TED talk, a child sees millions of distinct images while growing up. Intuitively, to train a model strong enough to compete with human object recognition capabilities, a similarly large training set might be required. Further, the methods which dominate at this scale of data might not be those which dominate on the same task given smaller datasets. Dr. Li's insight quickly paid off: Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton handily won the 2012 ImageNet competition, establishing convolutional neural networks (CNNs) as the state of the art in computer vision.

To extend Dr. Li's metaphor, although a child sees millions of images throughout development, once grown, a human can easily learn to recognize a new class of object without learning to see from scratch. Now consider a CNN trained for object recognition on the ImageNet dataset. Given an image from any of the 1,000 ImageNet categories, the top-most hidden layer of the network constitutes a single vector representation flexible enough to be useful for classifying images into each of the 1,000 categories. It seems reasonable to guess that this representation would also be useful for an additional, as-yet-unseen object category.

Following this line of thinking, computer vision researchers now commonly use pre-trained CNNs to generate representations for novel tasks, where the dataset may not be large enough to train an entire CNN from scratch. Another common tactic is to take the pre-trained ImageNet network and then to fine-tune the entire network to the novel task. This is typically done by training end-to-end with backpropagation and stochastic gradient descent. I first learned about this fine-tuning technique last year in conversations with Oscar Beijbom. Beijbom, a vision researcher at UCSD, has successfully used this technique to differentiate pictures of various types of coral.
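To make the two tactics concrete, here is a minimal sketch of my own (not from the article), using PyTorch and torchvision's pre-trained AlexNet as a stand-in for the ImageNet network; the number of new classes and the optimizer settings are hypothetical placeholders.

```python
# A minimal sketch of the two common tactics, assuming PyTorch/torchvision.
# NUM_NEW_CLASSES and the SGD settings are placeholders, not values from the article.
import torch
import torch.nn as nn
from torchvision import models

NUM_NEW_CLASSES = 2  # e.g., cancerous vs. benign moles

# Tactic 1: use the pre-trained net as a fixed feature extractor.
feature_net = models.alexnet(pretrained=True)
for param in feature_net.parameters():
    param.requires_grad = False                     # freeze all transferred weights
# Swap the 1000-way ImageNet classifier for a new head for the new task;
# only this new layer will be trained.
feature_net.classifier[6] = nn.Linear(4096, NUM_NEW_CLASSES)

# Tactic 2: fine-tune the entire network end-to-end with backpropagation and SGD.
finetune_net = models.alexnet(pretrained=True)
finetune_net.classifier[6] = nn.Linear(4096, NUM_NEW_CLASSES)
optimizer = torch.optim.SGD(finetune_net.parameters(), lr=1e-3, momentum=0.9)
# ...then train on the new, smaller dataset...
```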

A terrific paper from last year's NIPS conference by Jason Yosinski of Cornell, in collaboration with Jeff Clune, Yoshua Bengio and Hod Lipson, tackles this issue with a rigorous empirical investigation. The authors focus on the ImageNet dataset and the 8-layer AlexNet architecture of ImageNet fame. They note that the lowest layers of convolutional neural networks have long been known to resemble conventional computer vision features like edge detectors and Gabor filters. Further, they note that the topmost hidden layer is somewhat specialized to the task it was trained for. Therefore, they systematically explore the following questions:

Where does the transition from broadly useful to highly specialized features take place?

What is the relative performance of fine-tuning an entire network vs. freezing the transferred layers?

To study these questions, the authors restrict their attention to the ImageNet dataset, but consider two 50-50 splits of the data into subsets A and B. In the first set of experiments, they split the data by randomly assigning each image category to either subset A or subset B. In the second set of experiments, the authors consider a split with greater contrast, assigning to one set only natural images and to the other only images of man-made objects.
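For concreteness, here is a toy illustration of the two splits (my own sketch; the predicate used to separate man-made from natural categories is a hypothetical stand-in for the grouping used in the paper).

```python
# Toy illustration of the two 50-50 splits of the 1000 ImageNet categories.
# is_man_made() is a hypothetical placeholder for the paper's actual grouping.
import random

all_classes = list(range(1000))          # the 1000 ImageNet category ids

# Split 1: assign each category to subset A or subset B at random.
random.shuffle(all_classes)
subset_A, subset_B = all_classes[:500], all_classes[500:]

# Split 2 (greater contrast): man-made objects vs. natural images.
# subset_A = [c for c in all_classes if is_man_made(c)]
# subset_B = [c for c in all_classes if not is_man_made(c)]
```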

The authors examine the case in which the network is pre-trained on task A and then trained further on task B, keeping the first k layers from the pre-trained network and randomly initializing the remaining 8-k layers. They also consider a control: pre-training on task B, randomly initializing the top 8-k layers, and then continuing to train on the same task B.
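A rough sketch of that setup, under the assumption that the 8-layer network is exposed as an ordered sequence of layer modules (the function and argument names below are mine, not the paper's code):

```python
# Sketch of the transfer experiment: copy the first k layers from a network
# pre-trained on the source task, re-initialize the remaining layers, and
# either freeze the copied layers or leave them free for fine-tuning.
import copy
import torch.nn as nn

def make_transfer_net(pretrained_layers, fresh_layers, k, freeze_transferred=True):
    """Both arguments are assumed to be ordered iterables over the same
    8-layer AlexNet-style architecture; fresh_layers is randomly initialized."""
    layers = []
    for i, (pre, fresh) in enumerate(zip(pretrained_layers, fresh_layers)):
        if i < k:
            layer = copy.deepcopy(pre)              # transfer layer i
            if freeze_transferred:
                for p in layer.parameters():
                    p.requires_grad = False         # frozen: no further updates
        else:
            layer = fresh                           # random re-initialization
        layers.append(layer)
    return nn.Sequential(*layers)
```

With freeze_transferred=False, the same construction yields the fine-tuned variant discussed next.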

In all cases, they find that fine-tuning the pre-trained network end-to-end outperforms freezing the transferred k layers completely. Interestingly, they also find that when the pre-trained layers are frozen, the transferability of features is non-linear in the number of transferred layers. They hypothesize that this owes to complex interactions between the nodes in adjacent layers, which are difficult to relearn.

To wax poetic, this might be analogous to performing a hemispherectomy on a fully grown human, replacing the right half of the brain with a blank-slate child's brain. Ignoring the biological absurdity of this example, it might seem intuitive that a fresh right hemisphere would have trouble filling the role expected of one to which the left hemisphere had fully adapted.

I defer to the original paper for a complete description of the experiments. One question which is not addressed in this paper, but which could inspire future research, is how these results hold up when the new task has far fewer examples than the original task. In practice, this imbalance in the number of labeled examples between the original task and the new one is often what motivates transfer learning. A preliminary approach could be to repeat this exact work but with a 999-to-1 split of the categories instead of a 500-500 split.

Generally, data labeling is expensive. For many tasks it may even be prohibitively expensive. Thus it's of economic and academic interest to develop a deeper understanding of how we could leverage the labels we do have to perform well on tasks not as well endowed. Yosinski's work is a great step in this direction.

Zachary Chase Lipton is a PhD student in the Computer Science Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.

Related:

Deep Learning and the Triumph of Empiricism

Not So Fast: Questioning Deep Learning IQ Results

The Myth of Model Interpretability

(Deep Learning’s Deep Flaws)’s Deep Flaws

Data Science’s Most Used, Confused, and Abused Jargon

Differential Privacy: How to make Privacy and Data Mining Compatible
