50 things I learned at NIPS 2016 by Andreas Stuhlmüller

Note: this post is reproduced in full from https://blog.ought.com/nips-2016-875bb8fadb8c

I learned many things about AI and machine learning at the NIPS 2016 conference. Here are a few that are particularly suited to being communicated in the space of a few sentences.

I’ve attempted to link to the person or people who inspired a particular thought, but there’s a lot of variation in how direct the connection is, and any particular item may not reflect the opinion of the linked person.

5680 people registered for NIPS this year

Applied machine learning

  • What methods win Kaggle competitions? Gradient tree boosting (especially XGBoost) and deep neural nets (especially convolutional nets for images and RNNs for some time series problems).

  • Ensembles add 2–5% in performance over the best individual methods, but also lead to more complex systems, so are often not worth it in practice.

  • Current machine learning techniques work best when training data and real data come from the same distribution. When an algorithm is likely to be applied in a setting that differs from the training setting, it can be good to have the test set come from a different distribution than the training set, mirroring how the real application data will differ as well. This way, you get a better sense of how the algorithm does under distribution shift.

  • More specifically, if you have two sources of data—say, a large set of general speech data and a much smaller set of in-car speech data—and you want to build a supervised learner that does well on the small set, Andrew Ng recommends this recipe that involves splitting each of the two sets and then step-by-step reducing each of four kinds of errors.
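
The recipe's bookkeeping is easy to sketch. Below is a minimal illustration with hypothetical error numbers (the split names and numbers are my assumptions, not from the talk): split the large set into train/train-dev, the small set into dev/test, and read off which of the four gaps dominates.

```python
# Rough sketch of the error bookkeeping in Ng's recipe; the numbers and
# the suggested fixes are illustrative assumptions, not from the talk.

def diagnose(human, train, train_dev, dev, test):
    print(f"avoidable bias:      {train - human:.2f}  (bigger model, train longer)")
    print(f"variance:            {train_dev - train:.2f}  (more data, regularization)")
    print(f"distribution shift:  {dev - train_dev:.2f}  (more in-domain data, synthesis)")
    print(f"overfitting to dev:  {test - dev:.2f}  (bigger dev set)")

# train/train-dev come from the large general set; dev/test from the small in-car set.
diagnose(human=0.01, train=0.03, train_dev=0.05, dev=0.10, test=0.11)
```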

The view from our hotel room in the morning

Neural nets

  • Why does deep learning work now, but not 20 years ago, even though many of the core ideas were there? In one sentence: We have more data, more compute, better software engineering, and a few algorithmic innovations (many layers, ReLUs, better initialization and learning rates, dropout, LSTMs).

  • But why does gradient-based optimization work at all in neural nets despite the non-convexity? One possible, partial answer is overprovisioning: There are generally many hidden units, and there are many ways a neural net can approximately implement the desired input-output relationship. You only need to find one.

  • There’s a potentially more biologically plausible alternative to backprop called equilibrium propagation that requires neither an explicit loss function nor gradients. Training works something like this: (1) Clamp the input of the system to some input value. (2) Let the system converge until there is a stable predicted output. (3) Measure some stats within the system. (4) Clamp the output to the true output value. (5) Let the system converge again. (6) Measure the same stats as before. (7) Update the system’s parameters based on the difference in stats.
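
Here is a structural sketch of that loop, assuming a toy leaky network as the "system" and co-activity statistics as the measured stats. It loosely follows the seven steps above and is illustrative only, not the actual algorithm from the paper.

```python
import numpy as np

# Structural sketch of the two-phase equilibrium propagation loop on a
# toy leaky network; converge() relaxes the hidden state, and the "stats"
# are unit co-activities. Illustrative assumptions throughout.

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3))           # the system's parameters

def converge(x, y_clamped=None, steps=50, eps=0.1):
    h = np.zeros(3)
    for _ in range(steps):
        drive = np.tanh(W @ h + x)
        if y_clamped is not None:                # weakly clamp toward the target
            drive = drive + 0.5 * (y_clamped - h)
        h = (1 - eps) * h + eps * drive          # relax toward a fixed point
    return h

x, y = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
h_free = converge(x)                             # (1)-(3): clamp input, settle, measure
h_clamped = converge(x, y_clamped=y)             # (4)-(6): also clamp output, settle, measure
W += 0.01 * (np.outer(h_clamped, h_clamped)
             - np.outer(h_free, h_free))         # (7): update from the difference in stats
```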

  • If you take an LSTM and add a “time gate” that controls at what frequency to be open to new input and how long to be open each time, you can have different neurons that learn to look at a sequence with different frequencies, create a “wormhole” for gradients, save compute, and do better on long sequences and when you need to process inputs from multiple sensors that are sampled at different rates.
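
For concreteness, here's my reading of the time gate from the Phased LSTM paper (Neil et al., NIPS 2016): each unit has a period τ, a phase shift s, and an open ratio r_on, and outside its open window the gate passes only a small leak α so gradients can still flow. Treat the exact form below as an assumption from my notes.

```python
import numpy as np

# Sketch of the "time gate" openness: rises and falls within the open
# window, and leaks slightly (alpha) when closed so gradients can flow.

def time_gate(t, tau, s, r_on, alpha=1e-3):
    phi = ((t - s) % tau) / tau                  # phase within the cycle, in [0, 1)
    if phi < r_on / 2:                           # first half of the open window: rising
        return 2 * phi / r_on
    if phi < r_on:                               # second half: falling
        return 2 - 2 * phi / r_on
    return alpha * phi                           # closed: tiny leak

# Two neurons watching the same stream at different frequencies:
for t in np.arange(0, 2, 0.25):
    print(t, round(time_gate(t, tau=1.0, s=0.0, r_on=0.2), 3),
             round(time_gate(t, tau=2.0, s=0.0, r_on=0.2), 3))
```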

The conference venue

Interacting with humans

  • Want to communicate a large dataset (of images, say) to a human using exemplars? One thing you can do is to first find a few prototypes (based on minimizing the Maximum Mean Discrepancy (MMD) between prototype and data distribution), then add a few particularly atypical instances where the prototype and data distributions differ most (by maximizing MMD).
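
A minimal sketch of the prototype half, with an RBF kernel and a greedy selection rule as my own stand-in choices (the MMD-critic paper is more careful about both):

```python
import numpy as np

# Greedy MMD-based prototype selection: repeatedly add the point that
# most reduces the MMD between the prototype set and the data.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # the dataset

def rbf(A, B, gamma=0.5):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(X, Z):
    return rbf(X, X).mean() - 2 * rbf(X, Z).mean() + rbf(Z, Z).mean()

prototypes = []
for _ in range(3):
    best = min(range(len(X)),
               key=lambda i: mmd2(X, X[prototypes + [i]]))
    prototypes.append(best)
print("prototype indices:", prototypes)
```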

  • Here’s one approach to making robots less annoying: When the human says, “do x now”, make the robot directly execute the command x. Then use data about when such commands happen to learn what to do when the human doesn’t give commands.

  • If we evaluate people’s questions as they try to figure out where ships are in a game of battleship, we find that people can judge which questions are best (according to expected information gain). But, for the most part, the questions they come up with themselves aren’t the most informative ones.
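
The scoring criterion is simple to state: a question's value is the expected reduction in entropy over hypotheses. A toy version (the hypotheses and questions here are made up, not from the paper):

```python
import numpy as np

# EIG(question) = H(prior over hypotheses) - E_answers[ H(posterior) ].

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def eig(prior, likelihood):
    """likelihood[a, h] = P(answer a | hypothesis h) for a yes/no question."""
    total = entropy(prior)
    for a in range(likelihood.shape[0]):
        p_a = (likelihood[a] * prior).sum()
        if p_a > 0:
            posterior = likelihood[a] * prior / p_a
            total -= p_a * entropy(posterior)
    return total

prior = np.ones(4) / 4                        # four equally likely ship layouts
q1 = np.array([[1, 1, 0, 0], [0, 0, 1, 1.]])  # splits hypotheses in half: 1 bit
q2 = np.array([[1, 1, 1, 0], [0, 0, 0, 1.]])  # lopsided question: less informative
print(eig(prior, q1), eig(prior, q2))         # ≈ 1.0 vs ≈ 0.81
```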

  • What can you do if you want to elicit people’s true beliefs in a crowdsourcing setting where you don’t have access to the ground truth, such as when you ask “Is this essay well-reasoned?” Yes, you could use the Bayesian Truth Serum, but what if you don’t want to ask subjects difficult meta-level questions? As long as you have multiple independent tasks, you can use the Correlated Agreement Mechanism, which suggests that you reward people when they agree on correlated tasks, and punish people for agreement on uncorrelated tasks.

  • When users interact with machine learning systems, it’s not just the systems that are learning—the users’ model of the system changes as well, but this is mostly neglected. How can we model this co-learning process?

  • It’s often useful to have a human in the loop when we build machine learning systems (e.g. so that the system can actively delegate particularly difficult tasks to the human, or ask questions). But we can’t differentiate through human minds (yet), which prevents gradient-based end-to-end optimization of the other components. Is there anything we can do about this?

The [Sagrada Família](https://en.wikipedia.org/wiki/Sagrada_Fam%C3%ADlia) from the inside

Bayes in the time of neurons

  • It’s appealing to consider building a posterior on neural net parameters instead of searching for a single good parameter setting; in fact, it is so appealing that the key ideas of Bayesian neural nets were developed around 1987–1995. See Yarin Gal’s thesis for a brief history.

  • People generally appreciate that this is a difficult task, but it’s easy to forget just how difficult this may be for real problems such as high-res image synthesis: the dimensionality of the parameter space is generally huge, much larger than the dimensionality of the input space, which may already be quite high-dimensional.

  • On the other hand, a version of Stochastic Gradient HMC seems to make Bayesian Neural Nets useful enough for Bayesian Optimization of the hyperparameters of another (non-Bayesian) neural net on real tasks.

  • Speaking of Bayesian Optimization: For short horizons (up to 30 steps), a neural net trained to do black-box optimization can do better than the standard Bayesian Gaussian Process approach and is a lot faster.

  • How can we deal with adversarial examples? One might hope that simply putting a prior on parameters and being Bayesian would do the trick. It doesn’t. Instead, the existence of such examples seems to have something to do with the fact that our models make overly confident, linear extrapolations. So, specific priors might help, but it doesn’t seem obvious how to find ones that prevent such examples and don’t hurt generalization a lot.

Barcelona

Planning & reinforcement learning

  • X isn’t about Y, now for artificial agents as well: Let’s model planners who choose not (just) based on whether their actions achieve some immediate goal in the world, but based on how well these plans signal something about the agent (such as the agent’s goals).

  • There are a number of approaches to hierarchical planning, including Hierarchical Abstract Machines, MaxQ, Skills, Dynamic Motion Primitives, and Options. So far, it has been a challenge to learn and benefit from the relevant abstractions within a single task, i.e. in the non-amortized setting. A new Option-Critic architecture seems to do somewhat better than Deep Q-Learning within some individual Atari games, but it doesn’t look like a big win yet. You can read this as an argument against current approaches, as an argument for the amortized setting, or both.

  • Here’s how you might start to approach hierarchical planning with Deep RL: Replace the usual function Policy(State) → Action with a parameterized function Policy(State, Task) → Action. An action can either (a) recurse a level, using the same policy as before, but with a new task vector as input, (b) execute an action in the world, or (c) terminate the subtask and pop up a level.
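
As a structural sketch, the execution loop looks like this (the environment, tasks, and hand-coded policy below are stand-ins of mine, not from the talk):

```python
# Executing a hierarchical policy with recurse/act/terminate actions.

def run(policy, env, state, task, depth=0, max_depth=5):
    while True:
        kind, payload = policy(state, task)
        if kind == "recurse" and depth < max_depth:   # (a) same policy, new task vector
            state = run(policy, env, state, payload, depth + 1, max_depth)
        elif kind == "act":                           # (b) primitive action in the world
            state = env.step(payload)
        else:                                         # (c) pop up a level
            return state

class Env:
    def step(self, action):
        print("act:", action)
        return action                                 # toy: the action names the next state

def policy(state, task):                              # hypothetical hand-coded policy
    if task == "make_tea":
        if state == "cold": return ("recurse", "boil_water")
        if state == "hot":  return ("act", "steeping_tea")
        return ("terminate", None)
    if task == "boil_water":
        return ("act", "hot") if state == "cold" else ("terminate", None)
    return ("terminate", None)

run(policy, Env(), state="cold", task="make_tea")
```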

  • Experience replay is a bit of a hack. Ultimately, we’ll need something smarter.

  • Value iteration is similar enough to a sequence of convolutions and max-pooling layers that you can emulate an (unrolled) planning computation with a deep network: a value iteration network. This allows neural nets to do planning, e.g. moving from start to goal in grid-world, or navigating a website to answer a query.
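
The core observation is that one sweep of value iteration on a grid is a convolution (one kernel per action, propagating value from neighbors) followed by a max over actions. A numpy sketch of that observation (the kernels here are fixed one-hot moves; a value iteration network would learn them):

```python
import numpy as np
from scipy.signal import convolve2d

# One value iteration sweep = convolution + max over the action channel.

reward = np.full((5, 5), -0.1)
reward[4, 4] = 1.0                               # goal in a corner
kernels = []
for i, j in [(0, 1), (2, 1), (1, 0), (1, 2)]:    # up, down, left, right
    k = np.zeros((3, 3)); k[i, j] = 1.0
    kernels.append(k)

V, gamma = np.zeros((5, 5)), 0.9
for _ in range(30):                              # each iteration = one "layer"
    Q = [reward + gamma * convolve2d(V, k, mode="same") for k in kernels]
    V = np.max(Q, axis=0)                        # max-pool over actions
print(np.round(V, 2))                            # values increase toward the goal
```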

The Uber party

Reinforcement learning, in more depth

  • Which Deep RL methods work best? This review provides a big table comparing a number of state-of-the-art methods on multiple continuous control tasks, but all fail at tasks with hierarchical structure.

  • To make Deep RL work in practice, take a look at these tips & tricks from John Schulman’s Deep RL course.

  • One general lesson is that, as you iteratively improve your policy, it’s important to constrain the KL divergence between the old and new policy to be less than some constant δ. This δ (in units of nats) is better than a fixed step size, since the meaning of the step size changes depending on what the rewards and problem structure look like at different points in training. This is called Trust Region Policy Optimization (or, in a first-order variant, Proximal Policy Optimization), and it matters more as we do more experience replay.
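
Here's a toy illustration of the trust-region idea for a softmax policy: take a gradient step, then backtrack until the KL between the old and new action distributions is below δ. (Real TRPO uses a natural-gradient step and constrains the average KL over states; the backtracking line search and the gradient below are simplifications of mine.)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return (p * np.log(p / q)).sum()

theta = np.zeros(3)                        # logits of a 3-action softmax policy
g = np.array([1.0, -0.5, 0.2])             # stand-in for a policy gradient estimate
old = softmax(theta)

delta, step = 0.01, 1.0
while kl(old, softmax(theta + step * g)) > delta:
    step *= 0.5                            # backtrack until the KL constraint holds
theta += step * g
print("step:", step, "KL:", kl(old, softmax(theta)))
```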

  • If your policy has a small number of parameters (say 20), and sometimes even if it has a moderate number (say 2000), you might be better off using the Cross-Entropy Method than any of the fancy methods above. It works like this: (1) Sample n sets of parameters from some prior that allows for closed-form updating, e.g. a multivariate Gaussian. (2) For each parameter set, compute a noisy score by running your policy on the environment you care about. (3) Take the top 20% (say) of sampled parameter sets. Fit a Gaussian distribution to this set, then go to (1) and repeat using this as the new prior.
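
That loop is a few lines of numpy; the objective below is a stand-in for running a policy in an environment:

```python
import numpy as np

# The cross-entropy method loop described above, on a stand-in objective.

rng = np.random.default_rng(0)
target = rng.normal(size=20)

def noisy_score(theta):                        # stand-in for a noisy policy rollout
    return -((theta - target) ** 2).sum() + rng.normal(scale=0.1)

mu, sigma = np.zeros(20), np.ones(20)          # Gaussian "prior"
for it in range(50):
    thetas = rng.normal(mu, sigma, size=(100, 20))        # (1) sample n parameter sets
    scores = np.array([noisy_score(t) for t in thetas])   # (2) noisy scores
    elite = thetas[np.argsort(scores)[-20:]]              # (3) keep the top 20%
    mu, sigma = elite.mean(0), elite.std(0) + 1e-3        # refit, then repeat
print("distance to optimum:", np.linalg.norm(mu - target))
```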

  • For both RL and variational inference, there are two widely known ways of optimizing a policy (or variational distribution) based on sampled sequences of actions and outcomes: There’s (a) the likelihood-ratio estimator, which updates the policy such that action sequences that lead to higher scores happen more often, and doesn’t need gradients; and (b) the pathwise estimator, which adjusts individual actions such that the policy results in a higher score, and does need gradients. I previously assumed that, if you can use the pathwise estimator, it’s strictly better—but, for RL, it’s apparently the case that, while pathwise methods may be more sample-efficient, they work less generally due to high bias and don’t scale up as well to very high-dimensional problems. (Really?)
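
To make the contrast concrete, here are the two estimators side by side for a 1-d Gaussian "policy" with mean μ and a fixed score function (a toy example of mine, not from the talk); both estimate the same gradient of the expected score:

```python
import numpy as np

# Estimate d E[f(a)] / d mu for a ~ N(mu, 1), with f(a) = -(a - 3)^2.
# The true gradient is -2 (mu - 3) = 6 at mu = 0.

rng = np.random.default_rng(0)
mu, n = 0.0, 100_000
f = lambda a: -(a - 3.0) ** 2

# (a) likelihood-ratio / REINFORCE: weight scores by the score function
#     d log N(a; mu, 1) / d mu = (a - mu). Needs no gradient of f.
a = rng.normal(mu, 1.0, n)
lr_grad = (f(a) * (a - mu)).mean()

# (b) pathwise / reparameterization: write a = mu + eps and differentiate
#     through f, using df/da = -2 (a - 3). Needs gradients of f.
eps = rng.normal(0.0, 1.0, n)
pw_grad = (-2.0 * (mu + eps - 3.0)).mean()

print(lr_grad, pw_grad)                        # both ≈ 6; (a) is much noisier
```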

  • Suppose you want to train a neural net policy that can solve a fairly broad class of problems. Here’s one approach: (1) Sample 10 instances of the problem, and solve each of the instances using a problem-specific method, e.g. a method that fits and uses an instance-specific model. (2) Train the neural net to agree with all of the per-instance solutions. But if you’re going to do that, you might do even better by constraining the specific solutions and what the neural net policy would do to be close to each other from the start, fitting both simultaneously.

Barcelona

Generative adversarial nets

  • If you have a finite dataset {x1, x2, …}, say of images, and you want to generate more instances of x; or if you have pairs {(x1, y1), (x2, y2), …} and, given a new x, you would like to predict the corresponding y; then generative adversarial nets may be for you. At least in the domain of images, some of the most impressive results on such problems have been achieved using GANs.

  • How do GANs work? There are two parameterized differentiable functions, a generator G (think “counterfeiter”) and a discriminator D (think “police”). For x chosen from your dataset, we’ll optimize D(x) to be near 1. For x sampled from the generator using input noise z, i.e. x = G(z), we’ll optimize D(x) to be near 0, but at the same time, we’ll optimize G such that D(G(z)) is near 1. If all goes well, we’ll step-by-step optimize the discriminator until it’s really good at distinguishing generated from real data, and at the same time optimize the generator to be really good at sampling data that is indistinguishable from the real thing.
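
A minimal PyTorch sketch of that game, fitting a 1-d Gaussian (the architecture and hyperparameters are arbitrary choices of mine):

```python
import torch
import torch.nn as nn

# Counterfeiter G maps noise to samples; police D scores realness in [0, 1].
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0      # "dataset": samples from N(2, 0.5)
    fake = G(torch.randn(64, 1))               # samples from input noise z

    # Discriminator: push D(real) toward 1 and D(G(z)) toward 0.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(1000, 1)).mean().item())   # should drift toward 2
```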

  • If all goes well. So far, GANs are really finicky to train. Here’s a list of hacks that sometimes help.

  • Why are GAN image samples so sharp, whereas variational autoencoder samples aren’t? One hypothesis is that it has something to do with the fact that the loss function for VAEs is the likelihood. But we can make GANs maximize likelihood as well, and GAN samples are still sharp, so this seems less plausible now. The reason probably has more to do with the fact that VAEs typically use a Gaussian likelihood, or perhaps with some other component of the model architecture, such as the particular approximation strategy used (e.g., VAEs optimize a lower bound).

  • Three big open problems for GANs: (1) How do you address the fact that the minimax game between the generator and discriminator may never approach an equilibrium? In other words, how do you build a system using GANs so that you know that it will converge to a good solution? (2) Even if they do converge, current systems still have issues with global structure: they cannot count (e.g. the number of eyes on a dog) and frequently get long-range connections wrong (e.g. they show multiple perspectives as part of the same image). (3) How can we use GANs in discrete settings, such as for generating text?

The conference venue

Chat bots

  • What methods work best for dialog automation right now? This depends on what exactly the task is, but overall, some form of RNNs with extra memory seem to do best, and in particular seem to do better than n-grams and information retrieval methods (such as nearest neighbors and TF-IDF). Candidate architectures include LSTMs, the Hierarchical Recurrent Encoder-Decoder, a version thereof that adds a latent variable and that can perhaps handle ambiguity and uncertainty better, Multiresolution RNNs that attempt to learn some compositional structure, End-to-End Memory Networks, and an improved version of those. My impression is that nothing works really well so far.

  • On the other hand, Kaggle’s Allen AI Science Challenge, which required algorithmic participants to answer multiple-choice questions from a standardized 8th grade science exam, was won using information retrieval methods, not RNNs.
  • In dialog automation, one of the biggest challenges is building up an accurate picture (or state) that summarizes the dialog so far.
  • At Facebook, people are pursuing multiple approaches to dialog automation, but the main one is to go directly from dialog history to next response, without a transparent intermediate state that can be used for training/evaluation.
  • Facebook’s bAbI tasks include a variety of dialog tasks, including transactions (making a restaurant reservation), Q&A, recommendation, and chit-chat. The Ubuntu dataset with almost 1 million tech troubleshooting dialogs is another useful resource.
  • At the moment, some researchers build user simulators to train their dialog systems, but those are difficult to create: a simulator is effectively another dialog system, but it needs to mimic user behavior, and it’s hard to evaluate how well it is doing (in contrast to the dialog system that is being trained, there’s no notion of “task completion”).
  • If you can’t collect huge numbers of dialogs from real users, what can you do? One strategy is to first learn a semantic representation based on other datasets to “create a space in which reasoning can happen”, and then start using this pre-trained system for dialogs.
The view from our hotel room at night

Idea generators

  • Everything is an algorithm: It may be useful to view web experiments in the social sciences more explicitly as algorithms. Among other things, this makes it clearer that experimental design can take inspiration from existing algorithms, as in the case of MCMC with People. See also: If we formalize existing RL approaches such as training in simulation and reward shaping by writing them down as explicit protocol programs, maybe we can make it easier to incrementally improve these protocols. (I did some work on this project.)
  • Take some computation where you usually wouldn’t keep around intermediate states, such as a planning computation (say value iteration, where you only keep your most recent estimate of the value function) or stochastic gradient descent (where you only keep around your current best estimate of the parameters). Now keep around those intermediate states as well, perhaps reifying the unrolled computation in a neural net, and take gradients to optimize the entire computation with respect to some loss function. Instances: Value Iteration Networks, Learning to learn by gradient descent by gradient descent.
  • If we can overcome adversarial examples, we can train a neural net by giving it the score for a few prototypes—say designs for cars, and the rating a human designer assigned—and then use gradient descent on the inputs to synthesize exemplars that are better than any of the ones we can imagine. We have a “universal engineering machine”, if you like.
  • How can we implement high-level symbolic architectures using biological neural nets? Josh Tenenbaum now calls this “the modern mind-body problem”.
  • Neural nets still contain a lot of discrete structure, e.g. how many neurons there are, how many layers, what activation functions we use, and what’s connected to what. Is there a way to make it all continuous, so that we can run gradient descent on both parameters and structure, with no discrete parts at all?
The view from the Passion tower of the Sagrada Família

Tidbits and factoids

  • 20 years ago, Jürgen Schmidhuber’s first submission on LSTMs got rejected from NIPS.
  • For some products at Baidu, the main purpose is to acquire data from users, not revenue.
  • Boston Dynamics doesn’t use any learning in their robots (so far), including the new Spot Mini demoed at NIPS—it’s all manually programmed.
  • For speech recognition, ML algorithms are now benchmarked against teams of humans, not individuals.
  • When Zoubin Ghahramani asked who in the audience knew the PDP volumes, essentially no hands went up and he was sad.