Introduction to oneAPI
Intel's oneAPI aims to simplify programming across CPUs, GPUs, FPGAs, AI accelerators, and other compute engines. What it offers developers is one solution for multiple architectures. The Base Kit (Beta) was released around May/June 2020.
Official website link
Note that an Intel account may be required to download it.
Installation
The downloaded file ends in .exe and comes with a graphical installer; just keep clicking Next. The result after extraction and installation looks like this:
The installer is around 2 GB, and the finished installation takes up roughly 14 GB of disk space. Anyone familiar with Intel's acceleration libraries will recognize MKL, IPP, TBB, and so on. It has to be said that Intel's acceleration libraries perform very well when used correctly.
One more point worth making: the integrated graphics in Intel's new 11th-generation Core processors can already rival entry-level discrete GPUs, and Intel's own FPGA products also support OpenCL heterogeneous computing. So Intel is eager to roll out a new solution that lets developers write high-performance code without having to dig into low-level hardware languages such as OpenCL C or Verilog HDL.
Out of this a new language was born: DPC++ (Data Parallel C++). Intel designed DPC++ with syntax very close to CUDA, so a programmer who knows CUDA well should have no trouble programming in DPC++. At its core it is still C/C++, so with a C/C++ background the code is not hard to follow.
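To get a feel for the syntax, here is a minimal sketch of my own (not taken from the Intel samples; the variable names such as N, buf, and acc are made up for illustration). The queue plays roughly the role a stream plays in CUDA, the buffer/accessor pair handles host-device data movement, and parallel_for corresponds to a kernel launch:

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  constexpr size_t N = 8;
  int data[N] = {0};
  queue q;  // roughly comparable to picking a device and creating a CUDA stream
  {
    buffer<int, 1> buf(data, range<1>(N));  // manages host <-> device transfers
    q.submit([&](handler &h) {
      auto acc = buf.get_access<access::mode::write>(h);
      // roughly comparable to a CUDA kernel launch: one work-item per element
      h.parallel_for(range<1>(N),
                     [=](id<1> i) { acc[i] = static_cast<int>(i[0]) * 2; });
    });
  }  // leaving the scope destroys the buffer and copies results back to data
  for (size_t i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n";
  return 0;
}

Compared with CUDA there is no separate host/device memory-management API here; the buffer and its accessors express the data movement, and the runtime performs the copies.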
Testing
Intel provides a programming guide for oneAPI, although there is no Chinese version yet:
Official programming guide
You can also search for the documentation matching your chosen toolkit and language under the Documentation section in the left sidebar.
As described in the programming guide, Intel has put all of the sample code on GitHub:
The sample used here is the vector add example for the DPC++ compiler. Here is the code:
dpc_common.hpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#ifndef _DP_HPP
#define _DP_HPP
#pragma once
#include <stdlib.h>
#include <chrono>     // std::chrono used by Timer below
#include <exception>
#include <iostream>   // std::cout used in the exception handler
#include <CL/sycl.hpp>
namespace dpc {
// This exception handler will catch async exceptions.
static auto exception_handler = [](cl::sycl::exception_list eList) {
  for (std::exception_ptr const &e : eList) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};
class queue : public cl::sycl::queue {
  // Enable profiling by default
  cl::sycl::property_list prop_list =
      cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};

 public:
  queue()
      : cl::sycl::queue(cl::sycl::default_selector{}, exception_handler,
                        prop_list) {}
  queue(cl::sycl::device_selector &d)
      : cl::sycl::queue(d, exception_handler, prop_list) {}
  queue(cl::sycl::device_selector &d, cl::sycl::property_list &p)
      : cl::sycl::queue(d, exception_handler, p) {}
};
using Duration = std::chrono::duration<double>;

class Timer {
 public:
  Timer() : start(std::chrono::steady_clock::now()) {}

  Duration elapsed() {
    auto now = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<Duration>(now - start);
  }

 private:
  std::chrono::steady_clock::time_point start;
};
}  // namespace dpc
#endif
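The header above is a small convenience wrapper: dpc::queue enables profiling and installs the async exception handler by default, and dpc::Timer measures host-side wall-clock time. A minimal usage sketch (my own illustration, not part of the official sample) might look like this:

#include <iostream>
#include "dpc_common.hpp"

int main() {
  dpc::queue q;  // default device, async exception handler, profiling enabled
  std::cout << "Device: "
            << q.get_device().get_info<cl::sycl::info::device::name>() << "\n";

  dpc::Timer t;  // starts timing on construction
  q.submit([&](cl::sycl::handler &h) {
    // trivial kernel, just something to time
    h.parallel_for(cl::sycl::range<1>(1024), [=](cl::sycl::id<1>) {});
  });
  q.wait();
  std::cout << "Elapsed: " << t.elapsed().count() << " s\n";
  return 0;
}

Note that dpc::Timer only measures elapsed time on the host; the profiling property enabled on the queue is what would allow reading per-kernel event timings.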
vector-add-buffers.cpp
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of DPC++. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding DPC++ Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// DPC++ material used in the code sample:
// •  A one dimensional array of data.
// •  A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include "dpc_common.hpp"
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/intel/fpga_extensions.hpp>
#endif
using namespace sycl;
// Array type and data size for this example.
constexpr size_t array_size = 10000;
typedef std::array<int, array_size> IntArray;
//************************************
// Vector add in DPC++ on device: returns sum in 4th parameter "sum_parallel".
//************************************
void VectorAdd(queue &q, const IntArray &a_array, const IntArray &b_array,
               IntArray &sum_parallel) {
  // Create the range object for the arrays managed by the buffer.
  range<1> num_items{a_array.size()};

  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible to copy the data back to host when it
  // goes out of scope.
  buffer a_buf(a_array);
  buffer b_buf(b_array);
  buffer sum_buf(sum_parallel.data(), num_items);

  // Submit a command group to the queue by a lambda function that contains the
  // data access permission and device computation (kernel).
  q.submit([&](handler &h) {
    // Create an accessor for each buffer with access permission: read, write or
    // read/write. The accessor is a means to access the memory in the buffer.
    auto a = a_buf.get_access<access::mode::read>(h);
    auto b = b_buf.get_access<access::mode::read>(h);

    // The sum_accessor is used to store (with write permission) the sum data.
    auto sum = sum_buf.get_access<access::mode::write>(h);

    // Use parallel_for to run vector addition in parallel on device. This
    // executes the kernel.
    // 1st parameter is the number of work items.
    // 2nd parameter is the kernel, a lambda that specifies what to do per
    // work item. The parameter of the lambda is the work item id.
    // DPC++ supports unnamed lambda kernel by default.
    h.parallel_for(num_items, [=](id<1> i) { sum[i] = a[i] + b[i]; });
  });
}
//************************************
// Initialize the array from 0 to array_size - 1
//************************************
void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++) a[i] = i;
}
//************************************
// Demonstrate vector add both in sequential on CPU and in parallel on device.
//************************************
int main() {
  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

  // Create array objects with "array_size" to store the input and output data.
  IntArray a, b, sum_sequential, sum_parallel;

  // Initialize input arrays with values from 0 to array_size - 1.
  InitializeArray(a);
  InitializeArray(b);

  try {
    queue q(d_selector, dpc::exception_handler);

    // Print out the device information used for the kernel code.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "Vector size: " << a.size() << "\n";

    // Vector addition in DPC++.
    VectorAdd(q, a, b, sum_parallel);
  } catch (exception const &e) {
    std::cout << "An exception is caught for vector add.\n";
    std::terminate();
  }

  // Compute the sum of two arrays in sequential for validation.
  for (size_t i = 0; i < sum_sequential.size(); i++)
    sum_sequential[i] = a[i] + b[i];

  // Verify that the two arrays are equal.
  for (size_t i = 0; i < sum_sequential.size(); i++) {
    if (sum_parallel[i] != sum_sequential[i]) {
      std::cout << "Vector add failed on device.\n";
      return -1;
    }
  }

  // The cast avoids a narrowing conversion when initializing the int array.
  int indices[]{0, 1, 2, static_cast<int>(a.size() - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  // Print out the result of vector add.
  for (size_t i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "]: " << a[j] << " + " << b[j] << " = "
              << sum_parallel[j] << "\n";
  }

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}
With the code in hand, we can compile it and see how it runs. Intel ships a very complete toolkit here, so the compiler and debugger are all included. The catch is that without the environment variables set, none of these tools can be used. Intel therefore provides a preconfigured terminal: after installing oneAPI you get a dedicated command prompt:
As shown in the figure above, opening this terminal loads the environment variables automatically.
After that, all we need to do is change to the source directory and compile by hand.
Reference compile command:
dpcpp -O2 -g -std=c++17 -o vector-add-buffers.exe src/vector-add-buffers.cpp
If no errors are reported, the build succeeded. The files produced by a successful build are shown below:
Finally, run .\vector-add-buffers.exe to see the result:
You can see that it ran on the GPU, and adding two one-dimensional vectors of 10000 elements each finished almost instantly. How fast? About as fast as printing Hello World, which is why the comment at the top of the source file says that adding two one-dimensional vectors of the same length is the Hello World of data-parallel programs.
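For reference, based on the print statements in the source above, the output should look roughly like this (the first line shows whatever device was selected on your machine):

Running on device: <your CPU/GPU device name>
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.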
Note: do not run this build in an ordinary terminal (cmd/PowerShell); it will fail because the environment variables are not set there.
For a single source file, compiling directly on the command line works fine, but for a project with many .cpp files targeting multiple devices, it is better to use an IDE such as Visual Studio, or to use CMake to generate Makefiles (recommended, and common for cross-platform projects). In the sample code on GitHub, Intel also provides Visual Studio project files, so you can clone the whole repository and open it directly.