Introduction to oneAPI
Intel's oneAPI aims to simplify programming across CPUs, GPUs, FPGAs, AI accelerators, and other compute engines. What it offers developers is one solution for multiple architectures. The Base Kit (Beta) was released around May/June 2020.
Official website link
Note that an Intel account may be required to download it.
Installation
The downloaded file ends in .exe and comes with a graphical installer; just keep clicking Next. The result after extraction and installation looks like this:
The installer is around 2 GB, and the finished installation takes up roughly 14 GB of disk space. Anyone familiar with Intel's acceleration libraries will recognize MKL, IPP, TBB, and so on. It has to be said that Intel's acceleration libraries perform very well when used correctly.
One more point worth making: the integrated graphics in Intel's new 11th-generation Core processors can already rival entry-level discrete GPUs, and Intel's own FPGA products also support OpenCL heterogeneous computing. So Intel is eager to roll out a new solution that lets developers write high-performance code without having to dig into low-level hardware languages such as OpenCL C or Verilog HDL.
Out of this a new language was born: DPC++ (Data Parallel C++). Intel designed DPC++ with syntax very close to CUDA, so a programmer who knows CUDA well should have no trouble programming in DPC++. At its core it is still C/C++, so with a C/C++ background the code is not hard to follow.
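To get a feel for the syntax, here is a minimal sketch of my own (not taken from the Intel samples; the variable names such as N, buf, and acc are made up for illustration). The queue plays roughly the role a stream plays in CUDA, the buffer/accessor pair handles host-device data movement, and parallel_for corresponds to a kernel launch:

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  constexpr size_t N = 8;
  int data[N] = {0};
  queue q;  // roughly comparable to picking a device and creating a CUDA stream
  {
    buffer<int, 1> buf(data, range<1>(N));  // manages host <-> device transfers
    q.submit([&](handler &h) {
      auto acc = buf.get_access<access::mode::write>(h);
      // roughly comparable to a CUDA kernel launch: one work-item per element
      h.parallel_for(range<1>(N),
                     [=](id<1> i) { acc[i] = static_cast<int>(i[0]) * 2; });
    });
  }  // leaving the scope destroys the buffer and copies results back to data
  for (size_t i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n";
  return 0;
}

Compared with CUDA there is no separate host/device memory-management API here; the buffer and its accessors express the data movement, and the runtime performs the copies.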
Testing
Intel provides a programming guide for oneAPI, although there is no Chinese version yet:
Official programming guide
You can also search for the documentation matching your chosen toolkit and language under the Documentation section in the left sidebar.
As described in the programming guide, Intel has put all of the sample code on GitHub:
The sample used here is the vector add example for the DPC++ compiler. Here is the code:
dpc_common.hpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#ifndef _DP_HPP
#define _DP_HPP
#pragma once
#include <stdlib.h>
#include <chrono>     // std::chrono used by Timer below
#include <exception>
#include <iostream>   // std::cout used in the exception handler
#include <CL/sycl.hpp>
namespace dpc {
// This exception handler will catch async exceptions.
static auto exception_handler = [](cl::sycl::exception_list eList) {
  for (std::exception_ptr const &e : eList) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};
class queue : public cl::sycl::queue {
  // Enable profiling by default
  cl::sycl::property_list prop_list =
      cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};

 public:
  queue()
      : cl::sycl::queue(cl::sycl::default_selector{}, exception_handler,
                        prop_list) {}
  queue(cl::sycl::device_selector &d)
      : cl::sycl::queue(d, exception_handler, prop_list) {}
  queue(cl::sycl::device_selector &d, cl::sycl::property_list &p)
      : cl::sycl::queue(d, exception_handler, p) {}
};
using Duration = std::chrono::duration<double>;

class Timer {
 public:
  Timer() : start(std::chrono::steady_clock::now()) {}

  Duration elapsed() {
    auto now = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<Duration>(now - start);
  }

 private:
  std::chrono::steady_clock::time_point start;
};
}  // namespace dpc
#endif
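The header above is a small convenience wrapper: dpc::queue enables profiling and installs the async exception handler by default, and dpc::Timer measures host-side wall-clock time. A minimal usage sketch (my own illustration, not part of the official sample) might look like this:

#include <iostream>
#include "dpc_common.hpp"

int main() {
  dpc::queue q;  // default device, async exception handler, profiling enabled
  std::cout << "Device: "
            << q.get_device().get_info<cl::sycl::info::device::name>() << "\n";

  dpc::Timer t;  // starts timing on construction
  q.submit([&](cl::sycl::handler &h) {
    // trivial kernel, just something to time
    h.parallel_for(cl::sycl::range<1>(1024), [=](cl::sycl::id<1>) {});
  });
  q.wait();
  std::cout << "Elapsed: " << t.elapsed().count() << " s\n";
  return 0;
}

Note that dpc::Timer only measures elapsed time on the host; the profiling property enabled on the queue is what would allow reading per-kernel event timings.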
vector-add-buffers.cpp
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of DPC++. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding DPC++ Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// DPC++ material used in the code sample:
// •  A one dimensional array of data.
// •  A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include "dpc_common.hpp"
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/intel/fpga_extensions.hpp>
#endif
using namespace sycl;
// Array type and data size for this example.
constexpr size_t array_size = 10000;
typedef std::array<int, array_size> IntArray;
//************************************
// Vector add in DPC++ on device: returns sum in 4th parameter "sum_parallel".
//************************************
void VectorAdd(queue &q, const IntArray &a_array, const IntArray &b_array,
               IntArray &sum_parallel) {
  // Create the range object for the arrays managed by the buffer.
  range<1> num_items{a_array.size()};

  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible to copy the data back to host when it
  // goes out of scope.
  buffer a_buf(a_array);
  buffer b_buf(b_array);
  buffer sum_buf(sum_parallel.data(), num_items);

  // Submit a command group to the queue by a lambda function that contains the
  // data access permission and device computation (kernel).
  q.submit([&](handler &h) {
    // Create an accessor for each buffer with access permission: read, write or
    // read/write. The accessor is a means to access the memory in the buffer.
    auto a = a_buf.get_access<access::mode::read>(h);
    auto b = b_buf.get_access<access::mode::read>(h);

    // The sum_accessor is used to store (with write permission) the sum data.
    auto sum = sum_buf.get_access<access::mode::write>(h);

    // Use parallel_for to run vector addition in parallel on device. This
    // executes the kernel.
    // 1st parameter is the number of work items.
    // 2nd parameter is the kernel, a lambda that specifies what to do per
    // work item. The parameter of the lambda is the work item id.
    // DPC++ supports unnamed lambda kernel by default.
    h.parallel_for(num_items, [=](id<1> i) { sum[i] = a[i] + b[i]; });
  });
}
//************************************
// Initialize the array from 0 to array_size - 1
//************************************
void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++) a[i] = i;
}
//************************************
// Demonstrate vector add both in sequential on CPU and in parallel on device.
//************************************
int main() {
  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

  // Create array objects with "array_size" to store the input and output data.
  IntArray a, b, sum_sequential, sum_parallel;

  // Initialize input arrays with values from 0 to array_size - 1.
  InitializeArray(a);
  InitializeArray(b);

  try {
    queue q(d_selector, dpc::exception_handler);

    // Print out the device information used for the kernel code.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "Vector size: " << a.size() << "\n";

    // Vector addition in DPC++.
    VectorAdd(q, a, b, sum_parallel);
  } catch (exception const &e) {
    std::cout << "An exception is caught for vector add.\n";
    std::terminate();
  }

  // Compute the sum of two arrays in sequential for validation.
  for (size_t i = 0; i < sum_sequential.size(); i++)
    sum_sequential[i] = a[i] + b[i];

  // Verify that the two arrays are equal.
  for (size_t i = 0; i < sum_sequential.size(); i++) {
    if (sum_parallel[i] != sum_sequential[i]) {
      std::cout << "Vector add failed on device.\n";
      return -1;
    }
  }

  // The cast avoids a narrowing conversion when initializing the int array.
  int indices[]{0, 1, 2, static_cast<int>(a.size() - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  // Print out the result of vector add.
  for (size_t i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "]: " << a[j] << " + " << b[j] << " = "
              << sum_parallel[j] << "\n";
  }

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}
With the code in hand, we can compile it and see how it runs. Intel ships a very complete toolkit here, so the compiler and debugger are all included. The catch is that without the environment variables set, none of these tools can be used. Intel therefore provides a preconfigured terminal: after installing oneAPI you get a dedicated command prompt:
As shown in the figure above, opening this terminal loads the environment variables automatically.
After that, all we need to do is change to the source directory and compile by hand.
Reference compile command:
dpcpp -O2 -g -std=c++17 -o vector-add-buffers.exe src/vector-add-buffers.cpp
If no errors are reported, the build succeeded. The files produced by a successful build are shown below:
Finally, run .\vector-add-buffers.exe to see the result:
You can see that it ran on the GPU, and adding two one-dimensional vectors of 10000 elements each finished almost instantly. How fast? About as fast as printing Hello World, which is why the comment at the top of the source file says that adding two one-dimensional vectors of the same length is the Hello World of data-parallel programs.
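For reference, based on the print statements in the source above, the output should look roughly like this (the first line shows whatever device was selected on your machine):

Running on device: <your CPU/GPU device name>
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.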
Note: do not run this build in an ordinary terminal (cmd/PowerShell); it will fail because the environment variables are not set there.
For a single source file, compiling directly on the command line works fine, but for a project with many .cpp files targeting multiple devices, it is better to use an IDE such as Visual Studio, or to use CMake to generate Makefiles (recommended, and common for cross-platform projects). In the sample code on GitHub, Intel also provides Visual Studio project files, so you can clone the whole repository and open it directly.