Semantic Segmentation with OpenCV and ENet
Adapted from https://www.pyimagesearch.com/2018/09/03/semantic-segmentation-with-opencv-and-deep-learning/
Introduction
In this tutorial, you will learn how to perform semantic segmentation using OpenCV, deep learning, and the ENet architecture. After reading it, you will be able to apply semantic segmentation to images and videos with OpenCV. Deep learning has brought unprecedented accuracy to computer vision, including image classification, object detection, and now even segmentation. Traditional segmentation splits an image into regions (Normalized Cuts, Graph Cuts, GrabCut, superpixels, etc.), but those algorithms have no real understanding of what the regions represent.
A semantic segmentation algorithm, on the other hand, does the following:
- 1. partitions the image into meaningful parts, while
- 2. associating every pixel in the input image with a class label (i.e., person, road, car, bus, etc.).
Semantic segmentation algorithms are powerful and have many use cases, including self-driving cars. In today's post, I will show you how to apply semantic segmentation to road-scene images and video!
Semantic segmentation with OpenCV and deep learning
In this post we will discuss the ENet deep learning architecture and demonstrate how to use ENet to perform semantic segmentation on images and video streams.
The ENet semantic segmentation architecture
The semantic segmentation framework we will use in this tutorial is ENet, from Paszke et al.'s 2016 paper ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation:
Abstract: ...In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× less FLOPs, has 79× less parameters, and provides similar or better accuracy to existing models. ... (In short: up to 18× faster, with 75× fewer FLOPs and 79× fewer parameters, at similar or better accuracy.)
A single forward pass took around 0.5 s on my (modest) laptop CPU (i5-6200); it will be much faster on a GPU (a hedged sketch for enabling OpenCV's CUDA backend follows below). Paszke et al. trained their model on The Cityscapes Dataset; you can pick whichever dataset fits your needs for training, and Cityscapes also ships with example images for urban scene understanding.
The model we will use was trained on 20 classes, including:
- Unlabeled (i.e., background)
- Road
- Sidewalk
- Building
- Wall
- Fence
- Pole
- TrafficLight
- TrafficSign
- Vegetation
- Terrain
- Sky
- Person
- Rider
- Car
- Truck
- Bus
- Train
- Motorcycle
- Bicycle
Next, you will learn how to apply semantic segmentation to extract the pixel-to-class mapping for each of these categories in images and video streams. If you are interested in training your own ENet model for segmentation on a custom dataset, refer to this page, where the authors provide a tutorial on how to train.
Project structure
If you need the project source code, leave your email address in the comments below or message it through the official account.
Let's run tree in the project directory:
.
├── enet-cityscapes
│ ├── enet-classes.txt
│ ├── enet-colors.txt
│ └── enet-model.net
├── images
│ ├── example_01.png
│ ├── example_02.jpg
│ ├── example_03.jpg
│ └── example_04.png
├── output
│ └── massachusetts_output.avi
├── segment.py
├── segment.pyc
├── segment_video.py
└── videos
├── massachusetts.mp4
└── toronto.mp4
4 directories, 13 files
The project contains four directories:
- enet-cityscapes/ : contains the pre-trained deep learning model, the class labels, and the color list.
- images/ : contains four test images.
- output/ : the generated output video.
- videos/ : contains two sample videos for testing the program.
Next, we will walk through two Python scripts:
- segment.py : performs deep-learning semantic segmentation on a single image; we will test on single images first, then move on to video.
- segment_video.py : performs semantic segmentation on video.
Semantic segmentation on an image with OpenCV:
# import the necessary packages
import numpy as np
import argparse
import imutils
import time
import cv2
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
help="path to deep learning segmentation model")
ap.add_argument("-c", "--classes", required=True,
help="path to .txt file containing class labels")
ap.add_argument("-i", "--image", required=True,
help="path to input image")
ap.add_argument("-l", "--colors", type=str,
help="path to .txt file containing colors for labels")
ap.add_argument("-w", "--width", type=int, default=500,
help="desired width (in pixels) of input image")
args = vars(ap.parse_args())
First we import the required packages and set up the command-line arguments:
- numpy: the fundamental package for scientific computing in Python.
- argparse: Python's built-in command-line argument parser.
- imutils: a library of convenience functions for image processing (see the note below).
- time: time access and conversions.
- cv2: OpenCV; version 3.4+ is recommended.
Next, let's parse the class label file and the colors:
# load the class label names
CLASSES = open(args["classes"]).read().strip().split("\n")
# if a colors file was supplied, load it from disk
if args["colors"]:
COLORS = open(args["colors"]).read().strip().split("\n")
COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
COLORS = np.array(COLORS, dtype="uint8")
# otherwise, we need to randomly generate RGB colors for each class
# label
else:
# initialize a list of colors to represent each class label in
# the mask (starting with 'black' for the background/unlabeled
# regions)
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(CLASSES) - 1, 3),
dtype="uint8")
COLORS = np.vstack([[0, 0, 0], COLORS]).astype("uint8")
We first load CLASSES into memory. If a COLORS file with one color per class label was supplied, we load it as well; otherwise, we randomly generate a COLORS entry for each label. (A sketch of the color file format this parsing implies follows below.)
For better visualization, we use OpenCV to draw a legend that maps each color to its class name:
# initialize the legend visualization
legend = np.zeros(((len(CLASSES) * 25) + 25, 300, 3), dtype="uint8")
# loop over the class names + colors
for (i, (className, color)) in enumerate(zip(CLASSES, COLORS)):
    # draw the class name + color on the legend
    color = [int(c) for c in color]
    cv2.putText(legend, className, (5, (i * 25) + 17),
        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
    cv2.rectangle(legend, (100, (i * 25)), (300, (i * 25) + 25),
        tuple(color), -1)
The rendered legend appears on the left side of the result figure.
Next we apply deep-learning segmentation to the image:
# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])
# load the input image, resize it, and construct a blob from it,
# keeping in mind that the original input image dimensions
# ENet was trained on were 1024x512
image = cv2.imread(args["image"])
image = imutils.resize(image, width=args["width"])
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (1024, 512), 0,
    swapRB=True, crop=False)
# perform a forward pass using the segmentation model
net.setInput(blob)
start = time.time()
output = net.forward()
end = time.time()
# show the amount of time inference took
print("[INFO] inference took {:.4f} seconds".format(end - start))
This code performs semantic segmentation on the image with Python and OpenCV:
- cv2.dnn.readNet() : loads the model.
- Constructing a blob : since the ENet model was trained on 1024x512 input images, we use the same size here (see the shape check below).
- Feeding the blob into the network, performing a forward pass, and printing the inference time.
Visualizing the results
Finally, we need to visualize the results. In the remaining lines of the script, we generate a color mask and overlay it on the original image. Every pixel has a corresponding class label index, which lets us see the semantic segmentation result on screen.
# infer the total number of classes along with the spatial dimensions
# of the mask image via the shape of the output array
(numClasses, height, width) = output.shape[1:4]
# our output class ID map will be num_classes x height x width in
# size, so we take the argmax to find the class label with the
# largest probability for each and every (x, y)-coordinate in the
# image
classMap = np.argmax(output[0], axis=0)
# given the class ID map, we can map each of the class IDs to its
# corresponding color
mask = COLORS[classMap]
cv2.imshow("mask", mask)
# resize the mask and class map such that its dimensions match the
# original size of the input image (we're not using the class map
# here for anything else but this is how you would resize it just in
# case you wanted to extract specific pixels/classes)
mask = cv2.resize(mask, (image.shape[1], image.shape[0]),
    interpolation=cv2.INTER_NEAREST)
classMap = cv2.resize(classMap, (image.shape[1], image.shape[0]),
    interpolation=cv2.INTER_NEAREST)
# perform a weighted combination of the input image with the mask to
# form an output visualization
output = ((0.4 * image) + (0.6 * mask)).astype("uint8")
# show the input and output images
cv2.imshow("Legend", legend)
cv2.imshow("Input", image)
cv2.imshow("Output", output)
cv2.waitKey(0)
We first extract numClasses, height, and width from output, then compute classMap and mask. classMap holds, for each (x, y) coordinate of output, the class label index with the largest probability; using classMap as a NumPy array index then looks up the corresponding visualization color for every pixel. After that, a simple resize makes the dimensions match, and the mask is blended with the input image. (A toy example of this indexing trick follows below.)
Results on a single image:
Run the program with the appropriate command-line arguments; here is an example:
python3 segment.py --model enet-cityscapes/enet-model.net --classes enet-cityscapes/enet-classes.txt --colors enet-cityscapes/enet-colors.txt --image images/example_03.jpg
The final result:
It is easy to see that the model cleanly segments and accurately identifies the person and the bicycle, and it also picks out the road, the sidewalk, and the cars. (A sketch for isolating a single class from classMap follows below.)
Semantic segmentation on video:
The code for this part lives in segment_video.py . We first load the model and initialize the video stream:
# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])
# initialize the video stream and pointer to output video file
vs = cv2.VideoCapture(args["video"])
writer = None
# try to determine the total number of frames in the video file
try:
    prop = cv2.cv.CV_CAP_PROP_FRAME_COUNT if imutils.is_cv2() \
        else cv2.CAP_PROP_FRAME_COUNT
    total = int(vs.get(prop))
    print("[INFO] {} total frames in video".format(total))
# an error occurred while trying to determine the total
# number of frames in the video file
except:
    print("[INFO] could not determine # of frames in video")
    total = -1
Next we read the video stream frame by frame and feed each frame into the network; this part is largely the same as segment.py :
# loop over frames from the video file stream
while True:
    # read the next frame from the file
    (grabbed, frame) = vs.read()
    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break
    # construct a blob from the frame and perform a forward pass
    # using the segmentation model
    frame = imutils.resize(frame, width=args["width"])
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (1024, 512), 0,
        swapRB=True, crop=False)
    net.setInput(blob)
    start = time.time()
    output = net.forward()
    end = time.time()
    # infer the total number of classes along with the spatial
    # dimensions of the mask image via the shape of the output array
    (numClasses, height, width) = output.shape[1:4]
    # our output class ID map will be num_classes x height x width in
    # size, so we take the argmax to find the class label with the
    # largest probability for each and every (x, y)-coordinate in the
    # image
    classMap = np.argmax(output[0], axis=0)
    # given the class ID map, we can map each of the class IDs to its
    # corresponding color
    mask = COLORS[classMap]
    # resize the mask such that its dimensions match the original size
    # of the input frame
    mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]),
        interpolation=cv2.INTER_NEAREST)
    # perform a weighted combination of the input frame with the mask
    # to form an output visualization
    output = ((0.3 * frame) + (0.7 * mask)).astype("uint8")
Then, still inside the loop, we write the output frames to a video file (the writer's frame rate is discussed after the code):
    # check if the video writer is None
    if writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 30,
            (output.shape[1], output.shape[0]), True)
        # some information on processing single frame
        if total > 0:
            elap = (end - start)
            print("[INFO] single frame took {:.4f} seconds".format(elap))
            print("[INFO] estimated total time: {:.4f}".format(
                elap * total))
    # write the output frame to disk
    writer.write(output)
    # check to see if we should display the output frame to our screen
    if args["show"] > 0:
        cv2.imshow("Frame", output)
        key = cv2.waitKey(1) & 0xFF
        # if the `q` key was pressed, break from the loop
        if key == ord("q"):
            break
To produce the demo video shown below, run:
python3 segment_video.py --model enet-cityscapes/enet-model.net \
--classes enet-cityscapes/enet-classes.txt \
--colors enet-cityscapes/enet-colors.txt \
--video videos/massachusetts.mp4 \
--output output/massachusetts_output.avi
Demo video: http://player.bilibili.com/player.html?aid=31352524&cid=54788807&page=1
Finally, how to train your own model:
If you want to train your own model, refer to the tutorial provided by the ENet authors.
Summary
In this post we learned how to apply semantic segmentation with OpenCV, deep learning, and the ENet architecture. Using an ENet model pre-trained on the Cityscapes dataset, we were able to segment images and video streams into 20 classes in the context of self-driving cars and road scene segmentation, covering people (pedestrians and cyclists), vehicles (cars, trucks, buses, motorcycles, etc.), construction (buildings, walls, fences, etc.), as well as vegetation, terrain, and the ground itself. If you enjoyed today's post, please share it!
If you need the source code, just leave your email address in the comments below.
References
- A ten-minute introduction to image semantic segmentation (十分鐘看懂圖像語義分割技術(shù))
- [Paper] ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
- Training your own ENet model
- What are forward and backward passes in a neural network?
Corrections are welcome. If you found this article useful, please follow the WeChat official account "SLAM 技術(shù)交流" to keep supporting us :D