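1. Introduction
In the previous article we built a simple smart AI network surveillance camera with a "pure edge computing + streaming" architecture: it captures an HD video stream over the MIPI interface and runs face detection (YOLO) locally on the NPU. This article builds on that design by adding a facial landmark model. The NPU runs a two-model cascade, face detection (YOLO) followed by 68-point facial landmark localization. The CPU then computes the eye aspect ratio (EAR) from the landmark coordinates in real time to detect fatigue or dozing. The complete detection takes only about 36 ms per frame.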
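2. Inference Pipeline Design
2.1 Stage 1: YOLO-Fastest face detection and hardware image preprocessing
Image data comes from the OV5640 camera over the high-bandwidth MIPI CSI-2 interface. To offload the CPU, the design uses the RA8P1's built-in video input module (VIN) for hardware acceleration. VIN converts the YUV422 stream from MIPI to RGB565 and writes it "zero-copy" through its internal DMA into a frame buffer in external HyperRAM. The NPU then takes over and runs the INT8-quantized YOLO-Fastest face detection model. The model input is 192x192, and single-frame NPU inference takes only about 8 ms, so the model can locate the face region of interest (ROI) in the full frame at a high frame rate.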
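The inference thread calls `rgb565_to_gray_resize_192_and_quantization()` to turn the RGB565 frame into the detector's INT8 input, and `dequantize_int8()` to turn the INT8 outputs back into floats. Neither helper is listed in the article. The sketch below shows one way to do both in plain C under these assumptions:

- nearest-neighbour resizing, with no aspect-ratio preservation;
- a single-channel luma input normalised to [0, 1];
- the usual affine INT8 scheme q = round(x / scale) + zero_point.

The function names are hypothetical. The real scale and zero point must come from the compiled model.

```c
#include <stdint.h>

/* Expand one RGB565 pixel to 8-bit luma with integer BT.601-style weights.
 * 77 + 150 + 29 = 256, so pure white maps to exactly 255. */
static uint8_t rgb565_to_luma(uint16_t p)
{
    uint32_t r5 = (p >> 11) & 0x1Fu, g6 = (p >> 5) & 0x3Fu, b5 = p & 0x1Fu;
    uint32_t r = (r5 << 3) | (r5 >> 2); /* 5 -> 8 bits */
    uint32_t g = (g6 << 2) | (g6 >> 4); /* 6 -> 8 bits */
    uint32_t b = (b5 << 3) | (b5 >> 2);
    return (uint8_t)((r * 77u + g * 150u + b * 29u) >> 8);
}

/* Affine INT8 quantization: q = round(x / scale) + zero_point, clamped to int8. */
static int8_t quantize_i8(float x, float scale, int zero_point)
{
    float v = x / scale;
    int q = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f) + zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}

/* Hypothetical stand-in for rgb565_to_gray_resize_192_and_quantization():
 * nearest-neighbour resize of an RGB565 frame to a dst_w x dst_h gray tensor,
 * with pixels normalised to [0, 1] before quantization (assumed). */
void rgb565_to_gray_i8(const uint16_t *src, int src_w, int src_h,
                       int8_t *dst, int dst_w, int dst_h,
                       float scale, int zero_point)
{
    for (int dy = 0; dy < dst_h; dy++)
    {
        int sy = dy * src_h / dst_h;
        for (int dx = 0; dx < dst_w; dx++)
        {
            int sx = dx * src_w / dst_w;
            float gray = (float)rgb565_to_luma(src[sy * src_w + sx]) / 255.0f;
            dst[dy * dst_w + dx] = quantize_i8(gray, scale, zero_point);
        }
    }
}

/* Inverse mapping for NPU outputs: x = scale * (q - zero_point). */
void dequantize_i8(const int8_t *q, float *out, int n, float scale, int zero_point)
{
    for (int i = 0; i < n; i++)
        out[i] = scale * (float)(q[i] - zero_point);
}
```

For the 320x240 frame and 192x192 detector input used here, a call might be `rgb565_to_gray_i8((const uint16_t *)camera_capture_image_rgb565, 320, 240, in_i8, 192, 192, scale, zero_point)`. The scale and zero point would be read from the model's input tensor.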
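2.2 Stage 2: 68-point facial landmark localization and feature extraction
Once the YOLO model outputs a face bounding box, the Cortex-M85 CPU uses its scalar processing power to crop that ROI from the raw frame in HyperRAM and resize it as input for the second model. The NPU is then invoked again to run the 68-point facial landmark model. This model regresses the coordinates of the facial features, normalized to the 160x160 model input.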
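`CropResizeWithPad_RGB565()` is not listed either. Its outputs (`face_ratio`, `pad_left`, `pad_top`) are later used to map landmarks back to the frame, which suggests it letterboxes the face ROI into the 160x160 landmark input. The following is a hypothetical sketch of such a letterbox crop and of the inverse mapping. It assumes nearest-neighbour sampling, black padding and an explicit ROI origin.

`landmark_to_frame()` uses the same arithmetic as the drawing loop in the inference thread. Converting the 160x160 RGB565 patch into the model's 160x160x3 INT8 tensor is left out.

```c
#include <stdint.h>

#define FACE_IN 160 /* landmark-model input size used in this article */

typedef struct
{
    float ratio;  /* model-input pixels per source pixel */
    int pad_left; /* horizontal padding, in model-input pixels */
    int pad_top;  /* vertical padding, in model-input pixels */
} letterbox_t;

/* Scale a w x h ROI uniformly so its longer side fits FACE_IN, then centre
 * it by padding the shorter side. */
letterbox_t letterbox_params(int w, int h)
{
    letterbox_t lb;
    float rx = (float)FACE_IN / (float)w;
    float ry = (float)FACE_IN / (float)h;
    lb.ratio = rx < ry ? rx : ry;
    lb.pad_left = (FACE_IN - (int)((float)w * lb.ratio)) / 2;
    lb.pad_top = (FACE_IN - (int)((float)h * lb.ratio)) / 2;
    return lb;
}

/* Nearest-neighbour crop + letterbox of the ROI at (roi_x, roi_y) from an
 * RGB565 frame into a FACE_IN x FACE_IN RGB565 patch; padding is black. */
void crop_letterbox_rgb565(const uint16_t *frame, int frame_w, int frame_h,
                           int roi_x, int roi_y, int w, int h,
                           const letterbox_t *lb, uint16_t *dst)
{
    for (int dy = 0; dy < FACE_IN; dy++)
    {
        for (int dx = 0; dx < FACE_IN; dx++)
        {
            int ux = dx - lb->pad_left; /* position inside the scaled ROI */
            int uy = dy - lb->pad_top;
            uint16_t px = 0;
            if (ux >= 0 && uy >= 0)
            {
                int sx = roi_x + (int)((float)ux / lb->ratio);
                int sy = roi_y + (int)((float)uy / lb->ratio);
                if (sx < roi_x + w && sy < roi_y + h &&
                    sx >= 0 && sy >= 0 && sx < frame_w && sy < frame_h)
                    px = frame[sy * frame_w + sx];
            }
            dst[dy * FACE_IN + dx] = px;
        }
    }
}

/* Map a landmark normalised to the model input (0..1) back to frame pixels. */
void landmark_to_frame(float nx, float ny, const letterbox_t *lb,
                       int roi_x, int roi_y, int *fx, int *fy)
{
    *fx = (int)((nx * FACE_IN - (float)lb->pad_left) / lb->ratio) + roi_x;
    *fy = (int)((ny * FACE_IN - (float)lb->pad_top) / lb->ratio) + roi_y;
}
```

Scaling uniformly and padding, rather than stretching the ROI to a square, keeps the face's proportions intact. This matters for a geometric measure such as EAR.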
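2.3 High-speed XIP mapping of the external Flash
The weights of the two INT8 models are too large to fit comfortably in the MCU's 1 MB of internal MRAM. The design therefore uses the board's high-speed octal external Flash (W35T51NW, 64 MB). A hand-written FLM programming algorithm and the RA8P1's OSPI0 controller (chip select CS1) map this Flash in XIP (execute-in-place) mode into the CPU's physical address space at 0x90000000. During inference, the NPU reads the model weights from external Flash over the AXI/AHB bus as ordinary memory-mapped data, much as it reads internal SRAM, with no explicit copy into RAM. This balances compute against storage capacity.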
3. Implementation Details
3.1 Adding Flash XIP support
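1. Install the Renesas DFP 6.4.0 pack through Keil's Pack Installer, then locate the RA8P1 FLM template in the DFP installation directory:
PS: > ${RA_DFP_DIR}\Arm\Packs\Renesas\RA_DFP\6.4.0\Flash\RA8P1
2. The template code already supports OSPI flash (RA8P1_SI_8M), so only FlashPrg.c and FlashDev.c need to be adapted to the W35T51NW:
- Change the device start address, device type, and sector size
- Update the Flash command definitions (Program, Read, Erase, Status) according to the W35T51NW datasheet
- Update the Flash pin definitions according to the Titanboard SDK Flash file system guide
Note: the original template uses OSPI channel 1, CS1, so the OSPI channel and chip select must also be changed to channel 0, CS1, which is what the Titan Board uses. To use the full 64 MB of the Flash, the address mode must also be switched to 4 bytes, because 3-byte addresses only reach 16 MB.
3. After these changes, build the project to obtain the FLM algorithm file. Copy it into the RA DFP Flash directory or the Keil Flash directory, then add it under Flash Download to enable programming of the external Flash:
PS: > ${RA_DFP_DIR}\Arm\Packs\Renesas\RA_DFP\6.4.0\Flash
# or
PS: > ${KEIL_INSTALL_DIR}\ARM\Flash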
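The flash ID printed in the test log (`Get flash id: 0x021a5bef`) ties these numbers together. The sketch below makes two assumptions:

- the 32-bit value holds the JEDEC ID bytes packed little-endian, with the manufacturer in the low byte (consistent with Winbond's code 0xEF);
- the capacity byte follows the common convention size = 2^code bytes.

Under those assumptions the ID decodes to 64 MiB, which is why 4-byte addressing is required. The sketch also shows where a flash offset appears in the 0x90000000 XIP window. The type and function names are hypothetical.

```c
#include <stdint.h>

#define XIP_BASE 0x90000000u /* OSPI0 CS1 XIP window used in this article */

typedef struct
{
    uint8_t manufacturer;  /* 0xEF is Winbond's JEDEC manufacturer code */
    uint8_t mem_type;
    uint8_t capacity_code; /* assumed: device size = 2^code bytes */
} jedec_id_t;

/* Assumes the ID bytes were packed little-endian, manufacturer first. */
jedec_id_t jedec_decode(uint32_t raw)
{
    jedec_id_t id;
    id.manufacturer = (uint8_t)(raw & 0xFFu);
    id.mem_type = (uint8_t)((raw >> 8) & 0xFFu);
    id.capacity_code = (uint8_t)((raw >> 16) & 0xFFu);
    return id;
}

uint64_t jedec_capacity_bytes(jedec_id_t id)
{
    return id.capacity_code < 64 ? (uint64_t)1 << id.capacity_code : 0;
}

/* 3 address bytes reach 2^24 = 16 MiB; anything larger needs 4. */
int address_bytes_needed(uint64_t capacity)
{
    return capacity > ((uint64_t)1 << 24) ? 4 : 3;
}

/* CPU/NPU address at which a flash offset appears once XIP is enabled. */
uint32_t xip_addr(uint32_t flash_offset)
{
    return XIP_BASE + flash_offset;
}
```

With 64 MiB mapped at 0x90000000, the last byte of the device sits at 0x93FFFFFF.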
Following the Titanboard SDK Flash file system guide, add OSPI support to the FSP project.
Initialize and enable the OSPI module:
fsp_err_t ospi_b_init(void)
{
/* By default, the flash device is in SPI mode, so it is necessary to open the OSPI module in SPI mode */
fsp_err_t err = FSP_SUCCESS;
spi_flash_direct_transfer_t transfer =
{
.command = RESET_VALUE,
.address = RESET_VALUE,
.data = RESET_VALUE,
.command_length = RESET_VALUE,
.address_length = RESET_VALUE,
.data_length = RESET_VALUE,
.dummy_cycles = RESET_VALUE
};
/* Open OSPI module */
err = R_OSPI_B_Open(&g_ospi_b0_ctrl, &g_ospi_b0_cfg);
ERROR(err, "R_OSPI_B_Open API FAILED");
/* Switch OSPI module to 1S-1S-1S mode to configure flash device */
err = R_OSPI_B_SpiProtocolSet(&g_ospi_b0_ctrl,
SPI_FLASH_PROTOCOL_EXTENDED_SPI);
ERROR(err, "R_OSPI_B_SpiProtocolSet API FAILED");
/* Reset flash device by driving OM_RESET pin */
R_XSPI0->LIOCTL_b.RSTCS0 = 0;
R_BSP_SoftwareDelay(OSPI_B_TIME_RESET_PULSE, OSPI_B_TIME_UNIT);
R_XSPI0->LIOCTL_b.RSTCS0 = 1;
R_BSP_SoftwareDelay(OSPI_B_TIME_RESET_SETUP, OSPI_B_TIME_UNIT);
return err;
}
/* Return the flash device and the OSPI module to extended SPI (1S-1S-1S) mode (function name assumed) */
fsp_err_t ospi_b_set_protocol_to_spi(void)
{
fsp_err_t err = FSP_SUCCESS;
spi_flash_direct_transfer_t transfer =
{
.command = RESET_VALUE,
.address = RESET_VALUE,
.data = RESET_VALUE,
.command_length = RESET_VALUE,
.address_length = RESET_VALUE,
.data_length = RESET_VALUE,
.dummy_cycles = RESET_VALUE
};
bsp_octaclk_settings_t octaclk =
{
.source_clock = RESET_VALUE,
.divider = RESET_VALUE
};
if (SPI_FLASH_PROTOCOL_EXTENDED_SPI == g_ospi_b0_ctrl.spi_protocol)
{
/* Do nothing */
}
else if (SPI_FLASH_PROTOCOL_8D_8D_8D == g_ospi_b0_ctrl.spi_protocol)
{
/* Transfer write enable command */
err = ospi_b_write_enable();
ERROR(err, "ospi_b_write_enable FAILED");
#if 0 //masked out for the Winbond OSPI Flash W35T51NW
/* Write to CFR5V Register to configure flash device interface mode */
transfer = g_ospi_b_direct_transfer[OSPI_B_TRANSFER_WRITE_CFR5V_OPI];
transfer.data = OSPI_B_DATA_SET_SPI_CFR5V_REGISTER;
err = R_OSPI_B_DirectTransfer(&g_ospi_b0_ctrl, &transfer, SPI_FLASH_DIRECT_TRANSFER_DIR_WRITE);
ERROR(err, "R_OSPI_B_DirectTransfer API FAILED");
#endif
/* Change the OCTACLK clock to 100 MHz in SDR mode without OM_DQS */
octaclk.source_clock = BSP_CLOCKS_SOURCE_CLOCK_PLL2P;
octaclk.divider = BSP_CLOCKS_OCTA_CLOCK_DIV_2;
R_BSP_OctaclkUpdate(&octaclk);
/* Switch OSPI module mode to SPI mode */
err = R_OSPI_B_SpiProtocolSet(&g_ospi_b0_ctrl, SPI_FLASH_PROTOCOL_EXTENDED_SPI);
ERROR(err, "R_OSPI_B_SpiProtocolSet API FAILED");
#if 0 //masked out for the Winbond OSPI Flash W35T51NW
/* Read back and verify CFR5V register data */
transfer = g_ospi_b_direct_transfer[OSPI_B_TRANSFER_READ_CFR5V_SPI];
err = R_OSPI_B_DirectTransfer(&g_ospi_b0_ctrl, &transfer, SPI_FLASH_DIRECT_TRANSFER_DIR_READ);
ERROR(err, "R_OSPI_B_DirectTransfer API FAILED");
if(OSPI_B_DATA_SET_SPI_CFR5V_REGISTER != (uint8_t)transfer.data)
{
ERROR(err, "Verify CFR5V register data FAILED");
return FSP_ERR_ABORTED;
}
#endif
}
else
{
ERROR(err, "Flash device does not support this mode\r\n");
return FSP_ERR_INVALID_MODE;
}
return FSP_SUCCESS;
}
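The thread below initializes HyperRAM (OSPI1) and the external Flash (OSPI0), then runs a quick HyperRAM read/write test:
#include "async_logger.h"
#include "raw_thread.h"
#include "hal_data.h"
#define HYPERRAM_BASE_ADDR 0x70000000
#define TEST_OFFSET 0x100000
#define FLASH_TEST_ADDRESS 0x00000000 // test sector 0
#define FLASH_MAPPED_ADDR 0x90000000 // XIP base address configured in FSP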
void ospi_flash_init(void);
void ospi_wait_for_ready(void);
void ospi_flash_test(void);
int _flash_init(void);
/* HyperRAM Test Thread entry function */
void raw_thread_entry(void)
{
/* Initialize OSPI1 (HyperRAM) */
fsp_err_t err = R_OSPI_B_Open(&g_ospi1_ctrl, &g_ospi1_cfg);
if (err == FSP_SUCCESS)
{
APP_PRINT("\r\n[ram_thread] HyperRAM Initialized Successfully!\r\n");
}
else
{
APP_PRINT("** ERROR: HyperRAM Init failed **\r\n");
APP_ERR_TRAP(err);
}
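// Initialize OSPI0 (external octal Flash)
if (0 == _flash_init())
{
APP_PRINT("\r\n[ram_thread] HyperFlash Initialized Successfully!\r\n");
}
else
{
APP_PRINT("** ERROR: HyperFlash Init failed **\r\n");
/* err still holds the successful HyperRAM result here, so pass an explicit error code */
APP_ERR_TRAP(FSP_ERR_ABORTED);
}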
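/* Simple read/write test */
uint32_t *test_ptr = (uint32_t *)(HYPERRAM_BASE_ADDR + TEST_OFFSET);
uint32_t magic_word = 0xDEADBEEF;
/* Write */
*test_ptr = magic_word;
/* Write back and invalidate the cache line so the read below comes from HyperRAM */
SCB_CleanInvalidateDCache_by_Addr(test_ptr, 4);
/* Read back and verify */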
uint32_t read_back = *test_ptr;
if (read_back == magic_word)
{
APP_PRINT("\r\n[ram_thread] HyperRAM Read/Write Test PASS! Data: 0x%08X\r\n", read_back);
}
else
{
APP_PRINT("** Error HyperRAM Test FAILED! Expected 0x%08X, got 0x%08X\r\n", magic_word, read_back);
}
while (1)
{
tx_thread_sleep(100);
}
}
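The test above checks a single word, so it cannot catch stuck or shorted address lines: one magic word reads back correctly even if several addresses alias to the same cell. A common alternative is an address-in-address test, sketched below as a hypothetical helper. It works as follows:

- each word is written with its own index XOR a seed;
- the cache is synchronised through an optional callback;
- every word is read back and compared.

Running it twice, with a seed and its bitwise complement, drives every bit of every word to both 0 and 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Optional cache-maintenance hook, called between the write and read passes. */
typedef void (*cache_sync_fn)(volatile void *addr, size_t bytes);

/* Address-in-address test: word i is written with (i ^ seed), then all words
 * are read back. If two addresses alias, the later write overwrites the
 * earlier one and the mismatch is counted. Returns the number of bad words. */
size_t mem_test_addr_pattern(volatile uint32_t *base, size_t words,
                             uint32_t seed, cache_sync_fn sync)
{
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ seed;

    if (sync != NULL)
        sync(base, words * sizeof(uint32_t)); /* push writes out to RAM, drop cached copies */

    size_t errors = 0;
    for (size_t i = 0; i < words; i++)
        if (base[i] != ((uint32_t)i ^ seed))
            errors++;
    return errors;
}
```

On the RA8P1, `sync` could wrap CMSIS `SCB_CleanInvalidateDCache_by_Addr()`, the call the thread above already uses, so the read-back comes from HyperRAM rather than from the D-cache. `base` would then be `(volatile uint32_t *)(HYPERRAM_BASE_ADDR + TEST_OFFSET)`.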
3.2 Training and deploying the facial landmark model
This step follows the 610265158/Peppa_Pig_Face_Landmark project. Download it:
$ git clone https://github.com/610265158/Peppa_Pig_Face_Landmark
$ cd Peppa_Pig_Face_Landmark/TRAIN/face_landmark
Download the WFLW facial landmark dataset and set its location in train_config.py:
config.DATA.root_path='../WFLW_images'
config.DATA.train_f_path='../WFLW_annotations/list_98pt_rect_attr_train_test/list_98pt_rect_attr_train.txt'
config.DATA.val_f_path='../WFLW_annotations/list_98pt_rect_attr_train_test/list_98pt_rect_attr_test.txt'
config.DATA.extra_data=False
Train:
$ sh run.sh
[Figure: training results]
Convert the model to ONNX:
$ python tools/convert_to_onnx.py --model ./keypoints.pth
Use Renesas RUHMI to compile the ONNX model into deployable C code, making sure to add the Flash weight-location option:
$ python mcu_compile.py ../face ../deploy_output --npu --weight-loc=Flash
Change the section of the model's command_stream and model_data arrays to __attribute__((aligned(16), section(".ospi0_cs1"))) so that they are programmed into the external Flash:
__attribute__((aligned(16), section(".ospi0_cs1"))) const uint8_t face_sub_0000_command_stream[15332]
__attribute__((aligned(16), section(".ospi0_cs1"))) const uint8_t face_sub_0000_model_data[684192]
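3.3 Fatigue detection
The design computes the eye aspect ratio (EAR) from the facial landmark coordinates in real time to determine whether the person is fatigued or dozing.
In the standard 68-point face model (0-based indices):
- Left eye (on the left of the image, i.e. the subject's right eye): points 36, 37, 38, 39, 40, 41
- Right eye (on the right of the image, i.e. the subject's left eye): points 42, 43, 44, 45, 46, 47
Take each eye's six points in order as p1 to p6. Here p1 and p4 are the eye corners, p2 and p3 lie on the upper eyelid, and p6 and p5 lie on the lower eyelid below them. Then:
EAR = (‖p2 − p6‖ + ‖p3 − p5‖) / (2 · ‖p1 − p4‖)
- The numerator is the sum of the two vertical distances across the eye, which reflects how far the eye is open.
- The denominator is the horizontal distance between the eye corners (the eye width). It is multiplied by 2 to balance the two distances summed in the numerator.
- With the eyes open, EAR typically fluctuates between about 0.25 and 0.35; when the eyes close, it drops sharply to below 0.1.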
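The NPU inference thread, which also performs the drowsiness check:
#include "async_logger.h"
#include "ear.h"
#include "face_landmarks.h"
#include "face_model.h"
#include "model.h"
#include "npu_infer_thread.h"
#include "pmu_ethosu.h"
#include "yolo_rtthread.h"
#include <math.h>
#include <stdio.h>  /* printf */
#include <stdlib.h> /* abs */
#include <string.h> /* memcpy */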
#define CAM_WIDTH 320
#define CAM_HEIGHT 240
#define COLOR_RED 0xF800
#define COLOR_GREEN 0x07E0
#define RGB565_GREEN 0x07E0
#define RGB565_RED 0xF800
#define RGB565_BLUE 0x001F
#define RGB565_YELLOW 0xFFE0
#define LANDMARK_COLOR RGB565_GREEN
extern uint8_t camera_capture_image_rgb565[];
extern TX_SEMAPHORE g_ai_start_sem;
extern TX_SEMAPHORE g_ai_done_sem;
int8_t in_i8[INPUT_W * INPUT_H] BSP_PLACE_IN_SECTION(".ospi1_cs0_noinit") BSP_ALIGN_VARIABLE(8);
float out_f1[output1_len] BSP_PLACE_IN_SECTION(".ospi1_cs0_noinit") BSP_ALIGN_VARIABLE(32);
float out_f2[output2_len] BSP_PLACE_IN_SECTION(".ospi1_cs0_noinit") BSP_ALIGN_VARIABLE(32);
int8_t in_i8_face[160 * 160 * 3] BSP_PLACE_IN_SECTION(".ospi1_cs0_noinit") BSP_ALIGN_VARIABLE(8);
float out_f1_landmarks[136] BSP_PLACE_IN_SECTION(".ospi1_cs0_noinit") BSP_ALIGN_VARIABLE(32);
void draw_bbox_rgb565(uint8_t *img_buffer, int img_w, int img_h, int x, int y, int w, int h, uint16_t color,
int line_width);
void draw_points_rgb565(uint8_t *img_buffer, int img_w, int img_h, int x, int y, int radius);
FatigueTracker ear_tracker;
/* Infer Thread entry function */
void npu_infer_thread_entry(void)
{
fsp_err_t status = FSP_SUCCESS;
status = RM_ETHOSU_Open(&g_rm_ethosu0_ctrl, &g_rm_ethosu0_cfg);
if (status != FSP_SUCCESS)
{
APP_PRINT("** [Infer Thread] Failed to start NPU **\r\n");
return;
}
init_fatigue_tracker(&ear_tracker);
while (1)
{
if (tx_semaphore_get(&g_ai_start_sem, TX_WAIT_FOREVER) == TX_SUCCESS)
{
ULONG infer_start_time = tx_time_get();
rgb565_to_gray_resize_192_and_quantization(camera_capture_image_rgb565, CAM_WIDTH, CAM_HEIGHT, in_i8);
memcpy(GetModelInputPtr_serving_default_image_input_0(), in_i8, INPUT_SIZE);
RunModel(false);
int8_t *output1 = GetModelOutputPtr_StatefulPartitionedCall_0_70273(); // 6
int8_t *output2 = GetModelOutputPtr_StatefulPartitionedCall_1_70283(); // 12
dequantize_int8(output1, out_f1, output1_len, scale_out1, zero_point_out1);
dequantize_int8(output2, out_f2, output2_len, scale_out2, zero_point_out2);
int16_t total = 0;
static det_box_t pool[540]; // 6*6*3 + 12*12*3 = 108 + 432 = 540
total += decode_output_layer(out_f1, GRID_SIZE_1, 0, CAM_WIDTH, CAM_HEIGHT, CONF_THRESH, pool + total,
(int16_t)(sizeof(pool) / sizeof(pool[0])) - total);
total += decode_output_layer(out_f2, GRID_SIZE_2, 1, CAM_WIDTH, CAM_HEIGHT, CONF_THRESH, pool + total,
(int16_t)(sizeof(pool) / sizeof(pool[0])) - total);
// nms
int32_t kept = nms_filter(pool, total, NMS_THRESH);
int32_t out_n = MIN(kept, MAX_BOXES); // use the first `kept` boxes, capped at MAX_BOXES
if (out_n > 0)
{
int x = pool->x1;
int y = pool->y1;
int w = abs(pool->x2 - pool->x1);
int h = abs(pool->y2 - pool->y1);
float face_ratio = 0;
int pad_left = 0, pad_top = 0;
CropResizeWithPad_RGB565(camera_capture_image_rgb565, w, h, (uint8_t *)in_i8_face, &face_ratio,
&pad_left, &pad_top);
memcpy(face_GetModelInputPtr_serving_default_input_0(), in_i8_face, 160 * 160 * 3);
face_RunModel(false);
int8_t *output_landmarks_buf = face_GetModelOutputPtr_PartitionedCall_2_70322(); // 136
dequantize_int8(output_landmarks_buf, out_f1_landmarks, 136, 0.05879964306950569, -2);
ULONG infer_end_time = tx_time_get();
/* Multiply before dividing so the result is not truncated to 0 when the tick rate exceeds 1 kHz */
ULONG infer_time_ms = (infer_end_time - infer_start_time) * 1000UL / TX_TIMER_TICKS_PER_SECOND;
/* Draw the face box only after cropping, so the red border is not fed into the landmark model */
draw_bbox_rgb565(camera_capture_image_rgb565, CAM_WIDTH, CAM_HEIGHT, x, y, w, h, COLOR_RED, 3);
bool is_sleeping = check_drowsiness(&ear_tracker, out_f1_landmarks);
if (is_sleeping)
{
printf("WARNING: DROWSINESS DETECTED!\n");
}
for (int i = 0; i < 68; ++i)
{
float norm_x = out_f1_landmarks[i * 2 + 0];
float norm_y = out_f1_landmarks[i * 2 + 1];
float kp_abs_x = norm_x * 160.0f;
float kp_abs_y = norm_y * 160.0f;
int cx = (int)((kp_abs_x - pad_left) / face_ratio) + x;
int cy = (int)((kp_abs_y - pad_top) / face_ratio) + y;
draw_points_rgb565(camera_capture_image_rgb565, 320, 240, cx, cy, 2);
}
APP_PRINT("\r\n[AI] NPU Inference Time: %lu ms, detect face: %d\r\n", infer_time_ms, out_n);
}
tx_semaphore_put(&g_ai_done_sem);
}
}
}
void draw_bbox_rgb565(uint8_t *img_buffer, int img_w, int img_h, int x, int y, int w, int h, uint16_t color,
int line_width)
{
uint16_t *img = (uint16_t *)img_buffer;
/* clip */
if (x < 0)
{
w += x;
x = 0;
}
if (y < 0)
{
h += y;
y = 0;
}
if (x + w > img_w)
w = img_w - x;
if (y + h > img_h)
h = img_h - y;
if (w <= 0 || h <= 0)
return;
/* Draw the top and bottom horizontal lines */
for (int lw = 0; lw < line_width; lw++)
{
if (y + lw < img_h)
{
for (int i = x; i < x + w; i++)
img[(y + lw) * img_w + i] = color;
}
if (y + h - 1 - lw >= 0)
{
for (int i = x; i < x + w; i++)
img[(y + h - 1 - lw) * img_w + i] = color;
}
}
/* Draw the left and right vertical lines */
for (int lw = 0; lw < line_width; lw++)
{
if (x + lw < img_w)
{
for (int j = y; j < y + h; j++)
img[j * img_w + (x + lw)] = color;
}
if (x + w - 1 - lw >= 0)
{
for (int j = y; j < y + h; j++)
img[j * img_w + (x + w - 1 - lw)] = color;
}
}
}
/***************************************************************************
* @details Draw a filled dot (a facial landmark) on an RGB565 image buffer
* @param img_buffer: image buffer pointer (byte pointer)
* @param img_w: image width
* @param img_h: image height
* @param x: center X coordinate (predicted landmark X)
* @param y: center Y coordinate (predicted landmark Y)
* @param radius: dot radius
******************************************************************************/
void draw_points_rgb565(uint8_t *img_buffer, int img_w, int img_h, int x, int y, int radius)
{
uint16_t *pixels = (uint16_t *)img_buffer;
// 1. Compute the dot's bounding box
int x_min = x - radius;
int x_max = x + radius;
int y_min = y - radius;
int y_max = y + radius;
// 2. Clip the bounding box to the image
if (x_min < 0)
x_min = 0;
if (x_max >= img_w)
x_max = img_w - 1;
if (y_min < 0)
y_min = 0;
if (y_max >= img_h)
y_max = img_h - 1;
int r2 = radius * radius;
// 3. Scan the bounding box and set the pixels inside the circle
for (int py = y_min; py <= y_max; py++)
{
for (int px = x_min; px <= x_max; px++)
{
int dx = px - x;
int dy = py - y;
if ((dx * dx + dy * dy) <= r2)
{
pixels[py * img_w + px] = LANDMARK_COLOR;
}
}
}
}
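`check_drowsiness()` and `FatigueTracker` are not listed either. A frequent way to turn per-frame EAR values into a drowsiness decision is to require the eyes to stay closed for a number of consecutive frames, so that ordinary blinks are ignored. The hypothetical tracker below does that. Two values are assumptions to tune, not values from the article:

- the 0.20 threshold, which sits between the open and closed EAR ranges given earlier;
- the 15-frame window, about 0.5 s at the ~36 ms per frame measured in this article.

```c
#include <stdbool.h>

#define EAR_CLOSED_THRESH 0.20f /* assumed: between "closed" (<0.1) and "open" (0.25-0.35) */
#define CLOSED_FRAMES_ALARM 15  /* assumed: about 0.5 s at ~36 ms per frame */

typedef struct
{
    int closed_frames; /* current run of consecutive closed-eye frames */
} drowsy_tracker_t;

void drowsy_tracker_init(drowsy_tracker_t *t)
{
    t->closed_frames = 0;
}

/* Feed one EAR value per frame; returns true while the eyes have been closed
 * for at least CLOSED_FRAMES_ALARM consecutive frames. A single open-eye
 * frame resets the run, so short blinks do not reach the alarm. */
bool drowsy_tracker_update(drowsy_tracker_t *t, float ear)
{
    if (ear < EAR_CLOSED_THRESH)
    {
        if (t->closed_frames < CLOSED_FRAMES_ALARM)
            t->closed_frames++; /* saturate instead of counting forever */
    }
    else
    {
        t->closed_frames = 0;
    }
    return t->closed_frames >= CLOSED_FRAMES_ALARM;
}
```

Feed the tracker one EAR value per frame in which a face was detected. Counting frames rather than wall-clock time keeps the sketch simple. If the frame rate varies, the window could instead be expressed in ticks from `tx_time_get()`.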
4. Results
[ram_thread] HyperRAM Initialized Successfully!
[net_thread] Waiting for PHY Auto-Negotiation (approx 2 seconds)...
Get flash id: 0x021a5befOSPI flash initialized successfully
[ram_thread] HyperFlash Initialized Successfully!
[ram_thread] HyperRAM Read/Write Test PASS! Data: 0xDEADBEEF
[net_thread] Checking Ethernet Link...
[Cam] MIPI Camera Init Success! Starting pipeline...
** [net_thread] Error: nx_ip_status_check failed! 0x43
[net thread] Ethernet link is up.
[net_thread] Waiting for Ethernet Link to be UP...
.....................
[net_thread] Ethernet Link is UP! IP Address Acquired.
[net_thread] Probing PC IP (ARP Resolve)...
[net_thread] Network Ready! Camera streaming allowed.
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 37 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
[AI] NPU Inference Time: 36 ms, detect face: 1
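5. Known issues
Although inference takes only 36 ms, the video demo shows that the detections take a while to stabilize. I plan to try a YOLO keypoint model next, which regresses the bounding box and the landmarks directly in a single pass and may work better. Given the limits of my own experience, there are surely shortcomings here; comments and corrections are very welcome.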
