作者:Sam (甄峰) sam_code@hotmail.com
之前在一些ARM CPU下,曾在编译时指定过Neon。
0. Neon简介:
0.1: 简介:
ARM Advanced
SIMD延伸集,(ARM
Cortex-A系列处理器的128位SIMD架构扩展)称为NEON技术,它是一个结合64bit和128bit的SIMD(Single
Instruction Multiple Data单指令多数据)指令集。其针对多媒体和讯号处理程式具备标准化加速的能力,NEON具有一组广泛的指令集、各自的寄存器阵列,以及独立执行的硬件。ARM
ENON技术可加速多媒体和信号处理算法(如视频编码/解码、2D/3D图形、游戏、音频和语音处理、图像处理技术、电话和声音合成)。
NEON的寄存器:有16个128位四字寄存器Q0-Q15,32个64位双字寄存器D0-D31,两个寄存器是重叠的,在使用的时候需要特别注意,不小心就会被覆盖掉。
0.2: SIMD:
通常我们进行多媒体处理的时候,很多的数据都是16位或者8位的,如果这些程序运行在32位的机器上,那么计算机有一部分的计算单元是没有工作的.所以这是一种浪费.为了更好的使用那些被浪费的资源.SIMD就应运而生了.SIMD这种技术就是使用一条指令,但对多个相同类型和尺寸的数据进行并行处理.就像我们现实生活中的好几个人都在做同一件事情那样,这样就可以将速度提升很多倍
0.3:使用Neon的方式:
按Sam的理解,使用Neon的方式有以下几种:
A:使用C的neon内联函数。
B:直接使用Neon汇编指令。
C:使用某些第三方库如OpenMAX.
各自的优缺点:
A:使用Neon intrinsics函数,可以在直接接触ASM的情况下,使用Neon。这些函数被定义在:
arm_neon.h
中。
类似于:
vadd_s8
(int8x8_t __a, int8x8_t __b)
此方法需要注意2点:
1.必须: #include
2.编译时必须加入; -mfloat-abi=softfp -mfpu=neon
B:使用汇编指令:
效率最高. 使用intrinsics没法控制寄存器分配和内存对齐等。
1. Android NDK下
Neon的探索:
Sam着重研究的是使用intrinsics函数。研究标本为:/opt/android-ndk-r10b/samples/hello-neon/jni
为了不遗漏任何关键点,Sam从空文件main.cpp中开始写代码。
可以看到,代码基本全是从例子copy过来的。#include
必须要有。
main.cpp:
#include
#include
#include
#include
#include
#include
#include
#include
static void fir_filter_c(short *output, const short* input,
const short* kernel, int width, int kernelSize);
static double now_ms(void);
void fir_filter_neon_intrinsics(short *output, const short*
input, const short* kernel, int width, int kernelSize);
//#include
#include "cpu-features.h"
#define FIR_KERNEL_SIZE
32
#define FIR_OUTPUT_SIZE
2560
#define FIR_INPUT_SIZE
(FIR_OUTPUT_SIZE + FIR_KERNEL_SIZE)
#define FIR_ITERATIONS
600
static const short
fir_kernel[FIR_KERNEL_SIZE] =
{
0x10, 0x20, 0x40, 0x70,
0x8c, 0xa2, 0xce, 0xf0, 0xe9, 0xce, 0xa2, 0x8c, 070, 0x40, 0x20,
0x10,
0x10, 0x20, 0x40, 0x70,
0x8c, 0xa2, 0xce, 0xf0, 0xe9, 0xce, 0xa2, 0x8c, 070, 0x40, 0x20,
0x10 };
static short
fir_output[FIR_OUTPUT_SIZE];
static short
fir_input_0[FIR_INPUT_SIZE];
static const short* fir_input = fir_input_0 +
(FIR_KERNEL_SIZE/2);
static short
fir_output_expected[FIR_OUTPUT_SIZE];
static double now_ms(void)
{
struct timespec
res;
clock_gettime(CLOCK_REALTIME, &res);
return 1000.0*res.tv_sec
+ (double)res.tv_nsec/1e6;
}
static void fir_filter_c(short *output, const short* input,
const short* kernel, int width, int kernelSize)
{
int
offset = -kernelSize/2;
int
nn;
for (nn = 0; nn <
width; nn++) {
int sum = 0;
int mm;
for (mm = 0; mm < kernelSize; mm++) {
sum +=
kernel[mm]*input[nn+offset+mm];
}
output[nn] = (short)((sum + 0x8000) >>
16);
}
}
void fir_filter_neon_intrinsics(short *output, const short*
input, const short* kernel, int width, int kernelSize)
{
#if 1
int nn, offset =
-kernelSize/2;
for (nn = 0; nn <
width; nn++)
{
int mm, sum = 0;
int32x4_t sum_vec = vdupq_n_s32(0);
for(mm = 0; mm < kernelSize/4; mm++)
{
int16x4_t
kernel_vec = vld1_s16(kernel + mm*4);
int16x4_t
input_vec = vld1_s16(input +
(nn+offset+mm*4));
sum_vec =
vmlal_s16(sum_vec, kernel_vec, input_vec);
}
sum += vgetq_lane_s32(sum_vec, 0);
sum += vgetq_lane_s32(sum_vec, 1);
sum += vgetq_lane_s32(sum_vec, 2);
sum += vgetq_lane_s32(sum_vec, 3);
if(kernelSize & 3)
{
for(mm =
kernelSize - (kernelSize & 3); mm < kernelSize; mm++)
sum += kernel[mm] *
input[nn+offset+mm];
}
output[nn] = (short)((sum + 0x8000) >>
16);
}
#else
int nn, offset =
-kernelSize/2;
for (nn = 0; nn <
width; nn++) {
int sum = 0;
int mm;
for (mm = 0; mm < kernelSize; mm++) {
sum +=
kernel[mm]*input[nn+offset+mm];
}
output[n] = (short)((sum + 0x8000) >>
16);
}
#endif
}
int main(int argc, char** argv)
{
uint64_t features;
char buffer[512];
char tryNeon = 0;
double
t0, t1, time_c, time_neon;
if (android_getCpuFamily() != ANDROID_CPU_FAMILY_ARM) {
printf("\nCPU Not ARM\n");
return -1;
}
features = android_getCpuFeatures();
if ((features & ANDROID_CPU_ARM_FEATURE_ARMv7) == 0)
{
printf("\nNot an ARMv7 CPU !\n");
return -1;
}
if ((features &
ANDROID_CPU_ARM_FEATURE_NEON) == 0) {
printf("\nCPU doesn't support NEON !\n");
return -1;
}
printf("\nFeature: 0x%llx\n", features);
{
int nn;
for (nn = 0; nn < FIR_INPUT_SIZE; nn++)
{
fir_input_0[nn] = (5*nn) & 255;
}
fir_filter_c(fir_output_expected, fir_input,
fir_kernel, FIR_OUTPUT_SIZE, FIR_KERNEL_SIZE);
}
t0 = now_ms();
{
int count =
FIR_ITERATIONS;
for (; count > 0; count--) {
fir_filter_c(fir_output, fir_input, fir_kernel, FIR_OUTPUT_SIZE,
FIR_KERNEL_SIZE);
}
}
t1 = now_ms();
time_c = t1 - t0;
printf("\nFIR Filter benchmark:\nC version
: %g ms\n", time_c);
printf("\n\nNeon version : ", sizeof
buffer);
t0 = now_ms();
{
int count =
FIR_ITERATIONS;
for (; count > 0; count--) {
fir_filter_neon_intrinsics(fir_output, fir_input, fir_kernel,
FIR_OUTPUT_SIZE, FIR_KERNEL_SIZE);
}
}
t1 = now_ms();
time_neon = t1 -
t0;
printf(" %g ms (x%g
faster)\n", time_neon, time_c / (time_neon < 1e-6 ? 1. :
time_neon));
return 0;
}
Android.mk:
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_ARM_MODE := arm
LOCAL_MODULE := Test_Neon
LOCAL_STATIC_LIBRARIES := cpufeatures
LOCAL_SRC_FILES := main.cpp
LOCAL_CFLAGS := -DHAVE_NEON=1
-I/opt/android-ndk-r10b/sources/android/cpufeatures
-mfloat-abi=softfp -mfpu=neon
LOCAL_CXXFLAGS := -DHAVE_NEON=1
-I/opt/android-ndk-r10b/sources/android/cpufeatures
-mfloat-abi=softfp -mfpu=neon
LOCAL_LDLIBS :=
-L/opt/android-ndk-r10b/sources/android/libs/armeabi-v7a
-lcpufeatures
include $(BUILD_EXECUTABLE)
Application.mk:
# Build both ARMv5TE and ARMv7-A machine code.
APP_PLATFORM = android-8
APP_ABI := armeabi-v7a
#APP_ABI := $(ARM_ARCH)
#Sam modify it to release
APP_OPTIM := release
#APP_OPTIM := debug
#APP_OPTIM = $(MY_OPTIM)
APP_CPPFLAGS += -fexceptions
APP_CPPFLAGS += -frtti
#sam modify it from gnustl_static to gnustl_shared
#APP_STL := gnustl_static
#APP_STL := gnustl_shared
APP_STL := gnustl_shared
#APP_CPPFLAGS += -fno-rtti
#
APP_CPPFLAGS += -Dlinux -fsigned-char
APP_CFLAGS += -fsigned-char
#APP_CPPFLAGS += $(MY_CPPFLAGS) -Dlinux
#STLPORT_FORCE_REBUILD := true
编译过程会遇到以下几个问题:
1. android_getCpuFeatures系列函数未声明:
则加入#include
"cpu-features.h"
并在Android.mk中加入其路径:-I/opt/android-ndk-r10b/sources/android/cpufeatures
2. android_getCpuFeatures系列函数未定义:
进入:/opt/android-ndk-r10b/sources/android/cpufeatures
并编译之,Sam修改Android.mk, 把它编译成动态库。并保持Application.mk统一配置。
在K1平台上,运行结果是:
./Test_Neon
Feature: 0x7ff
FIR Filter benchmark:
C version
: 117.865
ms
Neon version : 26.5964 ms
(x4.43163 faster)
显示对FIR Filter,使用Neon有4倍速度。
2.
如何判断当前平台是否支持Neon指令集:
2.1: 字符界面查看:
cat /proc/cpuinfo
在Features 项目中,看是否包含neon.
2.2:编程查看:
利用上一节中编译出的libcpufeatures.so 库得到CPU Features信息:
if (android_getCpuFamily() != ANDROID_CPU_FAMILY_ARM) {
printf("\nCPU Not ARM\n");
return -1;
}
features = android_getCpuFeatures();
if ((features & ANDROID_CPU_ARM_FEATURE_ARMv7) == 0)
{
printf("\nNot an ARMv7 CPU !\n");
return -1;
}
if ((features &
ANDROID_CPU_ARM_FEATURE_NEON) == 0) {
printf("\nCPU doesn't support NEON !\n");
return -1;
}