TensorFlow Basics: TFRecord and Dataset

Author: Sam (甄峰) sam_code@hotmail.com

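0. Motivation:

In deep learning the training data is usually very large, so alongside more compute power, efficient data I/O matters just as much.

When training a model with TensorFlow, there are three ways to load data:

A. Prepare the data in Python code.

B. Preload the data and keep everything needed for training in memory as variables.

C. Read from files through an input pipeline.

When the dataset is small, batching, shuffling, padding and similar processing can be done in Python on numpy data. placeholder + feed_dict then feeds the result into the graph, where it becomes Tensor data.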
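The small-data path described above can be sketched in plain Python. This is only an illustration: make_batches and its parameters are hypothetical names, not part of TensorFlow, and plain lists stand in for numpy arrays.

```python
import random

def make_batches(samples, batch_size, pad_value=0, seed=None):
    """Shuffle variable-length samples, then yield batches padded to equal width."""
    order = list(samples)
    random.Random(seed).shuffle(order)              # shuffle
    for start in range(0, len(order), batch_size):  # batch
        batch = order[start:start + batch_size]
        width = max(len(s) for s in batch)
        # padding: extend every sample to the longest one in this batch
        yield [list(s) + [pad_value] * (width - len(s)) for s in batch]

data = [[1], [2, 3], [4, 5, 6], [7, 8], [9]]
batches = list(make_batches(data, batch_size=2, seed=0))
for b in batches:
    print(b)
```

Each yielded batch is what would be passed to the graph through feed_dict.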

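Once the data is far larger than memory, it has to stay on disk and be read as it is consumed, so the cost of moving, reading and processing it has to be considered. TensorFlow recommends its own TFRecords file format, which can speed up reading and save space.

A TFRecords file is stored in binary and is well suited to reading large volumes of data sequentially.

For training, a program can be written to convert ordinary training data into the TFRecords format.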

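1. Creating and writing a TFRecords file:

TFRecords stores features in a dictionary (dict). Each entry of the dictionary holds one feature, and all features of a sample go into the same dictionary.

The dictionary key is the feature name; the dictionary value is the feature content.

1.1: Storage types for feature values:

A feature value must be one of the feature types TensorFlow defines:

int64, float32, string.

A feature value can be built in the following ways.

tf.train.Feature(int64_list=tf.train.Int64List(value=input))

tf.train.Feature(float_list=tf.train.FloatList(value=input))

tf.train.Feature(bytes_list=tf.train.BytesList(value=input))

Note: the argument passed to value must always be a list.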
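As a small sketch of the three constructors above, the helper functions below wrap a list in the matching Feature type. The helper names (int64_feature, float_feature, bytes_feature) and the sample values are made up for illustration.

```python
import tensorflow as tf

def int64_feature(values):
    """Wrap a list of ints in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def float_feature(values):
    """Wrap a list of floats in a tf.train.Feature."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def bytes_feature(values):
    """Wrap a list of byte strings in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

# A scalar has to be wrapped in a list first; a sequence is passed as it is.
label = int64_feature([7])
weights = float_feature([0.5, 1.5])
name = bytes_feature([b"sample-0"])

print(list(label.int64_list.value))
print(list(weights.float_list.value))
print(list(name.bytes_list.value))
```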

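A TensorFlow Feature only accepts list data. How should a feature be handled when it is a matrix or an image?

There are two options:

A. Convert it to a list, by flattening the tensor into a one-dimensional list.

B. Convert it to a string, using .tostring().

Either way, the shape information is lost.

When the sample is written to the dictionary, an extra feature holding the shape of the converted data can be added, so that the data can be converted back when it is used.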
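The two conversions, and why the shape has to be stored next to the data, can be tried with numpy alone. This is a minimal sketch with a made-up 2x3 matrix. It calls tobytes(), which is the current numpy name for tostring(). Note that option B needs the original dtype as well as the shape to restore the array.

```python
import numpy as np

matrix = np.arange(6, dtype=np.float32).reshape(2, 3)
shape = list(matrix.shape)            # saved separately, like 'matrix_shape'

# Option A: flatten to a one-dimensional list of floats.
flat = matrix.reshape(-1).tolist()
restored_a = np.array(flat, dtype=np.float32).reshape(shape)

# Option B: dump the raw bytes of the array.
raw = matrix.tobytes()
restored_b = np.frombuffer(raw, dtype=np.float32).reshape(shape)

print(len(flat), len(raw))
print(np.array_equal(restored_a, matrix), np.array_equal(restored_b, matrix))
```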

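1.2: How a TFRecords file is created:

Combine the features of one sample into a dictionary, and assemble the sample into an Example object.

This object follows the protocol buffer format. Serialize the Example object to a string.

Finally, write it to the tfrecords file with tf.python_io.TFRecordWriter().

Example: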

import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
import matplotlib.image as mpimg

# print floats with 3 digits of precision
np.set_printoptions(precision=3)

# helper that prints dtype and shape, and optionally the first 3 samples
def display(alist, show=True):
    print('type:%s \nshape: %s' % (alist[0].dtype, alist[0].shape))
    if show:
        for i in range(3):
            print('sample %d is: %s' % (i, alist[i]))

scalars = np.array([1, 2, 3], dtype=np.int64)
print('\nscalars:')
display(scalars)

vectors = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.3, 0.3, 0.3]], dtype=np.float32)
print('\nvectors:')
display(vectors)

matrices = np.array([np.array((vectors[0], vectors[0])), np.array((vectors[1], vectors[1])), np.array((vectors[2], vectors[2]))], dtype=np.float32)
print('\nmatrices:')
display(matrices)

# shape of image: (h, w, 3)
img = mpimg.imread('1.jpg')
tensors = np.array([img, img, img])

# show image
print('\ntensors:')
display(tensors, show=False)
plt.imshow(img)

# open TFRecord file
writer = tf.python_io.TFRecordWriter('%s.tfrecord' % 'test')

# write 3 examples; each example has 4 features: scalar, vector, matrix, tensor
for i in range(3):
    # create dictionary
    features = {}

    # write scalar, type int64; "value=[scalars[i]]" turns it into a list
    features['scalar'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[scalars[i]]))

    # write vector, type float; it is already one-dimensional, so "value=vectors[i]"
    features['vector'] = tf.train.Feature(float_list=tf.train.FloatList(value=vectors[i]))

    # write matrix, type float, rank 2; tf.train.FloatList only takes a list, so flatten it
    features['matrix'] = tf.train.Feature(float_list=tf.train.FloatList(value=matrices[i].reshape(-1)))

    # flattening loses the shape, so save the shape as its own feature
    features['matrix_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=matrices[i].shape))

    # write tensor, rank 3; the other way is to convert it to a string
    # (tostring() is named tobytes() in newer numpy)
    features['tensor'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensors[i].tostring()]))

    # save shape
    features['tensor_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tensors[i].shape))

    # feed dictionary to tf.train.Features
    tf_features = tf.train.Features(feature=features)

    # get an example
    tf_example = tf.train.Example(features=tf_features)

    # serialize the example
    tf_serialized = tf_example.SerializeToString()

    # write
    writer.write(tf_serialized)

# close
writer.close()

Notes:

A. tf.train.Feature()

Builds one feature value. A feature accepts int64, float32 and string.

features['scalar'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[scalars[i]]))

Adds an entry to the features dictionary with the key scalar.

The value is of type int64. scalars[i] is a scalar while value only accepts a list, so square brackets are added to turn it into a list.

features['vector'] = tf.train.Feature(float_list=tf.train.FloatList(value=vectors[i]))

Adds an entry to the features dictionary with the key vector.

The value is of type float. vectors[i] is already a one-dimensional sequence, so no [] is added here.

features['matrix'] = tf.train.Feature(float_list=tf.train.FloatList(value=matrices[i].reshape(-1)))

features['matrix_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=matrices[i].shape))

Here the feature to add is a matrix, so it has to be converted to a list.

reshape(-1) flattens the data into a list, which means one more entry is needed to store its shape.

matrices[i].shape is also a sequence (a tuple), so no [] is added.

features['tensor'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensors[i].tostring()]))

features['tensor_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tensors[i].shape))

The other conversion (to a string), again with the shape added as a separate feature.

B. tf_features = tf.train.Features(feature=features)

Builds a tf.train.Features from the dictionary that holds the features.

C. Build the sample:

tf_example = tf.train.Example(features=tf_features)

Turns the features into a sample (Example).

D. Serialize the sample:

tf_serialized = tf_example.SerializeToString()

E. Write the sample:

writer.write(tf_serialized)

At this point the TFRecords file contains 3 Examples.
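Steps B to D are an ordinary protocol buffer round trip, which can be checked without writing a file. The sketch below uses two made-up features and decodes the serialized bytes again with FromString, the standard parsing method of protobuf message classes.

```python
import tensorflow as tf

features = {
    'scalar': tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
    'matrix_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[2, 3])),
}
example = tf.train.Example(features=tf.train.Features(feature=features))

# Serialization yields plain bytes; this is what writer.write() stores.
serialized = example.SerializeToString()

# Decoding the bytes gives back an equal Example message.
decoded = tf.train.Example.FromString(serialized)
scalar = decoded.features.feature['scalar'].int64_list.value[0]
shape = list(decoded.features.feature['matrix_shape'].int64_list.value)
print(scalar, shape)
```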

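2. Using Dataset:

Dataset (tf.data.Dataset) is an object that represents a data set.

A Dataset can import data directly from a list, or from TFRecord files (the main method).

With a Dataset, operations such as shuffle, batch, padding and epoch repetition can be applied directly, with no need to handle them in hand-written Python code.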
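A minimal sketch of these operations on a Dataset built from a list follows. It assumes a TensorFlow build that ships the tf.compat.v1 module (late 1.x or 2.x) and runs in graph mode with a Session, matching the rest of this article. The numbers 0 to 9 are placeholder data.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x style API, assumed to be available

with tf.Graph().as_default():
    dataset = tf1.data.Dataset.from_tensor_slices(list(range(10)))
    # shuffle the 10 elements, group them in batches of 4, run 2 epochs
    dataset = dataset.shuffle(buffer_size=10).batch(4).repeat(2)
    next_element = tf1.data.make_one_shot_iterator(dataset).get_next()

    batches = []
    with tf1.Session() as sess:
        try:
            while True:
                batches.append(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            pass  # raised once both epochs are exhausted

print([len(b) for b in batches])
```

Because batch comes before repeat here, each epoch ends with a shorter batch of 2 elements.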

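2.1: Creating a dataset:

dataset = tf.data.TFRecordDataset("test.tfrecord")

This imports data from the test.tfrecord file created above.

tf.data.TFRecordDataset() can open one or more tfrecord files and produces a single dataset.

Opening several tfrecord files:

Example: open every tfrecord file in a directory:

import os

# the directory that is listed and the path prefix must be the same directory
train_dir = 'train_file/'
train_files_names = os.listdir(train_dir)
train_files = [os.path.join(train_dir, item) for item in train_files_names]
dataset_train = tf.data.TFRecordDataset(train_files)

This way the data from several tfrecord files ends up in the same dataset.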
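os.listdir returns every entry in the directory, so any file that is not a TFRecord file would also be handed to TFRecordDataset. One possible way to narrow the list is a glob pattern. The helper name list_tfrecords and the temporary demo directory below are assumptions for illustration only.

```python
import glob
import os
import tempfile

def list_tfrecords(directory):
    """Return the sorted full paths of the *.tfrecord files directly inside directory."""
    return sorted(glob.glob(os.path.join(directory, '*.tfrecord')))

# Build a throwaway directory with two record files and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name in ('a.tfrecord', 'b.tfrecord', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'wb').close()

train_files = list_tfrecords(demo_dir)
print([os.path.basename(p) for p in train_files])
```

The resulting list can be passed to tf.data.TFRecordDataset in the same way as train_files above.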

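2.2: Parsing the data:

What is stored in a tfrecords file is a serialized Example, so the data has to be parsed before it can be used. Parsing is the inverse of the operations performed when writing.

2.2.1: The parsing function:

2.2.1.1: The sample parsing dictionary:

To parse an Example, its layout must be known first: which features it consists of and in what format each feature value was stored.

This is what the sample parsing dictionary is for. Its keys are the feature names, and its values describe how the corresponding feature is parsed.

There are two ways to parse a feature:

A. Fixed-length feature parsing:

tf.FixedLenFeature(shape, dtype, default_value=None)

Used for features of fixed length. Once the data type and the shape are known, the length of the data is fixed.

The shape argument specifies the shape the feature data is parsed into, and dtype specifies the data type, one of tf.float32, tf.int64 and tf.string (matching the types allowed for feature values).

B. Variable-length feature parsing:

tf.VarLenFeature(dtype)

Used for features whose length varies between samples. The parsed result is a SparseTensor.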

The sample parsing dictionary:

dics = {'scalar': tf.FixedLenFeature(shape=(), dtype=tf.int64, default_value=None),

        # the shape given here also acts as a reshape when parsing, for example (3,) to (1,3)
        'vector': tf.FixedLenFeature(shape=(1,3), dtype=tf.float32),

        # VarLenFeature can be used as well, but it returns a SparseTensor
        'matrix': tf.VarLenFeature(tf.float32),
        'matrix_shape': tf.FixedLenFeature(shape=(2,), dtype=tf.int64),

        # the tensor was written with tostring(), so its shape is ()
        # first parse it as tf.string, then convert it back to its original type: tf.uint8
        'tensor': tf.FixedLenFeature(shape=(), dtype=tf.string),
        'tensor_shape': tf.FixedLenFeature(shape=(3,), dtype=tf.int64)}

The entries correspond one to one to what was written earlier.

For the entry with key vector, the data had shape (3,) when written and is changed to (1,3) when parsed.

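2.2.2: Parsing one Example:

tf.parse_single_example(
    serialized,
    features,
    name=None,
    example_names=None,
)

serialized: a single serialized Example.

features: the sample parsing dictionary.

Returns a dictionary whose keys are the feature names and whose values are the feature values. The format of each value is determined by the sample parsing dictionary.

2.2.3: First processing of the features:

Some data was converted to a string, and some was parsed as a variable-length feature (which yields a SparseTensor). Both have to be converted back, and the shape has to be restored as well.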
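The article does not show the body of the parsing function, so the following is only a sketch of what it could look like under a few assumptions: tf.compat.v1 is available (late TensorFlow 1.x or 2.x) and stands in for the tf.* names used above, a single example is written to a temporary file so the sketch is self-contained, and a small uint8 array stands in for the image.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x style API, assumed to be available

# Write one example with the same six features as in section 1.
matrix = np.arange(6, dtype=np.float32).reshape(2, 3)
image = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)  # stand-in for the image
feature = {
    'scalar': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    'vector': tf.train.Feature(float_list=tf.train.FloatList(value=[0.1, 0.1, 0.1])),
    'matrix': tf.train.Feature(float_list=tf.train.FloatList(value=matrix.reshape(-1).tolist())),
    'matrix_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(matrix.shape))),
    'tensor': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
    'tensor_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(image.shape))),
}
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')
with tf1.python_io.TFRecordWriter(path) as writer:
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

def parse_function(example_proto):
    """Parse one serialized Example and undo the conversions made when writing."""
    dics = {
        'scalar': tf1.FixedLenFeature(shape=(), dtype=tf.int64),
        'vector': tf1.FixedLenFeature(shape=(1, 3), dtype=tf.float32),
        'matrix': tf1.VarLenFeature(dtype=tf.float32),
        'matrix_shape': tf1.FixedLenFeature(shape=(2,), dtype=tf.int64),
        'tensor': tf1.FixedLenFeature(shape=(), dtype=tf.string),
        'tensor_shape': tf1.FixedLenFeature(shape=(3,), dtype=tf.int64),
    }
    parsed = tf1.parse_single_example(example_proto, dics)
    # VarLenFeature gives a SparseTensor: make it dense, then restore the shape.
    dense = tf1.sparse_tensor_to_dense(parsed['matrix'])
    parsed['matrix'] = tf.reshape(dense, parsed['matrix_shape'])
    # The string feature holds raw bytes: decode to uint8, then restore the shape.
    decoded = tf1.decode_raw(parsed['tensor'], tf.uint8)
    parsed['tensor'] = tf.reshape(decoded, parsed['tensor_shape'])
    return parsed

with tf.Graph().as_default():
    dataset = tf1.data.TFRecordDataset(path).map(parse_function)
    next_element = tf1.data.make_one_shot_iterator(dataset).get_next()
    with tf1.Session() as sess:
        out = sess.run(next_element)

print(out['scalar'], out['vector'].shape, out['matrix'].shape, out['tensor'].shape)
```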

Apply the parsing function with map to produce a new dataset whose elements have already been parsed.

new_dataset = dataset.map(parse_function)

Create an iterator that returns one Example at a time.

iterator = new_dataset.make_one_shot_iterator()

next_element = iterator.get_next()

Shuffling with the dataset:

shuffle_dataset = new_dataset.shuffle(buffer_size=10000)
iterator = shuffle_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

Batching:

batch_dataset = shuffle_dataset.batch(4)
iterator = batch_dataset.make_one_shot_iterator()
next_element = iterator.get_next()
