1 Introduction
With recent advances in the film and video industry, translation
of scripts has become increasingly important. Although dubbing
was often used in the past, translated subtitles have the advantage
of preserving the actors' original voices. However, subtitles usually
appear at a fixed position, for example at the bottom or the
right side of the screen. As a result, viewers can miss the flow
of the visual content, and their attention can be disturbed.
In this work, a new video subtitle system that uses word balloons
is presented. Word balloons are currently used only for still
images such as comics, and an automatic word-balloon system for images
has been presented [Chun et al. 2006]. Therefore, a system for
video word balloons is required that preserves the content
of the video and conveys the characters' words efficiently. Our goal
is to superimpose word balloons containing the script over a video rather
than using general captions. To do this, we must decide where to place
the word balloons on the screen. The most important task is to determine the
optimal position of each word balloon. In addition, the size, the horizontal
and vertical proportions, and the motion of the word balloons should
be determined reasonably. An automatic face-script mapping system,
which employs machine learning on faces and voices, has been
developed to minimize user interaction.
Figure 2: User authoring interface.
2 Description
The user provides video data and a caption file as input.
A word balloon should appear at the same time as its corresponding
caption text. The captions are simply ordered by their times in the
caption file, but there is no information about who speaks each line.
Therefore, additional user input is required. Having the user map every
caption to a character would be tedious and labor intensive, so in our
system user interaction is required only for a small part of the entire video.
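For illustration, a minimal sketch of loading such time-ordered captions is shown below, assuming an SRT-style subtitle file; the field names and parsing details are our own assumptions and are not specified by the original system.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Caption:
    start: float            # start time in seconds
    end: float              # end time in seconds
    text: str
    speaker: Optional[str] = None   # unknown until mapped to a face

_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def _to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, _TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def load_srt(path: str) -> list:
    """Read an SRT file and return captions ordered by start time."""
    captions = []
    blocks = open(path, encoding="utf-8").read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        start, _, end = lines[1].partition(" --> ")
        captions.append(Caption(_to_seconds(start), _to_seconds(end),
                                " ".join(lines[2:])))
    return sorted(captions, key=lambda c: c.start)
```

The speaker field stays empty at this stage; it is filled in later by the user interaction and the learned classifier described next.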
From the given user input, we train a classifier that can
automatically assign a speaker to each caption. The video is
segmented into scenes based on its color distribution, and a subset
of these scenes is used to collect user input. The face regions of
each character are detected automatically by a face detection algorithm,
and the user maps each caption to the corresponding face.
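A possible sketch of these two preprocessing steps using OpenCV is given below; the histogram-difference threshold and the Haar-cascade face detector are illustrative choices of ours, since the paper does not name the exact algorithms.

```python
import cv2

def scene_boundaries(video_path: str, threshold: float = 0.5) -> list:
    """Mark frames where the HSV color histogram changes sharply as scene cuts."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms => scene cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame) -> list:
    """Return (x, y, w, h) boxes for faces detected in a single frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return list(_face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                               minNeighbors=5))
```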
These mapping data are fed into the learning system to generate a classifier.
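As one possible realization (the paper does not name the learner, and we sketch only the face side of the face-and-voice learning), the user-labeled examples could be used to train a standard classifier on face features, e.g. with scikit-learn:

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_speaker_classifier(face_features, speaker_labels):
    """Fit a classifier mapping face feature vectors to character labels.

    face_features : array of shape (n_labeled_faces, n_features), e.g.
        flattened face crops or embeddings (our assumption).
    speaker_labels : character name for each labeled face, supplied by
        the user during the interactive mapping step.
    """
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(face_features, speaker_labels)
    return clf

# Captions in the remaining, unlabeled scenes can then be assigned to the
# most probable speaker:
#   speaker = clf.predict([features_of_face_visible_at_caption_time])[0]
```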
With the caption-to-speaker mapping and the detected face positions,
our system computes an optimized position for the word balloon of each
caption. The optimization function seeks a position that is close to the
corresponding speaker's face and does not overlap any other face or any
important object in the scene. The Nelder-Mead simplex method is used to
find an approximately optimal word-balloon position.
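The paper names only the Nelder-Mead simplex method; a minimal sketch of such a search, with an illustrative penalty formulation of our own (distance to the speaker, overlap with other faces, and an off-screen penalty), could look as follows:

```python
import numpy as np
from scipy.optimize import minimize

def balloon_cost(pos, speaker_face, other_faces, balloon_size, frame_size):
    """Penalize balloons that are far from the speaker, cover other faces,
    or leave the frame. All boxes are (x, y, w, h); pos is the balloon center."""
    x, y = pos
    bw, bh = balloon_size
    fw, fh = frame_size
    sx, sy, sw, sh = speaker_face
    # Distance from the balloon center to the speaker's face center.
    cost = np.hypot(x - (sx + sw / 2), y - (sy + sh / 2))
    # Heavy penalty for overlapping any other character's face.
    for ox, oy, ow, oh in other_faces:
        overlap_w = max(0.0, min(x + bw / 2, ox + ow) - max(x - bw / 2, ox))
        overlap_h = max(0.0, min(y + bh / 2, oy + oh) - max(y - bh / 2, oy))
        cost += 10.0 * overlap_w * overlap_h
    # Penalty for leaving the frame.
    if x - bw / 2 < 0 or y - bh / 2 < 0 or x + bw / 2 > fw or y + bh / 2 > fh:
        cost += 1e4
    return cost

def place_balloon(speaker_face, other_faces, balloon_size, frame_size):
    sx, sy, sw, sh = speaker_face
    start = np.array([sx + sw / 2, sy - sh])   # initial guess: above the face
    res = minimize(balloon_cost, start, method="Nelder-Mead",
                   args=(speaker_face, other_faces, balloon_size, frame_size))
    return res.x   # approximate optimal balloon center (x, y)
```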
3 Results and Future Work
Figure 1 shows an example of the results of our system. The user
loads a video file and a caption file into the authoring system,
as shown in Figure 2. The time required by our word-balloon authoring
system depends on the resolution and length of the video.
Our system processes one frame per second during preprocessing
at a resolution of 632x352, and the performance naturally
drops for frames of higher resolution. The preprocessing consists of
face detection and user interaction, with most of the time spent
on face detection. We plan to demonstrate the utility of our system through
further experiments and a user study in the future.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MEST) (No. 2011-0028568).