1 Introduction
With recent advances in the film and video industry, translation
of scripts has become increasingly important. Although dubbing
was often used in the past, translated subtitles have the advantage
of preserving the actors' original voices. However, subtitles usually
appear at a fixed position, for example at the bottom or the
right side of the screen. As a result, viewers can miss the flow
of the visual content, and their attention can be disturbed.
In this work, a new video subtitle system that uses word balloons
is presented. Word balloons are currently used only for still
images such as comics, and an automatic word-balloon system for images
has been presented [Chun et al. 2006]. Therefore, a system for
video word balloons is required that preserves the content
of the video and conveys the characters' words efficiently. Our goal
is to superimpose word balloons containing the script over a video rather
than using general captions. To do this, we must decide where to place
the word balloons on the screen. The most important task is to determine the
optimal position of each word balloon. In addition, the size, the horizontal
and vertical proportions, and the motion of the word balloons should
be determined reasonably. An automatic face-script mapping system,
which employs machine learning on faces and voices, has been
developed to minimize user interaction.
Figure 2: User authoring interface.
2 Description
The user provides video data and a caption file as input.
A word balloon should appear at the same time as its corresponding
caption text. The captions are simply ordered by their times in the
caption file, but there is no information about who speaks each line.
Therefore, additional user input is required. Having the user map every
caption to a character would be tedious and labor intensive, so in our
system user interaction is required only for a small part of the entire video.
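For illustration, a minimal sketch of loading such time-ordered captions is shown below, assuming an SRT-style subtitle file; the field names and parsing details are our own assumptions and are not specified by the original system.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Caption:
    start: float            # start time in seconds
    end: float              # end time in seconds
    text: str
    speaker: Optional[str] = None   # unknown until mapped to a face

_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def _to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, _TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def load_srt(path: str) -> list:
    """Read an SRT file and return captions ordered by start time."""
    captions = []
    blocks = open(path, encoding="utf-8").read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        start, _, end = lines[1].partition(" --> ")
        captions.append(Caption(_to_seconds(start), _to_seconds(end),
                                " ".join(lines[2:])))
    return sorted(captions, key=lambda c: c.start)
```

The speaker field stays empty at this stage; it is filled in later by the user interaction and the learned classifier described next.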
From the given user input, we train a classifier that can
automatically assign a speaker to each caption. The video is
segmented into scenes based on its color distribution, and a subset
of these scenes is used to collect user input. The face regions of
each character are detected automatically by a face detection algorithm,
and the user maps each caption to the corresponding face.
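A possible sketch of these two preprocessing steps using OpenCV is given below; the histogram-difference threshold and the Haar-cascade face detector are illustrative choices of ours, since the paper does not name the exact algorithms.

```python
import cv2

def scene_boundaries(video_path: str, threshold: float = 0.5) -> list:
    """Mark frames where the HSV color histogram changes sharply as scene cuts."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms => scene cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame) -> list:
    """Return (x, y, w, h) boxes for faces detected in a single frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return list(_face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                               minNeighbors=5))
```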
These mapping data are fed into the learning system to generate a classifier.
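As one possible realization (the paper does not name the learner, and we sketch only the face side of the face-and-voice learning), the user-labeled examples could be used to train a standard classifier on face features, e.g. with scikit-learn:

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_speaker_classifier(face_features, speaker_labels):
    """Fit a classifier mapping face feature vectors to character labels.

    face_features : array of shape (n_labeled_faces, n_features), e.g.
        flattened face crops or embeddings (our assumption).
    speaker_labels : character name for each labeled face, supplied by
        the user during the interactive mapping step.
    """
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(face_features, speaker_labels)
    return clf

# Captions in the remaining, unlabeled scenes can then be assigned to the
# most probable speaker:
#   speaker = clf.predict([features_of_face_visible_at_caption_time])[0]
```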
With the caption-to-speaker mapping and the detected face positions,
our system computes an optimized position for the word balloon of each
caption. The optimization function seeks a position that is close to the
corresponding speaker's face and does not overlap any other face or any
important object in the scene. The Nelder-Mead simplex method is used to
find an approximately optimal word-balloon position.
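The paper names only the Nelder-Mead simplex method; a minimal sketch of such a search, with an illustrative penalty formulation of our own (distance to the speaker, overlap with other faces, and an off-screen penalty), could look as follows:

```python
import numpy as np
from scipy.optimize import minimize

def balloon_cost(pos, speaker_face, other_faces, balloon_size, frame_size):
    """Penalize balloons that are far from the speaker, cover other faces,
    or leave the frame. All boxes are (x, y, w, h); pos is the balloon center."""
    x, y = pos
    bw, bh = balloon_size
    fw, fh = frame_size
    sx, sy, sw, sh = speaker_face
    # Distance from the balloon center to the speaker's face center.
    cost = np.hypot(x - (sx + sw / 2), y - (sy + sh / 2))
    # Heavy penalty for overlapping any other character's face.
    for ox, oy, ow, oh in other_faces:
        overlap_w = max(0.0, min(x + bw / 2, ox + ow) - max(x - bw / 2, ox))
        overlap_h = max(0.0, min(y + bh / 2, oy + oh) - max(y - bh / 2, oy))
        cost += 10.0 * overlap_w * overlap_h
    # Penalty for leaving the frame.
    if x - bw / 2 < 0 or y - bh / 2 < 0 or x + bw / 2 > fw or y + bh / 2 > fh:
        cost += 1e4
    return cost

def place_balloon(speaker_face, other_faces, balloon_size, frame_size):
    sx, sy, sw, sh = speaker_face
    start = np.array([sx + sw / 2, sy - sh])   # initial guess: above the face
    res = minimize(balloon_cost, start, method="Nelder-Mead",
                   args=(speaker_face, other_faces, balloon_size, frame_size))
    return res.x   # approximate optimal balloon center (x, y)
```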
3 Results and Future Work
Figure 1 shows an example of the results of our system. The user
loads a video file and a caption file into the authoring system,
as shown in Figure 2. The time required by our word-balloon authoring
system depends on the resolution and length of the video.
Our system processes one frame per second during preprocessing
at a resolution of 632x352, and the performance naturally
drops for frames of higher resolution. The preprocessing consists of
face detection and user interaction, with most of the time spent
on face detection. We plan to demonstrate the utility of our system through
further experiments and a user study in the future.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MEST) (No. 2011-0028568).