Temporal position of extracted feature frames

Dear Mr. Kobayashi and Mr. Toda,
I found the following issue while rewriting sprocket in C++:
In the feature extractor, the hop size is given in milliseconds, while in the synthesizer it is given in samples. This results in imperfect alignment of the analysis and synthesis frames.

For example, for my file of 1047375 samples, pyworld extracts 9501 frames with a 5 ms frame shift, while MLSADF from pysptk would split the file into 9512 frames. This can smear transients, especially in very long audio files.
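
The two frame counts can be reproduced with plain arithmetic. The sketch below is only an illustration: the sampling rate (22050 Hz) and window length (1024 samples) are not stated in this issue and are assumed because they reproduce the reported numbers. The time-domain count follows the formula WORLD appears to use for its frame count, and the sample-domain count follows the frame_count formula from the workaround code in this issue.

```python
# Assumed values: fs and fftl are NOT given in the issue; they are picked
# because they reproduce the reported frame counts.
fs = 22050           # sampling rate in Hz (assumption)
fftl = 1024          # analysis window length in samples (assumption)
shiftms = 5          # frame shift in milliseconds (from the issue)
n_samples = 1047375  # file length in samples (from the issue)

# Time-domain grid: one frame every shiftms milliseconds of real time.
world_frames = int(1000.0 * n_samples / fs / shiftms) + 1

# Sample-domain grid: the hop is truncated to a whole number of samples.
exact_hop = fs / 1000 * shiftms   # about 110.25 samples
hop_samples = int(exact_hop)      # truncated to 110 samples

# Number of full windows that fit into the file at the integer hop.
sample_frames = (n_samples - fftl) // hop_samples

# Offset accumulated between the two grids by the last time-domain frame.
drift_samples = (world_frames - 1) * (exact_hop - hop_samples)
drift_ms = 1000.0 * drift_samples / fs

print(world_frames, hop_samples, sample_frames)
print(drift_samples, drift_ms)
```

Under these assumptions each frame loses a quarter of a sample, which adds up to about 2375 samples (roughly 108 ms) by the end of the file. That would explain why the effect grows with file length.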

Workarounds I found: since I only use DIFF_VC, I avoid pyworld entirely, slice the file into frames, and extract mcep myself.

In feature_extractor.py:

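# Needed at the top of feature_extractor.py
import numpy as np
import pysptk

# Inside the mcep analysis method; x, dim and alpha come from the enclosing scope
shiftl = int(self.fs / 1000 * self.shiftms)  # hop size in whole samples
frame_count = max(0, (len(x) - self.fftl) // shiftl)  # full windows only
_mcep = np.zeros([frame_count, dim + 1], dtype=np.float64)  # np.float was removed from NumPy
window_function = np.hanning(self.fftl)
for i in range(frame_count):
    frame_pos = i * shiftl  # frame start, on the same sample grid as the synthesizer
    frame = x[frame_pos:frame_pos + self.fftl]
    if len(frame) == self.fftl:
        _mcep[i] = pysptk.mcep(frame * window_function, dim, alpha)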

When using both VC and DIFF_VC, I recommend saving the time_axis extracted by pyworld, converting it to sample positions, and then using those positions in the synthesizer. The drawback is that the high-level synthesis interface can no longer be used.
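
The conversion from time_axis to sample positions could look roughly like the sketch below. It assumes that time_axis holds frame times in seconds, as pyworld returns them. The helper name time_axis_to_sample_positions is hypothetical, the 22050 Hz sampling rate is an assumption, and the time_axis here is a synthetic stand-in rather than real pyworld output. How the positions are then fed to the synthesizer is not shown.

```python
import numpy as np


def time_axis_to_sample_positions(time_axis, fs):
    """Map frame times in seconds to the nearest integer sample index.

    Hypothetical helper, not part of sprocket, pyworld or pysptk.
    """
    times = np.asarray(time_axis, dtype=np.float64)
    return np.round(times * fs).astype(np.int64)


fs = 22050   # sampling rate in Hz (assumption, not stated in the issue)
shiftms = 5  # frame shift in milliseconds (from the issue)

# Synthetic stand-in for the time_axis returned by pyworld: 9501 frames, 5 ms apart.
time_axis = np.arange(9501) * (shiftms / 1000.0)

positions = time_axis_to_sample_positions(time_axis, fs)
hops = np.diff(positions)  # per-frame hop in samples, no longer constant

print(positions[:5], positions[-1])
print(hops.min(), hops.max())
```

Because the positions follow the time grid, the hop between consecutive frames alternates between two neighbouring integer values instead of staying fixed, so the synthesizer has to accept a per-frame position rather than a single hop size.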

Hope this information is interesting to you. Thank you for your amazing work,

Best regards, Mart

Temporal position of extracted feature frames #129
