Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[src] Lots of changes: first stab at kaldi10 (non-compatible version of kaldi) #3083

Open
wants to merge 176 commits into
base: master
Choose a base branch
from

Conversation

danpovey
Copy link
Contributor

The name of this may change (e.g. could be kaldi2, or more likely kaldi 6.0).

What this is is a "cleaned up" version of Kaldi where we abandon any notion of back
compatibility with previous versions, remove a lot of little-used functionality, and
fix a lot of mistakes that were made in the past. The idea is to have a "big bang"
to do all the things that we wanted to do in the past, or considered doing, but were
prevented from doing by back compatibility considerations.

Note: this PR will not compile, the code needs a lot of work, it's just to show the direction.

Some of the changes included here are:

  • Reworking of how topologies and HMM transition probabilities are dealt with to make the transition probabilities non-trainable, and have an FSA-based and not HMM-based framework for the topology; this leads to substantial simplifications in graph-related code.
  • Substantially simplify the feature extraction code; remove weird edge effects on num-frames that we inherited from HTK; remove unnecessary, strange or problematic feature extraction options such as dithering, preemphasis and liftering; remove support for little-used feature extraction options.
  • Remove nnet1; nnet2; sgmm; regression-tree (f)MLLR (unless there is an outcry); fmpe. Eventually I'd like to remove nnet3 -- replacement TBD -- but it will be needed at least for testing in the meantime.

I need help on this with the files topology.cc and transitions.cc, to implement what I sketched out in the headers; and to get everything to compile.

The biggest change that I would like to make (and we should probably get the existing changes tested first), is a major refactoring of the Vector and Matrix and related classes, and the addition of an ndarray-type or Tensor-type class to handle tensors, that would be similar to NumPy and PyTorch. Some of the changes I would like to make include:

  • Enable a stride in vectors and a column-stride in matrices (this is supported by the underlying BLAS, so should not be very hard).
  • Possibly, via templates, add support for other types such as ints and float16's. For int support, our templated array types would replace our existing CUDA-int-array thing in cu-array.h. To avoid needing to implement everything for all types, we would only implement type/operation combinations as needed.
  • Change how CuMatrix/CuVector work, so that they are not used by users directly unless they know for sure that CUDA is being used. For the "might-be-CPU, might-be-CUDA" scenario, there would be a more generic class with a "device" (and maybe also a dtype), like a PyTorch array.

For the implementation of the Tensor class, we would try wherever possible to leverage existing code from the Vector/Matrix classes, i.e. those types would be used in the implementation where possible. This (with suitable loops) will cover all the common cases, and we can optimize things whenever we notice a bottleneck.

It would not be very hard to add automatic differentiation capability to these things by following the PyTorch/ArrayFire model; it's really not that many lines of code, thanks to C++11 magic. I would need help with the Python wrapping, but by closely following PyTorch, it wouldn't be that much work.

I looked into maybe using PyTorch's internal mechanisms for our own array stuff, but decided that it's too hard to compile, too poorly documented, and in too rapid a state of flux for us to rely on it right now. But we will definitely aim for interoperability with it as much as we can.

I do not rule out major changes to the scripting, the I/O or even to the whole approach to how Kaldi is used as a library, either. The idea is that we have a 'window' of maybe a few months where major changes are permitted, and back compatibility is not a concern (like the movie `The Purge', except for much longer).

@dogancan
Copy link
Contributor

All of these planned changes sound great!

Krombopulos Michael

@kkm000
Copy link
Contributor

kkm000 commented Mar 11, 2019

I'd also normalize audio to [-1.0, 1.0]. The HTK-inherited 65536.0 multiplier is rather baffling to DSP folk.

@danpovey
Copy link
Contributor Author

danpovey commented Mar 11, 2019 via email

@kkm000
Copy link
Contributor

kkm000 commented Mar 11, 2019

And maybe it's time to use more of c++11? Such things as the unique_ptr is a super-win! The thing with it is one can adopt a convention as to who owns the pointer: if you see a T * anywhere, it's an unowned pointer, and a unique_ptr<T> (or share_ptr where it needs to be shared) represents an owned pointer. This is what I always do in my code, and it really works, and does not cause much headache. I know that it was always frustrating when you try to e. g. add to a vector of pointers and get a "deleted constructor" on a unique pointer class. This never happens in modern c++1x (the emplace_*() methods and perfect forwarding take care of this) It's has a very consistent stdc++ library, made consistent and with the language user in mind. E.g. pre-final standard made a mess as to what constructors are deleted if you declare an argumentful constructor on a class. Final c+11 solved that in an very natural way. And you rarely if ever need to mess with templates to generalize stuff over the standard library, as you had with e. g. STL adapter classes; it just works, surprisingly. They went a long way to make a single consistent thing of both the language and the library so they work smoothly together. I may sound a bit gung-ho, but I am totally a fan!

Also, it's certainly time to get rid of system-dependent threads, mutexes, timers and rand(). c++ memory model is certainly sufficient not only for multithreading, but also for advanced HPC stuff, such as handling lock-free multi-core access to data.

To give you a baseline, when I started our new prod server (based on Kaldi decoding pathway), I used only standard c++11 features, and compiled it in Windows and Linux (with separate small system-dependent directories, for stuff like logging and daemonizing). This was back at the time when I had only VS 2013, and its support for c++11 was quite buggy; I removed all workaround since then. So it really works, and entirely readable.

And if I had a say, I'd definitely got rid of the explicit sesquipedalian iterator type names in favor of auto! IMHO, they rather subtract from readability than add to it. I'll be happy to write a small tutorial for the readers of the code (like grads and library adopters) that are unfamiliar with the new features. But there are probably fewer fresh-backed grads these days anyway who are familiar with C++ but not the 11.

@kkm000
Copy link
Contributor

kkm000 commented Mar 11, 2019

Do you mind if I create a GitHub "project" for this development? I never played with them, but they are supposedly helpful in separating the normal Kaldi activity from the entirely new development. Try and see? I tried labeling the issues, but that does not make them amazingly visible at a glance.

@danpovey
Copy link
Contributor Author

danpovey commented Mar 11, 2019 via email

@kkm000
Copy link
Contributor

kkm000 commented Mar 11, 2019

There is a global _RandMutex in Kaldi. I do not remember the file, somewhere in base/. Multitreaded code uses kaldi::RandomState instead, which would be much cleaner if it were a class with methods.

c++11 has also different random distributions, but I would not touch that without running serious statistical A/B tests vs existing code.

I'll get rid of most system-dependent stuff, it's no longer needed with modern C++.

BTW, the reason we use Log() and Exp() instead of log and exp is to work around a performance issue with single-precision log (IIRC) on a particular version of clib.. I'm not sure how common that version still is.

Ah, you reminded me of another thing to fix... maybe. Or retire, I'm really unsure. While doing my various runs of Kaldi w/different compilers, I noticed that on a fast CPU the result of the exp/expf speed test comes out rather randomly, even with the same compier. The Moore's law hit us: the test makes too few iterations to be significant.

The larger problem with it, though, is that the modern CPU's clock varies wildly depending on its load. The timing of 2 short identical loops differ: the first loop runs slower if ran on an idle CPU. A proper test would secure core affinity, then 100%-loaded it for at least 1ms or so, so the CPU notices and bump the clock up, and only then would run the actual comparison loops. And... is it still worth it?

On a machine with the intel_pstate governor, idle machine:

$ for f in `seq 5`; do sleep 1; cpupower frequency-info | grep 'call to kernel'; done
  current CPU frequency: 1.25 GHz (asserted by call to kernel)
  current CPU frequency: 1.25 GHz (asserted by call to kernel)
  current CPU frequency: 1.24 GHz (asserted by call to kernel)
  current CPU frequency: 1.29 GHz (asserted by call to kernel)
  current CPU frequency: 1.30 GHz (asserted by call to kernel)

Hot (last line of the same loop, 500 iterations w/o the sleep)

$ { for f in `seq 500`; do cpupower frequency-info | grep 'call to kernel'; done; } | tail -1
  current CPU frequency: 4.00 GHz (asserted by call to kernel)

What fascinates me the most about the modern CPU is not so much it's pipelined architecture (that's been invented half a century ago), but rather it's clock and power management circuitry.

@sikoried
Copy link
Contributor

sikoried commented Mar 11, 2019

  • It would be great to have an official library target to simplify the use in custom projects and to minimize the needs for custom forks (which can be a pain to maintain).
  • I share kkm's view on consistent use of c++1x, in particular w.r.t. to the pointers and the system-dependent things (threads, ...)
  • As for stripping out functionality: In principle, I think that's a good idea (people can still go back to certain versions if they need to). It would be great to have a document on "what's been in kaldi at some point (revision hashes) and why was it taken out", maybe even linked to papers that lead to the findings.
  • Re python bindings: I know everyone is onto python and jupyter these days, but I wonder if it's necessarily the best option out there. Maybe consider Lua?

edit: removed comment on bazel and moved it over to the resp. thread.

@KarelVesely84
Copy link
Contributor

Hello, well, this really is a big change.

And it is not easy to accept that 'nnet1' should be removed given that I implemented it. But nevertheless, it was fun working on it back in 2011-2013, and it did a good job for a long period of time.
The feature extraction with HTK-compatibility was a good thing too, and maybe it would be still possible to optimize the code better if there are performance bottlenecks (one random number generator per thread seeded by global one, or similar).

On the other hand it is understandable that the backward compatibility over a long time brings many severe limitations. And in the long term, changes need to be made not to stay frozen in the past...

I hope this was not too pesimistic, and yes, let the c++11 rule! (or maybe c++17?)

@KarelVesely84
Copy link
Contributor

Other possible issue is that python 2.7 is coming to its end of life. And many library
developers already committed not to produce new versions for python 2.7 anymore,
including NumPy. So far we are good, but in the long term, it is quite possible that
problems will come...

@jtrmal
Copy link
Contributor

jtrmal commented Mar 11, 2019 via email

@danpovey
Copy link
Contributor Author

danpovey commented Mar 11, 2019 via email

@hhadian
Copy link
Contributor

hhadian commented Mar 12, 2019

If you are going to update the scripts too in kaldi10, I think it's a good idea to move all the shared scripts (e.g. steps and utils) to /scripts.

@danpovey
Copy link
Contributor Author

danpovey commented Mar 12, 2019 via email

@kkm000
Copy link
Contributor

kkm000 commented Mar 13, 2019

@danpovey

For Kaldi's own binaries, I think it's nice to retain the ability to only link against subsets of things, assuming it would be possible to keep that but also make a "fat" library.

Amen to that!

I will be using things like shared_ptr when I reorganize the matrix/vector library.

My own treatment of the shared_ptr is when I think need it, I always start by looking for a way out to write code such that I don't. They have their niche, but IMO it's one of those stdc++ classes which are the easiest to misuse with the best intentions. Please be careful with it.

And I'll be happy to tie the loose ends in the base/ library. There are some, in Windows-specific code, mainly.

@vesis84

c++11 rule! (or maybe c++17?)

Among many useful features, stdc++17 adds a very efficient and lightweight std::string_view that I bewailed not having access to in the recent logging code rework. But this is the area where I'd be rather conservative than sorry. Besides, even CUDA 10 does not allow c++17 in nvcc at the moment. only 14.

@hhadian

I think it's a good idea to move all the shared scripts (e.g. steps and utils) to /scripts.

The current location under wsj/ betrays the long legacy, but if only we could think of a natural way to keep the scripts under /egs/X/Y/{utils,steps}! I'm sure there is so much code, inside and out of Kaldi codebase, that depends on relative paths to these.

@vdp
Copy link
Contributor

vdp commented Mar 13, 2019

I will be using things like shared_ptr when I reorganize the matrix/vector library.

My own treatment of the shared_ptr is when I think need it, I always start by looking for a way out to write code such that I don't. They have their niche, but IMO it's one of those stdc++ classes which are the easiest to misuse with the best intentions. Please be careful with it.

I'm not by any means a "modern C++" expert, but I second Kirill's warning about shared_ptr. As I think it was already mentioned some time ago, shared_ptr has hidden cost not only because of reference counting but also, and perhaps more importantly, due to its safety guarantees when used in multithreaded code. To quote a part of an answer from this SO thread:

If you want to learn more you can watch Nicolai M. Josuttis good talk about "The Real Price of Shared Pointers in C++" https://vimeo.com/131189627
It goes deep into the implementation details and CPU architecture for write barriers, atomic locks etc. once listening you will never talk about this feature being cheap. If you just want a proof of the magnitude slower, skip the first 48 minutes and watch him running example code which runs up to 180 times slower (compiled with -O3) when using shared pointer everywhere.

That's probably the reason why people talk about using alternatives for single threaded code.
Again, I'm hardly an authority on this, but I think it might be a good idea to confine the shared_ptr use to the parts of the codebase where it's really needed, and then only in the "outer loops" i.e. outside of code that must be really fast.

@galv
Copy link
Contributor

galv commented Mar 13, 2019 via email

@danpovey
Copy link
Contributor Author

@desh2608 can you please merge changes from master into kaldi10, resolving these merge conflicts and make a PR with those changes to the kaldi10 branch? (Now and whenever there is a conflict)... ask someone if you don't know how to resolve git conflicts.

@danpovey
Copy link
Contributor Author

FYI, about shared_ptr, I was actually going to use it pretty heavily in the new tensor stuff, as I was broadly following the design of flashlight which does use it.
PyTorch seems to use their own adaptation of boost intrusive_ptr, but (I think) modifed to support weak refcounts. They do use std::atomic to do synchronization, so it's not very cheap. I don't believe the advantages of intrusive pointers outweigh the added hassle of using them here, but it might be nice to have a version of std::shared_ptr that doesn't do the atomic stuff (because we'll mostly be using this stuff single threaded for now)... it could be made thread-safe using a compiler or runtime option maybe. Not sure what you guys think. For now I'll just draft the code with shared_ptr, as something like that is really needed to keep the complexity manageable.

@desh2608
Copy link
Contributor

@danpovey Done. #3105

* [src] Change warp-synchronous to cub::BlockReduce (safer but slower) (#3080)

* [src] Fix && and || uses where & and | intended, and other weird errors (#3087)

* [build] Some fixes to Makefiles (#3088)

clang is unhappy with '-rdynamic' in compile-only step, and the
switch is really unnecessary.

Also, the default location for MKL 64-bit libraries is intel64/.
The em64t/ was explained already obsolete by an Intel rep in 2010:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/285973

* [src] Fixed -Wreordered warnings in feat (#3090)

* [egs] Replace bc with perl -e (#3093)

* [scripts] Fix python3 compatibility issue in data-perturbing script (#3084)

* [doc] fix some typos in doc. (#3097)

* [build] Make sure expf() speed probe times sensibly (#3089)

* [scripts] Make sure merge_targets.py works in python3 (#3094)

* [src] ifdef to fix compilation failure on CUDA 8 and earlier (#3103)

* [doc] fix typos and broken links in doc. (#3102)

* [scripts] Fix frame_shift bug in egs/swbd/s5c/local/score_sclite_conf.sh (#3104)
@vdp
Copy link
Contributor

vdp commented Mar 14, 2019

Not sure what you guys think.

Well, as I said I am not an expert in this... I hope Kirill, or someone else, has more experience and will be able to give a more informed opinion about the possible alternatives. My understanding is that the pointer dereferencing itself is low-cost, so I guess it all depends on how exactly you will be using the shared_ptrs. If, for example, the object creation/pass-by-value etc is limited to some sort of bookkeeping code, outside the really "hot" parts, then it probably will not be very costly.
I wonder if you could add a level of indirection here, using a typedef or something like that, so that after you finish prototyping the code, you could easily swap implementations in order to benchmark performance, test thread-safety etc.

@KarelVesely84
Copy link
Contributor

It is interesting, how much can one learn from following this thread...
I am thinking what to do with 'nnet1'. I'd like to keep it in some way, probably as a separate project.

And along the way, I am thinking how to experiment with splitting Kaldi into several git 'submodules', which could be stored as separate '--orphan' branches. I can imagine there could be a submodule for basic functionality 'src/base, src/matrix', another for 'src/lattice, src/decoder' and yet another for 'src/nnet' and another for 'script/' and another for 'egs/'. The modules could be activated/deactivated on demand by a script.

Like this 'kaldi' could be more modular and usable as a collection of loosely coupled libraries. Say if someone wants to use matrix library, but does not want to install whole project, he could associate the matrix module.

However, as it usually is, this flexibility would come for a price. Operating the repo would be more confusing for the inexperienced, as different git commands are used to operate submodules. Next, it would be more difficult to checkout the status of the whole project to a specific date back in time, as each submodule would have its own 'SHA1' code.

Some reading: https://github.blog/2016-02-01-working-with-submodules/

Do you have some opinion on this?

@KarelVesely84
Copy link
Contributor

Aha, it seems that going back it time with 'main' module and sub-modules is not difficult:
git checkout
git submodule update

And it seems that the submodules can also import the git-history from the original place.

@danpovey
Copy link
Contributor Author

danpovey commented Mar 14, 2019 via email

@galv
Copy link
Contributor

galv commented Mar 14, 2019 via email

danpovey and others added 2 commits March 15, 2019 15:55
* [src] Change warp-synchronous to cub::BlockReduce (safer but slower) (#3080)

* [src] Fix && and || uses where & and | intended, and other weird errors (#3087)

* [build] Some fixes to Makefiles (#3088)

clang is unhappy with '-rdynamic' in compile-only step, and the
switch is really unnecessary.

Also, the default location for MKL 64-bit libraries is intel64/.
The em64t/ was explained already obsolete by an Intel rep in 2010:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/285973

* [src] Fixed -Wreordered warnings in feat (#3090)

* [egs] Replace bc with perl -e (#3093)

* [scripts] Fix python3 compatibility issue in data-perturbing script (#3084)

* [doc] fix some typos in doc. (#3097)

* [build] Make sure expf() speed probe times sensibly (#3089)

* [scripts] Make sure merge_targets.py works in python3 (#3094)

* [src] ifdef to fix compilation failure on CUDA 8 and earlier (#3103)

* [doc] fix typos and broken links in doc. (#3102)

* [scripts] Fix frame_shift bug in egs/swbd/s5c/local/score_sclite_conf.sh (#3104)

* [src] Fix wrong assertion failure in nnet3-am-compute (#3106)

* [src] Cosmetic changes to natural-gradient code (#3108)

* [src,scripts] Python2 compatibility fixes and code cleanup for nnet1 (#3113)

* [doc] Small documentation fixes; update on Kaldi history (#3031)

* [src] Various mostly-cosmetic changes (copying from another branch) (#3109)
@kkm000
Copy link
Contributor

kkm000 commented Mar 16, 2019

@vesis84

Do you have some opinion on [Git submodules]?

I do, and I must admit it's not a favorable one. Using submodules and doing it right every time is a sign of the genuine Git-fu mastery at a black-belt level. Essentially, every submodule is a repo on its own, with its own set of changed files, and developing in this setup requires much more effort: keeping track of what you are changing in the module, sending upstream patches separately, keeping your local changes unpublished until the patch is accepted to the submodule repo. pulling it and resolving conflicts if there were review changes. They help when you maintain your own little patches of the third-party code, but keep their release in sync with it, but this may become taxing if your patches are extensive, and a divergence grows.

This is all assuming you are grafting a third-party module into your tree; I cannot think of a good scenario using Git submodules in a single-owned project. This is not a casual thing; I believe the rule of thumb should be if you can avoid using submodules at the price of a lesser hassle than they are, or, even better, twice the hassle of what you think they are, do avoid them by all means--because you'll meet unintended consequences you could not even think of in the first place (unless you already have the black belt and can assess all these subtleties in advance). And since our contributors are of all the varying levels of Git proficiency, it will end up with an increased burden on the maintainers, realistically.

The nnet1 code is not going away from the Git history ever; you can always "go back in time" to get a working tree with it, and fork off from any past point, modernize it and use, should a need arise. I may be wrong, but I do not think you are actively developing it as of now. I'd think first of the objective value in having it available and maintained as part of the toolkit. I sincerely do not know the answer, this is not a rhetorical question at all: how likely it is that someone would select nnet1 over a CTC nnet3 model if they start a new recipe? Is there any practical reason for that? If there is a compelling general interest for the user, then it's a reason to keep and maintain it. If that's unlikely at all, what is the point of maintaining it then. Our engineering trade is all about cost/benefit and rationally finding a nearly-the-best compromise, in the end.


I know perfectly that I'm an end stage incurable introvert, so the workings of my inner cogs won't transfer to everyone; this is not an unasked-for advice at all, please do not treat it this way. But I feel it maybe a fitting moment to share it, although it may feel unnecessarily personal, and I am sorry if it does. My own experience with chopping off my old code has been that it has already had served its purpose: I've learned from it and know more than I did before I made it. I may keep it for reference to remind me of the ideas and techniques that I may reuse, but its chief value lies with the fact (or, well, a conviction) that what I've worked out in the past have already added to my experience and made me a better hacker, and this is not going away, even when the bits themselves do—and this is what really matters. For me, these bits' best intended purpose had been already served. I wish I've learned CUDA as well as you have! : )

@naxingyu
Copy link
Contributor

nnet-get-feature-transform is required when computing preconditioning matrix in steps/libs/nnet3/train/frame_level_objf/common.py

@naxingyu
Copy link
Contributor

I added get feature transform to nnet3 and nnet3bin. And removed nnet3-am-train-transition from python libs. Now training runs. I'll see if the results are correct.

@naxingyu
Copy link
Contributor

@danpovey aishell results

CER Kaldi5.5 Kaldi10
mono 36.41 35.79
tri1 18.76 18.83
tri2 18.64 18.57
tri3a 17.04 16.88
tri4a 13.82 13.49
tri5a 12.12 12.05
nnet3/tdnn_sp 8.65 8.54
chain/tdnn_sp 7.48 7.38

@danpovey
Copy link
Contributor Author

danpovey commented Jul 22, 2019 via email

danpovey added 9 commits July 22, 2019 10:49
* [src] Fixes RE unusual topologies

* [src] Some fixes RE unusual topologies.
* [src] Fixes RE unusual topologies

* [src] Various feature-extraction changes/simplifications, made while syncing with kaldi10feat python code

* Some preliminary work on attention, saving a checkpoint

* [src] Rewrite to nnet-computation-graph code to fix graph-building bug

* [scripts] Add attention example

* [egs] Add attention example

* [scripts] Add use-relu option in attention.py, not used currently.
@stale
Copy link

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
@danpovey
Copy link
Contributor Author

@jtrmal do you know anything about this stale-bot?

@stale stale bot removed the stale Stale bot on the loose label Jun 19, 2020
@danpovey
Copy link
Contributor Author

Oh, Meixu @songmeixu set it up...

@jtrmal
Copy link
Contributor

jtrmal commented Jun 19, 2020 via email

@danpovey
Copy link
Contributor Author

danpovey commented Jun 19, 2020 via email

@jtrmal
Copy link
Contributor

jtrmal commented Jun 19, 2020 via email

@kkm000
Copy link
Contributor

kkm000 commented Jun 20, 2020 via email

@danpovey
Copy link
Contributor Author

danpovey commented Jun 20, 2020 via email

@jtrmal
Copy link
Contributor

jtrmal commented Jun 20, 2020 via email

@kkm000
Copy link
Contributor

kkm000 commented Jun 20, 2020 via email

@kkm000 kkm000 added the stale-exclude Stale bot ignore this issue label Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale-exclude Stale bot ignore this issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.