modules/zstd: Add support for decoding compressed blocks (DSLX part) #2230


Closed

Conversation


@wsipak wsipak commented May 21, 2025

This PR adds the DSLX parts of #1857, in order not to depend on cocotb-related Python packages.
It supersedes #1857.

This PR extends the ZstdDecoder with support for decoding compressed blocks.

The decoder is capable of decoding RAW and RLE literals, as well as sequences with predefined FSE tables.
A suite of DSLX tests was prepared, comprising unit tests of all underlying procs and an integration test.
The integration test, similarly to the one in #1654, first generates a random valid ZSTD frame with compressed blocks and the expected decoded output. The test data is then converted to a DSLX file (example) that is imported by the integration test file.
At the beginning of the test, the default FSE decoding tables are filled with the default distributions taken from RFC 8878, section 3.1.1.3.2.2 (Default Distributions). Next, the encoded frame is loaded into system memory and the decoder is configured through a set of CSRs to start the decoding process. The decoder then runs and writes the decoded frame back into the output buffer in system memory. Once it finishes, it sends a pulse on the notify channel signaling the end of decoding. The output of the decoder is compared against the decoding result from the reference library.
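To illustrate what "filling the FSE decoding tables" involves, here is a behavioral sketch (in Python, not DSLX) of the symbol-spreading step from RFC 8878: given a normalized count per symbol and an accuracy log, each table cell is assigned a symbol. The normalized distribution used in the example below is hypothetical, chosen only to keep the numbers small; it is not one of the RFC's predefined distributions.

```python
def fse_spread_symbols(normalized_counts, accuracy_log):
    """Assign a symbol to each of the 2**accuracy_log table cells."""
    table_size = 1 << accuracy_log
    table = [None] * table_size
    high_threshold = table_size - 1

    # Symbols with a "less than 1" probability (count == -1) each take
    # one cell at the end of the table.
    for symbol, count in enumerate(normalized_counts):
        if count == -1:
            table[high_threshold] = symbol
            high_threshold -= 1

    # Remaining symbols are spread over the table with a fixed stride
    # (odd, hence coprime with the power-of-two table size), so every
    # cell is visited exactly once.
    step = (table_size >> 1) + (table_size >> 3) + 3
    position = 0
    for symbol, count in enumerate(normalized_counts):
        for _ in range(max(count, 0)):
            table[position] = symbol
            position = (position + step) & (table_size - 1)
            while position > high_threshold:
                # Skip cells reserved for the low-probability symbols.
                position = (position + step) & (table_size - 1)
    return table


# Hypothetical distribution: counts sum to 31, plus one "-1" symbol,
# matching a table of 2**5 = 32 cells.
table = fse_spread_symbols([16, 8, 4, 2, 1, -1], 5)
```

Each symbol ends up occupying exactly `count` cells (one cell for a `-1` symbol), which is what makes the per-cell baseline/number-of-bits computation of the full FSE table possible afterwards.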

The PR introduces, among others:

  • SequenceDecoder - responsible for decoding the sequence sections of compressed blocks
  • FseDecoder - the core part of the SequenceDecoder
  • RefillingShiftBuffer - used for storing and outputting, in forward and backward fashion, an arbitrary number of bits required by the FSE decoder
  • LiteralsDecoder - capable of decoding RAW, RLE and Huffman-coded literals
  • HuffmanDecoder - used in decoding Huffman-coded literals. Decoded Huffman trees are then used to decode one or four Huffman-coded streams.
  • CommandConstructor - responsible for sending packets with decoded sequences and literals to the SequenceExecutor proc
  • RamMux and RamDemux - procs used for handling requests/responses to multiple memory models. The procs interface with 3 separate memory buffers for FSE decoding tables.
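As a rough illustration of the RefillingShiftBuffer's role, the Python sketch below (not the proc's actual interface; the class and method names are made up) models a bit buffer that shifts in whole bytes on demand while the consumer pops an arbitrary number of bits at a time, LSB-first. The real proc additionally supports backward consumption, which is omitted here for brevity.

```python
class RefillingShiftBuffer:
    """Behavioral model: refill from a byte stream, pop n bits at a time."""

    def __init__(self, data: bytes):
        self._data = data   # backing byte stream
        self._pos = 0       # index of the next byte to shift in
        self._bits = 0      # packed bits; LSB is the oldest bit
        self._count = 0     # number of valid bits currently held

    def _refill(self, needed: int) -> None:
        # Shift in whole bytes until at least `needed` bits are buffered.
        while self._count < needed:
            if self._pos >= len(self._data):
                raise EOFError("bitstream exhausted")
            self._bits |= self._data[self._pos] << self._count
            self._count += 8
            self._pos += 1

    def pop(self, n: int) -> int:
        # Output the oldest n bits, refilling first if necessary.
        self._refill(n)
        value = self._bits & ((1 << n) - 1)
        self._bits >>= n
        self._count -= n
        return value
```

The FSE decoder's reads are variable-width (the number of bits per state transition depends on the table cell), which is why the buffer must serve arbitrary `n` rather than fixed-size words.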

@proppy
Copy link
Member

proppy commented May 22, 2025

The PNR step of the action has been running for 7+ hours: https://github.com/google/xls/actions/runs/15171186417/job/42667677899?pr=2230 is that expected?

Member:

any reason we're removing this?

Contributor:

We modified the interface of the part responsible for decoding the frame header, and it was easier for us to extend the DSLX tests. This also made the tests consistent with the rest of the modules.

Member:

is that obsolete now w/ the new implementation?

Contributor:

The part responsible for checking the magic number has been moved to FrameHeaderDecoder:
https://github.com/google/xls/pull/2230/files#diff-09700fe24633d73f1e4217266bae21f08269db322cc896aec4b946ecf9842cc3R428

static const uint32_t random_frames_count = 100;
};

// Test `random_frames_count` instances of randomly generated valid
Member:

curious if that could have been done using fuzzing instead?

Contributor:

We tried using fuzzing, but the problem is that the control sections need to follow a specific structure and match the properties of the generated data. It might be possible to use fuzzing with some constraints on the frame, but there are a lot of rules to follow, which makes it complicated. In the end, it was easier to use a tool made for creating random test data that meets these requirements.

Member:

can we make all the verilog .sv files?


Done

Member:

should we bump the timeout?

Contributor:

After bumping the PR we noticed that some targets take longer than expected.
We are working on fixing that.


proppy commented May 23, 2025

The PNR step of the action has been running for 7+ hours: https://github.com/google/xls/actions/runs/15171186417/job/42667677899?pr=2230 is that expected?

@QuantamHD @mikesinouye is a 17h+ runtime expected for PnR w/ a block of this size, or are there some tweaks we should make to the config to speed things up?

@proppy (Member) May 23, 2025:

can you add:

# pytype binary, library

near the other load statements; that will help resolve the internal version of the Python rules.

Contributor:

fixed


Done

@mikesinouye (Contributor):

The PNR step of the action has been running for 7+ hours: https://github.com/google/xls/actions/runs/15171186417/job/42667677899?pr=2230 is that expected?

@QuantamHD @mikesinouye is a 17h+ runtime expected for PnR w/ a block of this size, or are there some tweaks we should make to the config to speed things up?

I notice that the target_die_utilization_percentage is quite low on many of these rules (5-10%). Core area can unexpectedly scale runtime even though the 'complexity' of the problem is easier with more free space. I know this at least affects gpl; cts is possible as well.

We've done a lot of work to speed things up since the current rules_hdl version, so I think this would be worth investigating after bumping again. We're still planning to do this soon. Everything but Qt is building with bazel in upstream OpenROAD, so we're almost there.


proppy commented May 29, 2025

Looks like there is an issue with executing some of the DSLX tests:

F0529 13:44:40.943623   21896 ast_utils.cc:568] Check failed: new_parent != nullptr node `ram::WriteReq<RAM_ADDR_WIDTH, WeightPreScanMetaDataSize(), u32:1>` had no function parent (`#[test_proc]
proc Prescan_test {
    type external_ram_addr = uN[RAM_ADDR_WIDTH];
    type external_ram_data = uN[RAM_ACCESS_WIDTH];
    type PrescanOut = WeightPreScanOutput;
    type ReadReq = ram::ReadReq<RAM_ADDR_WIDTH, RAM_NUM_PARTITIONS>;
    type ReadResp = ram::ReadResp<RAM_ACCESS_WIDTH>;
    type WriteReq = ram::WriteReq<RAM_ADDR_WIDTH, RAM_ACCESS_WIDTH, RAM_NUM_PARTITIONS>;
    type WriteResp = ram::WriteResp;
    type InternalReadReq = ram::ReadReq<RAM_ADDR_WIDTH, u32:1>;
    type InternalReadResp = ram::ReadResp<WeightPreScanMetaDataSize()>;
    type InternalWriteReq = ram::WriteReq<RAM_ADDR_WIDTH, WeightPreScanMetaDataSize(), u32:1>;
    type InternalWriteResp = ram::WriteResp;
    terminator: chan<bool> out;
    external_ram_req: chan<WriteReq> out;
    external_ram_resp: chan<WriteResp> in;
    start_prescan: chan<bool> out;
    prescan_response: chan<PrescanOut> in;
    config(terminator: chan<bool> out) {
        let (RAMExternalWriteReq_s, RAMExternalWriteReq_r) = chan<WriteReq>("Write_channel_req");
        let (RAMExternalWriteResp_s, RAMExternalWriteResp_r) = chan<WriteResp>("Write_channel_resp");
        let (RAMExternalReadReq_s, RAMExternalReadReq_r) = chan<ReadReq>("Read_channel_req");
        let (RAMExternalReadResp_s, RAMExternalReadResp_r) = chan<ReadResp>("Read_channel_resp");
        spawn ram::RamModel<RAM_ACCESS_WIDTH, RAM_SIZE, RAM_PARTITION_SIZE>(RAMExternalReadReq_r, RAMExternalReadResp_s, RAMExternalWriteReq_r, RAMExternalWriteResp_s);
        let (RAMInternalWriteReq_s, RAMInternalWriteReq_r) = chan<InternalWriteReq>("Internal_write_channel_req");
        let (RAMInternalWriteResp_s, RAMInternalWriteResp_r) = chan<InternalWriteResp>("Internal_write_channel_resp");
        let (RAMInternalReadReq_s, RAMInternalReadReq_r) = chan<InternalReadReq>("Internal_read_channel_req");
        let (RAMInternalReadResp_s, RAMInternalReadResp_r) = chan<InternalReadResp>("Internal_read_channel_resp");
        spawn ram::RamModel<WeightPreScanMetaDataSize(), RAM_SIZE, WeightPreScanMetaDataSize()>(RAMInternalReadReq_r, RAMInternalReadResp_s, RAMInternalWriteReq_r, RAMInternalWriteResp_s);
        let (PreScanStart_s, PreScanStart_r) = chan<bool>("Start_prescan");
        let (PreScanResponse_s, PreScanResponse_r) = chan<PrescanOut>("Start_prescan");
        spawn WeightPreScan(PreScanStart_r, RAMExternalReadReq_s, RAMExternalReadResp_r, PreScanResponse_s, RAMInternalReadReq_s, RAMInternalReadResp_r, RAMInternalWriteReq_s, RAMInternalWriteResp_r);
        (terminator, RAMExternalWriteReq_s, RAMExternalWriteResp_r, PreScanStart_s, PreScanResponse_r)
    }
    init {
        ()
    }
    next(state: ()) {
        let tok = join();
        let rand_state = random::rng_new(random::rng_deterministic_seed());
        for (i, rand_state) in range(u32:0, MAX_SYMBOL_COUNT / PARALLEL_ACCESS_WIDTH) {
            let (new_rand_state, data_to_send) = for (j, (rand_state, data_to_send)) in range(u32:0, PARALLEL_ACCESS_WIDTH) {
    @     0x55f9fae093fc  xls::dslx::DeduceCtx::DeduceAndResolve()
    @     0x55f9fac965a8  xls::dslx::(anonymous namespace)::DeduceVisitor::HandleLet()
    @     0x55f9fc198adb  xls::dslx::Let::Accept()
    @     0x55f9fac841e1  xls::dslx::Deduce()
    @     0x55f9fac80d45  std::__1::__function::__func<>::operator()()
    @     0x55f9fae085bc  xls::dslx::DeduceCtx::Deduce()
    @     0x55f9fac8c674  xls::dslx::(anonymous namespace)::DeduceVisitor::HandleStatement()
    @     0x55f9fc19893b  xls::dslx::Statement::Accept()
    @     0x55f9fac841e1  xls::dslx::Deduce()
    @     0x55f9fac80d45  std::__1::__function::__func<>::operator()()
    @     0x55f9fae085bc  xls::dslx::DeduceCtx::Deduce()
    @     0x55f9fac92b56  xls::dslx::(anonymous namespace)::DeduceVisitor::HandleStatementBlock()
    @     0x55f9fc1981cb  xls::dslx::StatementBlock::Accept()
    @     0x55f9fac841e1  xls::dslx::Deduce()
    @     0x55f9fac80d45  std::__1::__function::__func<>::operator()()
    @     0x55f9fae085bc  xls::dslx::DeduceCtx::Deduce()
    @     0x55f9fae093fc  xls::dslx::DeduceCtx::DeduceAndResolve()
    @     0x55f9fac81842  xls::dslx::TypecheckFunction()
    @     0x55f9fac7da70  std::__1::__variant_detail::__visitation::__base::__dispatcher<>::__dispatch[abi:ne180100]<>()
    @     0x55f9fac7b4ac  xls::dslx::typecheck_internal::TypecheckModuleMember()
    @     0x55f9fac7c0bf  xls::dslx::TypecheckModule()
    @     0x55f9fac7a9d6  xls::dslx::TypecheckModule()
    @     0x55f9fac79fbd  xls::dslx::ParseAndTypecheck()
    @     0x55f9fabe7628  xls::dslx::AbstractTestRunner::ParseAndTest()
    @     0x55f9fab75ca0  main
    @     0x7fe290a2a1ca  (unknown)
    @     0x7fe290a2a28b  __libc_start_main
/bin/bash: line 2: 21896 Aborted                 (core dumped) bazel-out/k8-opt-exec-ST-d57f47055a04/bin/xls/dslx/interpreter_main $file --compare=none --execute=false --dslx_path=:${PWD}:bazel-out/k8-opt/bin:bazel-out/k8-opt/bin::bazel-out/k8-opt/bin/ --warnings_as_errors=false
Error parsing and type checking DSLX source file: xls/modules/zstd/huffman_prescan.x

lpawelcz and others added 11 commits May 30, 2025 10:03
Remove references to buffer structs as those are not used anywhere

Signed-off-by: Pawel Czarnecki <[email protected]>
Co-authored-by: Pawel Czarnecki <[email protected]>
Co-authored-by: Robert Winkler <[email protected]>
Signed-off-by: Maciej Torhan <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Krzysztof Oblonczek <[email protected]>
rw1nkler and others added 18 commits May 30, 2025 10:03
Signed-off-by: Robert Winkler <[email protected]>
Signed-off-by: Pawel Czarnecki <[email protected]>
* Use materialize_internal_fifos when possible
* Disable the above option for procs with loopback channels
* Add missing module names
* Add xls_fifo_wrapper verilog dependency to the synthesis of the procs without materialized internal fifos

Signed-off-by: Pawel Czarnecki <[email protected]>
Co-authored-by: Wojciech Sipak <[email protected]>
No longer applicable - there are no CC tests in ZSTD module

Signed-off-by: Pawel Czarnecki <[email protected]>
Signed-off-by: Wojciech Sipak <[email protected]>
Improve formatting, wording and fix lint issues

Co-authored-by: Dominik Lau <[email protected]>
Co-authored-by: Szymon Gizler <[email protected]>
Signed-off-by: Wojciech Sipak <[email protected]>
Co-authored-by: Pawel Czarnecki <[email protected]>
Co-authored-by: Krzysztof Obłonczek <[email protected]>
Signed-off-by: Robert Winkler <[email protected]>
@wsipak wsipak force-pushed the zstd_compressed_block_dec_dslx_part branch from a6afa36 to a8497ec on May 30, 2025 11:51
Let's use predefined data first and only then add
zstd as dependency.

Signed-off-by: Wojciech Sipak <[email protected]>
@wsipak wsipak force-pushed the zstd_compressed_block_dec_dslx_part branch 3 times, most recently from fc26a47 to aab0556, on May 30, 2025 12:28
@mgielda mgielda force-pushed the zstd_compressed_block_dec_dslx_part branch from aab0556 to 8f93fcc on May 30, 2025 13:03
@rw1nkler rw1nkler force-pushed the zstd_compressed_block_dec_dslx_part branch from aab0556 to 54b5d01 on May 30, 2025 13:06

wsipak commented May 30, 2025

It seems like GitHub complains about me not being CLA-approved as the PR opener (even though I have been covered by the Google CLA as a commit author for ages). This may be due to some (new?) email privacy settings in GitHub, which I have now disabled, but the PR was opened prior to that. To check whether that was indeed the problem, I'm closing this PR and a follow-up PR should be opened shortly. Sorry for the inconvenience, but it seems like a GitHub limitation.

@wsipak wsipak closed this May 30, 2025
@rw1nkler (Contributor):

Moved to #2296

10 participants