Increase node limits for ESP32 nodes with PSRAM #8097
h3lix1 wants to merge 13 commits into meshtastic:develop
Pull Request Overview
This PR introduces a hot/cold memory split architecture for ESP32-S3 devices to support tracking up to 800 nodes by moving NodeInfoLite payloads to PSRAM while keeping critical routing data in DRAM.
- Implements custom PSRAM allocator for ESP32-S3 that stores full NodeInfoLite objects (196B each) in external memory
- Creates NodeHotEntry cache in DRAM containing only essential fields (20B per node) for fast access during routing operations
- Widens counter types from uint8_t to uint16_t to handle node counts beyond 255
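A minimal sketch of the hot/cold idea summarized above, not the PR's actual code. `ColdNode`, `HotEntry`, `SplitNodeStore`, `allocCold`, and the flag constants are hypothetical names invented for illustration; the only real API used is ESP-IDF's `heap_caps_malloc`, and on non-ESP builds the sketch falls back to the normal heap so it can be compiled on a desktop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

#if defined(ESP_PLATFORM)
#include "esp_heap_caps.h"
#endif

// Hypothetical stand-in for the full per-node payload (the PR's is ~196 B).
struct ColdNode {
    uint32_t num;
    uint32_t last_heard;
    float snr;
    uint8_t channel;
    bool is_favorite;
    char long_name[40];
};

// Hypothetical hot entry: only the fields routing touches, kept in DRAM.
struct HotEntry {
    uint32_t num;
    uint32_t last_heard;
    float snr;
    uint8_t channel;
    uint8_t flags;
};
static_assert(sizeof(HotEntry) <= 20, "hot entry should stay small");

constexpr uint8_t FLAG_FAVORITE = 0x01;
constexpr uint8_t FLAG_DIRTY = 0x02; // hot copy is newer than the cold payload

// Cold storage comes from PSRAM on ESP-IDF builds, from the normal heap elsewhere.
inline ColdNode *allocCold(size_t count) {
#if defined(ESP_PLATFORM)
    void *p = heap_caps_malloc(count * sizeof(ColdNode), MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT);
#else
    void *p = std::malloc(count * sizeof(ColdNode));
#endif
    if (p) std::memset(p, 0, count * sizeof(ColdNode));
    return static_cast<ColdNode *>(p);
}

class SplitNodeStore {
  public:
    explicit SplitNodeStore(size_t capacity)
        : cold_(allocCold(capacity)), capacity_(cold_ ? capacity : 0) { hot_.reserve(capacity_); }
    ~SplitNodeStore() { std::free(cold_); }
    SplitNodeStore(const SplitNodeStore &) = delete;
    SplitNodeStore &operator=(const SplitNodeStore &) = delete;

    // Store the full payload cold and mirror the routing fields hot.
    bool add(const ColdNode &n) {
        if (hot_.size() >= capacity_) return false;
        cold_[hot_.size()] = n;
        hot_.push_back({n.num, n.last_heard, n.snr, n.channel,
                        static_cast<uint8_t>(n.is_favorite ? FLAG_FAVORITE : 0)});
        return true;
    }

    // Hot path: touches DRAM only and marks the entry for a later sync.
    void setFavorite(size_t i, bool fav) {
        HotEntry &h = hot_[i];
        h.flags = static_cast<uint8_t>(fav ? (h.flags | FLAG_FAVORITE) : (h.flags & ~FLAG_FAVORITE));
        h.flags |= FLAG_DIRTY;
    }

    // Cold path: flush pending hot changes, then hand out the full payload.
    const ColdNode &full(size_t i) {
        HotEntry &h = hot_[i];
        if (h.flags & FLAG_DIRTY) {
            cold_[i].last_heard = h.last_heard;
            cold_[i].snr = h.snr;
            cold_[i].channel = h.channel;
            cold_[i].is_favorite = (h.flags & FLAG_FAVORITE) != 0;
            h.flags &= static_cast<uint8_t>(~FLAG_DIRTY);
        }
        return cold_[i];
    }

    bool isDirty(size_t i) const { return (hot_[i].flags & FLAG_DIRTY) != 0; }
    size_t size() const { return hot_.size(); }

  private:
    ColdNode *cold_;
    size_t capacity_;
    std::vector<HotEntry> hot_;
};
```

In this sketch, routing-style updates such as `setFavorite` never touch the cold array; the copy happens only when something asks for the full payload through `full()`.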
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/modules/AdminModule.cpp | Updates favorite node operations to use new NodeDB API instead of direct field access |
| src/mesh/mesh-pb-constants.h | Changes MAX_NUM_NODES calculation to prioritize PSRAM size over flash size for ESP32-S3 |
| src/mesh/ProtobufModule.h | Widens numOnlineNodes counter from uint8_t to uint16_t |
| src/mesh/NodeDB.h | Adds PSRAM allocator, NodeHotEntry structure, and hot/cold cache management methods |
| src/mesh/NodeDB.cpp | Implements complete hot/cold split logic with cache synchronization and PSRAM-aware operations |
| src/graphics/niche/InkHUD/Applets/Bases/Map/MapApplet.cpp | Changes loop variables from uint8_t to size_t for handling larger node counts |
| src/NodeStatus.h | Widens all node counter types from uint8_t to uint16_t |
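To illustrate why the counter and loop-variable widening in the table matters, here is a small sketch (the helper `countNodes` is hypothetical, not from the PR). An 8-bit counter wraps modulo 256, so it silently under-reports once a mesh exceeds 255 nodes, while a 16-bit counter covers the node counts discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Count `actualNodes` entries using a counter of type Counter.
// A counter that is too narrow wraps silently instead of failing.
template <typename Counter>
Counter countNodes(size_t actualNodes) {
    Counter n = 0;
    for (size_t i = 0; i < actualNodes; ++i) {
        ++n;
    }
    return n;
}
```

With 300 nodes, `countNodes<uint8_t>(300)` yields 44 (300 mod 256), whereas `countNodes<uint16_t>(300)` yields 300. The same wrap makes a `uint8_t` loop index unable to reach a bound above 255, which is the reason for moving loop variables to `size_t`.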
|
Is this extensible to the extra qspi flash on the xiao NRF52? |
|
@NomDeTom I'm not sure we want to slow the nrf52 down any more than it is already. |
I was just thinking of things like nodeDB rolling attacks could be resisted more easily by increasing the size. I'm not sure this would slow it down particularly. |
|
I currently lack the skill required to make this work for nrf52. Placing in draft for now until someone more talented than I am can make this work. |
|
I currently have a $50 bounty out for anybody better than me who can do this for nrf52 nodes with flash. In the meantime, can we get this in at least for the ESP32s out there? |
|
Can't rush this in, 200 nodes is already problematic and entirely vibe coded solutions are generally buggy, get some people testing builds for this. |
|
More testing complete on this MR and changes since the first revision
Comparing to the development branch, even with 3000 nodes, it is saving about 16% heap memory. All memory is allocated ahead of time.
Testing on a production router has proven successful with no reboots and 661 nodes currently. So far this change has been tested successfully on the following platforms: With 2MB of PSRAM this will use 37%. With 8MB of most ESP32-S3 nodes. Next is to move PacketRecord to PSRAM for a savings of about 120KB with MAX_NUM_NODES == 3000, making the ring 6000 entries. For now it fits in DRAM. Moving back to draft for now, but this is looking very good. |
|
200 nodes is slow over WiFi |
|
@garthvh For me it's very fast with the Lora V4 and Xiao Wio. Bluetooth is a different beast and can take a few minutes to get 400 nodes. It seems like a lazy population might be better for bluetooth nodes if the client side can support that (multiple queues possibly?). I guess we can just limit this to wifi enabled nodes, or only send the most recently heard 100 nodes to the client if on bluetooth. |
|
Needs to be compatible with the 90% of people using Bluetooth. TCP is also pretty slow. 800-3000 nodes seems really optimistic in real-world use. |
|
This MR solves the problem of not having a large enough nodedb. I don't see 3000 nodes as being too much of a problem as the memory is all pre-allocated and leaves enough for everything else, but doing the large dump of the DB when connecting over bluetooth is a problem. I'm not sure why you're finding the wifi to be slow, as I can download the DB very quickly, but maybe my wifi is special.

In my previous message I'm trying to provide solutions to the large DB problem. We can have the node download the last 100 heard, and then do a fair share queue between node info updates and other incoming updates. Other options include doing a comparison of blocks of node IDs and share the ones that are missing. Any other thoughts?

The bay mesh currently cycles through 358 nodes every 3 hours, 500 every day, and currently up to 716 total over the last 8 days. I am guessing this will be towards 900 or 1000 at the end of this year, 2k at the end of next year. Add in some events, and 3k doesn't seem unreasonable as a goal to expand to.

I like this change, and I think it is the absolute best way to increase node counts for nodes with PSRAM while decreasing heap utilization. The problem is the communication between phone and mesh device also needs a refresh to support large data dumps. I see this as the beginning of downloading much larger objects over time without needing a phone always attached. This, plus the ability for reliable message delivery, makes for the ability to transfer images, or other binary data, without impacting realtime communications. Large NodeDB just happens to be the first use-case that requires some kind of fair share mechanism. |
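One way the "last 100 heard" idea from this comment could look, as a sketch only. `HeardEntry` and `mostRecentlyHeard` are hypothetical names, and nothing here reflects how the firmware actually orders its phone sync; it just shows picking the N most recently heard nodes without fully sorting a large table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal view of a node: its number and when it was last heard.
struct HeardEntry {
    uint32_t num;
    uint32_t last_heard;
};

// Return the node numbers of the `limit` most recently heard nodes, newest first.
// Takes the vector by value so the caller's ordering is left untouched.
std::vector<uint32_t> mostRecentlyHeard(std::vector<HeardEntry> nodes, size_t limit) {
    const size_t n = std::min(limit, nodes.size());
    // Only the first n positions need to end up ordered.
    std::partial_sort(nodes.begin(), nodes.begin() + n, nodes.end(),
                      [](const HeardEntry &a, const HeardEntry &b) { return a.last_heard > b.last_heard; });
    std::vector<uint32_t> out;
    out.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        out.push_back(nodes[i].num);
    }
    return out;
}
```

The remaining nodes could then be fed to the client through whatever fair-share queue is chosen; that part is left open here, as it is in the discussion.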
|
@garthvh I thought there were recent optimisations to the app code, to bring the nodeDB over after initial handshake? If this is a way to slow the nodeDB rolling in a big mesh, it seems useful, if not advisable. |
|
It was removed from Android because of issues with the legacy connection process and is newish on iOS. This needs to be isolated to infrastructure roles initially and 3000 is just too high. What is the problem being solved here? For the client apps this creates a ton of issues to manage. |
|
@h3lix1 I think what I would be interested in is how to gate this to infrastructure only roles like Router / Router Late, since those are not accessed as much on client apps, which as was pointed out becomes a headache on initial connection. |
|
@thebentern I don't think restricting this to router roles makes much sense, as the benefits for clients are great as well. As mentioned above, bay mesh nodedb expires clients faster than the 3 hour default nodeinfo cycle, causing issues for next-hop routing, keys, local node info, etc. This is also rather bad for MUI devices since more nodes will start showing up as unknown that should be known. Unless we plan to change how things like encryption and keys are stored/retired, a bad actor can inject the wrong key for any random user simply because the user rolled off the DB.

I don't know how to do this for the nrf52 nodes using a flash-based database. I'm willing to give anybody a bounty to get this working for those nodes as well as it's important, I just don't know how to handle all the gotchas, e.g. flash corruption, write wear leveling, node performance, etc.

If the issue is the initial connection, let's find a way to support lazy loading or simply limit the DB dump size. (Or give clients the option how many to initially load?) But I feel this is very needed for client and router roles alike. If nothing else, it saves a significant amount of heap, even with 3000 nodes. |
|
With the latest bluetooth enhancements, NodeDB downloads much faster. Depending on the node, it downloads 200 nodes in 6-7 seconds in 2.7.13, compared to 18 seconds for 2.7.11. |
8807abf to 26daa3d
f16616e to ebfc9e5
src/libtinylsm/README.md
Outdated
| @@ -0,0 +1,681 @@ | |||
| # Tiny-LSM for Meshtastic NodeDB | |||
Is there an existing LSM library we can use instead?
This adds a huge amount of code that we would have to maintain.
|
As expected, trying to get this to work with nrf52 is a massive lesson in frustration. The diff 26daa3d works well (great, even) for ESP32 without having to deal with trying to write a database for a platform that can't handle it. Reverting this back to ESP32-specific with PSRAM and maybe someone will want it. |
5d01831 to 26daa3d
191de55 to e98b93e
Add a new compile-time flag HAS_PSRAM_NODEDB that allows each variant to enable or disable the PSRAM-backed NodeDB feature independently. This is useful for ESP32-S3 boards with TFT displays or other PSRAM-heavy features that may not have enough PSRAM headroom for the 3000-node database.

Default behavior:
- ESP32-S3 with BOARD_HAS_PSRAM: Enabled (HAS_PSRAM_NODEDB=1)
- All other platforms: Disabled (HAS_PSRAM_NODEDB=0)

Variants can override by adding to variant.h:
- #define HAS_PSRAM_NODEDB 0 // Disable for TFT variants
- #define HAS_PSRAM_NODEDB 1 // Force enable

Example configurations added to t-deck and heltec_v4 variant.h files.
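The default selection described in this commit message might be expressed roughly as follows. This is a sketch under assumptions: the commit does not show the actual guard, so using ESP-IDF's `CONFIG_IDF_TARGET_ESP32S3` to detect the ESP32-S3 is a guess, and `psramNodeDbEnabled` is a hypothetical helper added only so the result can be checked.

```cpp
#include <cassert>

// A variant.h that defines HAS_PSRAM_NODEDB first wins, because of the #ifndef.
#ifndef HAS_PSRAM_NODEDB
// Assumption: ESP32-S3 is detected via CONFIG_IDF_TARGET_ESP32S3; the PR may use another macro.
#if defined(CONFIG_IDF_TARGET_ESP32S3) && defined(BOARD_HAS_PSRAM)
#define HAS_PSRAM_NODEDB 1
#else
#define HAS_PSRAM_NODEDB 0
#endif
#endif

static_assert(HAS_PSRAM_NODEDB == 0 || HAS_PSRAM_NODEDB == 1, "HAS_PSRAM_NODEDB must be 0 or 1");

// Hypothetical helper so code can branch on the flag without more #if blocks.
constexpr bool psramNodeDbEnabled() { return HAS_PSRAM_NODEDB != 0; }
```

Because the default sits behind `#ifndef`, a TFT variant that defines `HAS_PSRAM_NODEDB 0` before this point keeps its override, which matches the override mechanism the commit message describes.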
617e28b to f8a969f
I have tested this over the last month on routers and client devices alike.
There is a bubble sort used for nodedb. It's completing normally within 3–4 ms, but sometimes jumps to 11 ms. This seems OK, but willing to accept advice here.
Node Hot/Cold Split
- Full meshtastic_NodeInfoLite payload stored in PSRAM using a custom allocator that calls heap_caps_malloc(MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT) (src/mesh/NodeDB.h:20, src/mesh/NodeDB.cpp:73).
- NodeHotEntry cache in DRAM (~20 B per node: num, last_heard, snr, channel/flags) alongside dirtiness bits for sync-on-demand (src/mesh/NodeDB.h:33, src/mesh/NodeDB.cpp:78).
Memory Footprint per Node (bytes)
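As a rough worked example of the per-node footprint, using only the sizes quoted in this PR (196 B per NodeInfoLite payload in PSRAM, about 20 B per NodeHotEntry in DRAM). The constant and function names are hypothetical, and the totals cover the node tables only, not packet history or allocator overhead.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kColdBytesPerNode = 196; // NodeInfoLite payload size quoted in the review summary
constexpr size_t kHotBytesPerNode = 20;   // approximate NodeHotEntry size quoted in the PR

// Bytes of PSRAM used by the cold payloads for a given node count.
constexpr size_t coldBytes(size_t nodes) { return nodes * kColdBytesPerNode; }
// Bytes of DRAM used by the hot cache for a given node count.
constexpr size_t hotBytes(size_t nodes) { return nodes * kHotBytesPerNode; }

static_assert(coldBytes(3000) == 588000, "3000 nodes * 196 B");
static_assert(hotBytes(3000) == 60000, "3000 nodes * 20 B");
static_assert(coldBytes(800) == 156800, "800 nodes * 196 B");
static_assert(hotBytes(800) == 16000, "800 nodes * 20 B");
```

At the 3000-node cap this comes to 588,000 B of PSRAM for payloads and 60,000 B of DRAM for the hot cache.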
Capacity & Secondary Effects:
- MAX_NUM_NODES is a max of 3000 nodes (revised from 5000) as long as PSRAM size is > 2; otherwise the old flash-based limits apply (src/mesh/mesh-pb-constants.h:54).
- Packet history holds max(MAX_NUM_NODES*2, …) entries (src/mesh/PacketHistory.cpp:11), so doubling the node ceiling means the history structure grows accordingly. Keep an eye on overall PSRAM consumption if future caps rise again.
Serialization & Cold Access
- NodeDB save/load moves through PSRAM: hot nodes are copied into a temporary vector before protobuf encoding, then cleared back out after disk writes (src/mesh/NodeDB.cpp:1322, src/mesh/NodeDB.cpp:1414).
Runtime Behavior
- getMeshNodeChannel, set_favorite, online counts, packet next-hop updates (src/mesh/NodeDB.cpp:1750, src/mesh/NodeDB.cpp:1939, src/mesh/NodeDB.cpp:2156).
- NodeInfo dumps to the phone, detail panels that copy cold payloads, database saves: each now copies between DRAM and PSRAM but only on demand.
Large Mesh Readiness
- Node counters widened to uint16_t (src/NodeStatus.h:16, src/mesh/ProtobufModule.h:16).
- Map applet loop variables are now size_t, so they handle the full PSRAM-backed node list without truncation (src/graphics/niche/InkHUD/Applets/Bases/Map/MapApplet.cpp:156).
Also recently added
- has_psram() so we only scale up when ≥ 2 MB of PSRAM is available.
- MeshPacket pool into a PSRAM-backed allocator on ESP32-S3; if allocation fails we fall back to heap so radios keep working.
- NodeDB sizing, so nodes stay capped on low-memory boards without more ESP.getPsramSize() calls.
🤝 Attestations
Devices tested
(Other — please specify below)