@s-light
Contributor

@s-light s-light commented Apr 11, 2018

This pull request adds support for the TI TLC5971 12-channel, 16-bit LED driver chips to the SPI plugin.

It is basically working: as far as I have seen, if a configuration works, it works on every start.
But some configurations lead to random malloc(): memory corruption and Received Segmentation fault crashes.
That is not good 🪲
Currently I think it has to do with the number of pixels/ports in use, but I have no evidence for this yet.

I am currently out of ideas on how to track down the cause of this,
and hope one of you can give me an idea of how to start debugging it.
It definitely has something to do with my code, since it only shows up when I use the plugin with the new pixel type.

Member

@peternewman peternewman left a comment

Some initial review comments

// calculate DMX-start-address
const unsigned int first_slot = m_start_address - 1; // 0 offset

// calculate how much channels for full devices are available in dmx_buffer
Member

SPaG: How many


personalities.insert(personalities.begin() + PERS_TLC5971_INDIVIDUAL - 1,
Personality(m_pixel_count * TLC5971_SLOTS_PER_DEVICE,
"TLC5971 Individual Control (16bit per channel)"));
Member

We probably ideally want both 8 and 16bit individual and combined options eventually, but feel free to fix the underlying issues first.

Contributor Author

Yes, that is the plan once the rest works :-)
For the combined options there is more than one way to solve it:

  1. repeat/copy one set of driver channels (24ch) to all others
  2. repeat/copy the first 3 LED values (3 or 6 ch for 8 or 16bit modes) to all other positions

Which do you think makes more sense?
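
The two options can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, 24 bytes per device is assumed (12 channels at 16 bit), and the mapping from DMX slot order to the chip's wire order is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed size of one driver's grayscale data: 12 channels x 16 bit.
constexpr std::size_t kBytesPerDevice = 24;

// Option 1 (hypothetical name): one 24-byte block from the DMX buffer
// is copied to every driver chip.
std::vector<uint8_t> RepeatDeviceBlock(const uint8_t *dmx,
                                       std::size_t device_count) {
  std::vector<uint8_t> out(device_count * kBytesPerDevice);
  for (std::size_t i = 0; i < device_count; ++i) {
    std::memcpy(out.data() + i * kBytesPerDevice, dmx, kBytesPerDevice);
  }
  return out;
}

// Option 2 (hypothetical name): one LED value is copied to every LED
// position. bytes_per_led is 3 for 8-bit RGB input and 6 for 16-bit.
std::vector<uint8_t> RepeatLedValue(const uint8_t *dmx,
                                    std::size_t bytes_per_led,
                                    std::size_t led_count) {
  std::vector<uint8_t> out(led_count * bytes_per_led);
  for (std::size_t i = 0; i < led_count; ++i) {
    std::memcpy(out.data() + i * bytes_per_led, dmx, bytes_per_led);
  }
  return out;
}
```

Option 1 needs 24 DMX slots and keeps per-channel control within a chip; option 2 needs only 3 or 6 slots and makes every LED show the same color.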

Member

So it's a 4 channel RGB driver right, with global dimmer or similar over each RGB group? You could also use it as 3 channel RGBA drivers (although the global dimmer wouldn't align).

I suspect the latter option is likely to be a better fit for most people, but it's kind of hard to tell.

The main solution would be to implement http://rdm.openlighting.org/pid/display?manufacturer=31344&pid=32773 and then personalities can be independent of driver type and just offer a range of sensible ones.

Contributor Author

Yes, it's meant as 4xRGB. I think it's not really a global dimmer; it's more a correction value per color group. (But it is a long time since I actually read the datasheet or wrote this code; possibly there is both a correction and a color-group dimming.)
As far as I know, none of the libraries I have found let you control any of the 'advanced' features.
For the PIXEL_TYPE thing, I think that is handled in #871.
Do you mean it makes more sense for me to try that first? (I currently don't know how much work it is or where to start, but I can read up on it.)

Member

I don't think #871 should be masses of work @s-light , it should essentially be just tracking another variable and then using both to work out what function to run, rather than just personality. We probably also need to double check the RDM spec, but I think in theory every fixture should offer all personality sets, but perhaps just NAck the irrelevant ones (like a 24 channel clone on a normal RGB WS2801 or whatever). I suspect it broadly makes more sense to do it first, although I guess the bulk of the code to write is in the functions that actually process the DMX, so perhaps it doesn't make that much difference overall.

We should probably try and add the PIXEL_TYPE PIDs to the web UI, which will be interesting as the first manufacturer specific ones, but I can probably handle that bit. As well as that stuff needing to go in the config file.

// Device ..
// Device 2
// Device 1
// short brake of 8x period of clock (666ns .. 2.74ms) to generate latchpulse
Member

SPaG break

const unsigned int first_slot = m_start_address - 1; // 0 offset

// calculate how much channels for full devices are available in dmx_buffer
uint16_t devices_in_buffer =
Member

I think this can be a uint8_t, or at least the value will always be < 255, or does it need to be a uint16_t as that's what buffer.Size() returns?
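
A minimal sketch of the calculation in question, assuming 24 slots per device and a universe of at most 512 slots; `DevicesInBuffer` and its parameters are hypothetical names, not the plugin's actual code. Under those assumptions the result never exceeds 21, so it fits a uint8_t. The guard for `first_slot` lying beyond the buffer matters more than the width, because the unsigned subtraction would otherwise wrap around.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed: 12 channels x 2 bytes per TLC5971.
constexpr unsigned int kSlotsPerDevice = 24;

// Hypothetical helper: number of complete devices that can be filled from
// the DMX buffer, starting at first_slot (0-based).
uint8_t DevicesInBuffer(unsigned int buffer_size,
                        unsigned int first_slot,
                        uint8_t device_count) {
  if (first_slot >= buffer_size) {
    return 0;  // avoids unsigned wrap-around in the subtraction below
  }
  const unsigned int available =
      (buffer_size - first_slot) / kSlotsPerDevice;
  // available is at most 512 / 24 = 21 for a DMX universe, and the
  // minimum with a uint8_t value always fits a uint8_t.
  return static_cast<uint8_t>(
      std::min<unsigned int>(available, device_count));
}
```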

return;
}

// rename m_pxiel_count for easier understanding.
Member

SPaG pixel

PACK(
struct TLC5971_packet_config_fields_t{
// Write Command (6Bit)
uint8_t WRCMD : 6;
Member

Okay I've learnt some new syntax here. 😄


union TLC5971_packet_gsdata_t {
uint8_t bytes[24];
// the uint16_t will not work everywhere because of endianess problems..
Member

SPaG: endianness

Member

I think some use of our Host to Network code should fix that issue.
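
For illustration, a host-independent way to fill the byte array, assuming the chip expects each 16-bit value most-significant byte first. Network byte order is big-endian, so a Host to Network conversion of each value followed by a raw copy would give the same bytes. `PackBigEndian16` is a hypothetical helper, not part of OLA.

```cpp
#include <cstddef>
#include <cstdint>

// Writes `count` 16-bit values into `out`, most-significant byte first,
// regardless of the host's byte order. `out` must hold 2 * count bytes.
void PackBigEndian16(const uint16_t *values, std::size_t count,
                     uint8_t *out) {
  for (std::size_t i = 0; i < count; ++i) {
    out[2 * i] = static_cast<uint8_t>(values[i] >> 8);        // high byte
    out[2 * i + 1] = static_cast<uint8_t>(values[i] & 0xFF);  // low byte
  }
}
```

Shifting and masking works on the value rather than on its memory layout, which is why it avoids the problem the union has on hosts with a different byte order.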

Contributor Author

i will have a look at this..

@peternewman
Member

Can you give us some working and broken configurations to compare please?

@s-light
Contributor Author

s-light commented Apr 12, 2018

Thanks for your feedback!

I will test some configurations for comparison this evening and write up what I find.
Last time I just had to enable 4 ports
(all at 16 'pixels'; every pixel of this type has 12 LEDs at 16 bit, so 24 channels, as that is the smallest amount of data I need to generate a packet for one driver chip)
and it crashed. 1, 2, and 3 ports worked fine before I tested 4, all with the same 16 pixels. But when I went back to 3 or 2 after a crash, it also crashed; only 1 port worked then.
So I want to make sure I test this in a more structured way (for example, whether restarting the system makes a difference).

@s-light
Contributor Author

s-light commented Apr 12, 2018

Here are my tests / configurations.
As a basis I used this set of files:
LEDBoard_Layout_Sun/sw/ETH_SPI_bridge/ola_config/target_config

I only changed things in ola-spi.conf.

TLDR result

As far as I can tell, it is related to the pixel-count value only.
The results group as follows:

| result | pixel-count |
|---|---|
| works | 20, 21 |
| crashes on normal exit | 9, 10 |
| crashes after a short time | 8, 12 |
| crashes immediately | 1, 2, 3, 4, 5, 6, 7, 11, 13, 15, 16, 17, 18, 19 |

(more than 21 is not possible because of the universe limit: 21*24=504)

That's all for tonight. I will try to read through the code once more tomorrow and hope I find something that looks weird to me...


details

(I wrote this down while testing, to get it documented.)


1. try the config as is - crash

*** Error in '/usr/local/bin/olad': double free or corruption (!prev): 0x0009ebb0 ***
Aborted

2. enable only 1 port with pixel-count 20 - works

(Only posting what I changed or what is relevant.)

spidev32766.0-ports = 1
spidev32766.0-0-pixel-count = 20

3. changing pixel-count to 21 - works

spidev32766.0-0-pixel-count = 21

4. change pixel-count to 1 - crashes

spidev32766.0-0-pixel-count = 1

error message is

*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x000f1e38 ***
Aborted

6. changing pixel-count to 20 and port count to 2 - works

spidev32766.0-ports = 2
spidev32766.0-0-pixel-count = 20
spidev32766.0-1-pixel-count = 20

7. changing all pixel-count to 20 and port count to 12 - does not crash

spidev32766.0-ports = 12
spidev32766.0-*-pixel-count = 20

error

plugins/spi/SPIWriter.cpp:119: Failed to write all the SPI data: Message too long
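
One possible reading of this error, as a rough calculation that is not confirmed anywhere in this thread: if the data for all ports is concatenated into a single SPI transfer (an assumption about the backend in use) and the Linux spidev driver runs with its default `bufsiz` of 4096 bytes, then 12 ports at 20 devices no longer fit, while 7 ports still do. The 28 bytes per device are the packet size mentioned in the code under review; `FrameBytes` is a hypothetical helper and latch bytes are ignored.

```cpp
#include <cstddef>

constexpr std::size_t kBytesPerDevice = 28;         // 224 bit per packet
constexpr std::size_t kSpidevDefaultBufsiz = 4096;  // Linux spidev default

// Hypothetical helper: bytes in one combined transfer, assuming all ports
// are sent in a single write.
constexpr std::size_t FrameBytes(std::size_t ports,
                                 std::size_t devices_per_port) {
  return ports * devices_per_port * kBytesPerDevice;
}

static_assert(FrameBytes(7, 20) == 3920, "7 ports x 20 devices");
static_assert(FrameBytes(12, 20) == 6720, "12 ports x 20 devices");
static_assert(FrameBytes(7, 20) <= kSpidevDefaultBufsiz, "fits");
static_assert(FrameBytes(12, 20) > kSpidevDefaultBufsiz, "too long");
```

If that assumption holds, raising the `bufsiz` module parameter of spidev would be the thing to try.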

8. changing all pixel-count to 20 and port count to 7 - works

spidev32766.0-ports = 7
spidev32766.0-*-pixel-count = 20

9. enable only 1 port with pixel-count 2 - crash

spidev32766.0-ports = 1
spidev32766.0-0-pixel-count = 2

error

Received Segmentation fault
^Ccommon/thread/SignalThread.cpp:115: Received signal: Interrupt
Killed

10. pixel-count 3..8

| pixel-count | error message |
|---|---|
| 3 | Segmentation fault |
| 4 | `*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x00094f70 ***` Aborted |
| 5 | `*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x000851f8 ***` Aborted |
| 6 | `*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x0008a738 ***` Aborted |
| 7 | `*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x0009db50 ***` Aborted |
| 8 | worked for a short moment, then `*** Error in '/usr/local/bin/olad': corrupted double-linked list: 0x00051160 ***` Aborted |

11. pixel-count 9 - works - kind of

It works as long as it is running; the moment I hit Ctrl+C to exit, I get

*** Error in '/usr/local/bin/olad': double free or corruption (!prev): 0x0009af60 ***
Aborted

12. pixel-count 10 - works - kind of

Similar to 9, but on one occasion I got a segmentation fault and had to kill it manually afterwards:

olad/AvahiDiscoveryAgent.cpp:236: State for OLA Server._http._tcp,_ola, group 0xead00 changed to AVAHI_ENTRY_GROUP_ESTABLISHED
^Ccommon/thread/SignalThread.cpp:115: Received signal: Interrupt
common/http/HTTPServer.cpp:537: Notifying HTTP server thread to stop
common/http/HTTPServer.cpp:539: Waiting for HTTP server thread to exit
common/http/HTTPServer.cpp:541: HTTP server thread exited
Received Segmentation fault
Killed

13. pixel-count 11..19

| pixel-count | error message |
|---|---|
| 11 | `*** Error in '/usr/local/bin/olad': corrupted double-linked list: 0x000ecdf8 ***` Aborted |
| 12 | worked for a moment, then `*** Error in '/usr/local/bin/olad': corrupted double-linked list: 0x000a5280 ***` Aborted |
| 13 | `*** Error in '/usr/local/bin/olad': corrupted double-linked list: 0x000a5280 ***` Aborted |
| 14 | worked, kind of, like 9 |
| 15 | `*** Error in '/usr/local/bin/olad': malloc(): smallbin double linked list corrupted: 0x0009ef28 ***` Aborted |
| 16 | crash like 10, but had to kill it every time I tried |
| 17 | crash like 16 |
| 18 | crash like 16 |
| 19 | `*** Error in '/usr/local/bin/olad': malloc(): memory corruption: 0x0008ae48 ***` Aborted |

@peternewman
Member

Possibly relevant:
https://stackoverflow.com/questions/19534051/glibc-detect-smallbin-linked-list-corrupted?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

I'd look at what you checkout from the SPI buffer, and how you access that afterwards.

On a different note, if you update your branch compared to master, the Travis build should start working again.

Member

@peternewman peternewman left a comment

A few random thoughts on what might be breaking it.

// should return 28byte = 224bit

// copy data to output buffer
// memcpy(output + spi_offset, device_data.bytes, sizeof(TLC5971_packet_t));
Member

Does the memcpy not work? Why not, that ought to be quicker and safer.
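
A memcpy of the whole packet is only safe if the struct really is 28 bytes with no padding, which can be checked at compile time. A sketch using a simplified stand-in for the packet (4 header bytes plus 24 grayscale bytes, without the bit-fields from the diff); `Packet` and `CopyPacket` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Simplified stand-in for the packet in the diff.
struct Packet {
  uint8_t header[4];   // write command, function control, brightness
  uint8_t gsdata[24];  // 12 channels x 16 bit
};

// Fail the build if the layout is not what the chip expects.
static_assert(sizeof(Packet) == 28, "packet must be exactly 224 bit");
static_assert(std::is_trivially_copyable<Packet>::value,
              "memcpy requires a trivially copyable type");

// Copies one packet to output at spi_offset. The caller must ensure that
// spi_offset + sizeof(Packet) does not exceed the output buffer size.
void CopyPacket(const Packet &packet, uint8_t *output,
                std::size_t spi_offset) {
  std::memcpy(output + spi_offset, &packet, sizeof(Packet));
}
```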

} // for devices_in_buffer end

// write output back
m_backend->Commit(m_output_number);
Member

Low-tech debugging: add a log line before this to see whether it is this call that causes the issue or the code above.

@peternewman peternewman added this to the 0.11.0 milestone May 2, 2018