DAQ Testing/20181217
December 17th, 2018. Testers: Cameron
Goals
- Test ET deadtime with the et_monitor C++ executable; perform data-flow stress tests at 30 Hz and check for accidental deadtime caused by bridge slowness at 1 kHz
ET Deadtime
See also ET Bridge overview
This is what I did to perform precise and comprehensive tests of the ET deadtime at 30 Hz and 1 kHz:
- Use 30 Hz injector quartet helicity signals
- Modify default chunk and queue sizes for testing:
- set ~/bin/ETbridge_bgr queue to 1 (-q parameter, was 10,000 from previous tests)
- set ET bridge blocking on (remove -nb parameter)
- set ET bridge chunk size to 100 (default value = 100, with -c parameter)
- set (for these tests only) the 32 and 64 bit ET queue (input buffer) lengths down to 30, instead of the 10,000 they had previously been increased to, so they fill in about a second instead of effectively never; set these with the -q parameter in ~/bin/startcoda and ~/bin/start2ndET
- Modify client (~/et-12.0/src/cameronc/client_main.C) settings for testing (a minimal sketch of where these knobs live in the ET API follows this list):
- set the client into blocking mode by editing THaEtClient.C and changing "NONBLOCKING" to "BLOCKING" in the ET station initialization
- set the client chunk size to 1 instead of the default value of 100 (this makes the client request a fresh set of events once for each event, instead of waiting several seconds and confusing the user/the simple online analyzer code Bob wrote)
- Run the online analyzer simple client using ./etclient [number of events] [time to "wait"] (it will print the event number for each event, complain if event numbers are skipped, and calculate relative rates using scalers in the counting house DAQ, as described in prior testing)
- Check for helicity errors using the ~/chkdtime/dtchk code on files in /adaq2/data1/apar/parity18_[run number].dat
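For orientation, here is roughly where those knobs (blocking vs. non-blocking station, "cue", chunk) live in the standard ET C API. This is only a minimal sketch, not the actual THaEtClient.C code; the station name and the use of /tmp/et_sys_par2 as the ET file are placeholders/assumptions.

  /* Minimal sketch of an ET client station with the knobs used above.
   * Placeholder names and file path; not the actual THaEtClient.C code. */
  #include <stdio.h>
  #include <et.h>

  int main(void)
  {
      et_sys_id     id;
      et_openconfig oconfig;
      et_statconfig sconfig;
      et_stat_id    stat;
      et_att_id     att;

      /* Attach to the 64 bit ET system (file name assumed from the notes below). */
      et_open_config_init(&oconfig);
      if (et_open(&id, "/tmp/et_sys_par2", oconfig) != ET_OK) {
          fprintf(stderr, "cannot open ET system\n");
          return 1;
      }
      et_open_config_destroy(oconfig);

      /* Station settings: this is where BLOCKING vs NONBLOCKING and the "cue" live. */
      et_station_config_init(&sconfig);
      et_station_config_setblock(sconfig, ET_STATION_BLOCKING);  /* the NONBLOCKING -> BLOCKING edit */
      et_station_config_setcue(sconfig, 30);                     /* "cue" (input list limit, used in non-blocking mode) */

      et_station_create(id, &stat, "ANALYZER", sconfig);          /* station name is a placeholder */
      et_station_config_destroy(sconfig);
      et_station_attach(id, stat, &att);

      /* The "chunk" is the maximum number of events requested per call;
       * chunk = 1 asks for one event at a time, as set for these tests. */
      et_event *evs[1];
      int nread = 0;
      while (et_events_get(id, att, evs, ET_SLEEP, NULL, 1, &nread) == ET_OK) {
          /* ... decode/analyze evs[0..nread-1] here ... */
          et_events_put(id, att, evs, nread);   /* hand the events back downstream */
      }

      et_close(id);
      return 0;
  }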
Blocking Bridge, Blocking Client
- Run 4776 - test the PAN decoder, CODA setup (including vqwk ADCs), and online analyzer client all work
- Client ran at 16 ms per event, which is faster than the 30 Hz flip rate (33 ms per event)
- Yes they all work, there are no helicity errors in tape data, no missed event numbers in online analyzer, and vqwk ADCs don't register any buffer read/write errors in the ROC during data collection either
- Run 4777 - test with a slow analyzer (analyzer speed < flip rate) - start with 66 ms per event (~2x slower than flip rate)
- There are no helicity errors in tape data, no missed event numbers in online analyzer, and vqwk ADCs don't error during data collection either
- Once the online analyzer client is killed the backed up events flow through the system unperturbed, even though the entire 64 and 32 bit ET systems had been fully backed all the way up
- Run 4778 - Test extremely slow analyzer (0.1 Hz analysis) to see how many events can be stored in extra buffer (The ROC's memory, according to Paul King, should be ~2k events or so)
- This definitely grinds both 64 and 32 bit ET systems to a complete halt
- No events flow through the ET systems or analyzer once all ET buffers are full, but once the analyzer is killed all of the events produced in the ROC come pouring out
- The data are good from event 2000 (when the analyzer was turned on) to event 7299, where a single vqwk read/write buffer error appears in the ROC telnet output; the helicity checker confirms that this is exactly where the helicity sequence is violated and helicity errors begin
- The ROC buffer is on the order of 3k events, but further tests will clarify what is going on
- Run 4779 - Test to see if maybe it's actually the client's input buffer (utilized explicitly during non-blocking mode)
- Set line 112 ("cue") of THaEtClient.C code to have 30 events long buffer instead of 10,000 as was previously increased to
- Event 1637 gives the vqwk error (seemingly unrelated to the online analyzer client running)
- Helicity errors occur
- Run 4780 - Redo 4779 test
- Made it to 2496 events before vqwk error this time
- All but the first 100 events were backed up into the ROC memory (so it is not the client "cue")
- The "normal" vqwk read/write buffer errors are too common for reliable testing of exactly how large the ROC memory is, but assuming ~2k events is probably safe for now
- Run 4781 - Test et bridge input buffer size and also test to see what the helicity and event numbers and total number of events look like when relying on ROC memory for a long time
- Edit the ET bridge code and recompile so it uses 30 total events instead of 10,000: open ~/et-12.0/bob/et32_2_et64.c, edit the #NUMEVENTS line, recompile with make etbridge, and copy the executable over to ~/bin/etbridge (you may need to make clean, "cp ./orig/libevio.a .", and then remake)
- This still backs up both ETs as before, and ROC 23 still stays active for a few thousand events as before
- Now, looking at the extreme backed-up ROC behavior, I see in the RC gui that ROC 23 eventually hits a wall and reports a "0" event data-taking rate after long enough
- Interestingly, once the ~few-thousand-event ADC read/write buffer error is reported (event 2494), the ROC resumes reading data (as reported in the RC gui)
- It resumes, then the RC gui ROC event rate drops to zero again, and the ADC hiccup happens again, resuming ROC event reading again (so it's a cycle)
- Maybe this has to do with a chunk getting grabbed, or maybe the ADCs legitimately have a problem sometimes, probably related to starting or stopping writing events (the ADCs very often have the same kind of hiccup when beginning a new run, in fact so often that I told the helicity deadtime checking code to skip the first 30 events so that this issue doesn't confound its pattern recognizer)
- Still, once you kill the client, thousands of events come pouring through the ET system from the ROC
- Events are definitely skipped in these ADC hiccups, and the helicity sequence is broken
Blocking Bridge, Non-Blocking Client
- Run 4782 - Testing blocking bridge and non-blocking client, and also testing whether the blocking bridge chunk size affects the ADC hiccup
- Set non-blocking client "cue" to 30, but ET bridge #NUMEVENTS back up to 10,000
- This should act like a blocking client station, since the cue length is the same as the total number of events in the 64 bit ET
- In fact it does act like a blocking client station
- Run 4783 - now set the 64 bit ET client total number of events to 50, so it is larger than the client cue, for a proper non-blocking client station test (the 64 bit ET chunk is still == 100 events here, but that is not a source of any problem I can see: this chunk is just a maximum n_events per request, not a minimum as in the case of the client station chunk; see the short sketch at the end of this section)
- Now Grand Central in both the 32 bit and 64 bit ET systems is happy and sees all of the events, with no vqwk hiccups
- The non-blocking client analyzer does get bypassed when the analyzer speed is slower than the helicity flip rate, which means no deadtime in the DAQ but missed events in the analyzer
- Run 4785 (4784 CODA failed) - see how harmful setting the chunk size of the ET bridge is for the ability of the bridge to transfer data at necessary rates from one ET to the other (still at 30 Hz injector supplied helicity information)
- Set ET Bridge chunk size to 1 (with -c parameter)
- This has no new effect; chunk = 1 is sufficient at 30 Hz (queue is still 30 from earlier)
- But the problem before was actually too small of a queue size (so too many events show up before the analyzer has a chance to use them, causing the queue to fill up and lose events - see prior testing notes)
- Run 4786 - test bridge chunk = 1 still and now bridge queue = 1 too to test transfer rate
- Appears to be fine, 30 Hz is slow enough to not be affected by slow bridge transfer
- Run 4787 - try 1 kHz helicity flip rate from Helicity control board now, using vxworks writeHelBoard command and editing integration time in ~/vxworks/halladaq6.boot script (same bridge chunk and queue = 1 scenario)
- No vqwk read errors, Grand Central keeps up with 1 kHz, and the helicity reads fine, but the vqwks still glitch too much, even at the very beginning, to get a long helicity sequence going (skip to Run 4790 for one that does work, with nothing changed to fix it)
- Run 4788 - try again with no analyzer - helicity works
- Run 4789 - try again with slow analyzer - helicity fails
- Run 4790 - try again with slow analyzer - helicity works now (so it's just being obnoxious...), but does it only work because my analyzer was so slow that the small chunk and queue sizes in the bridge weren't even relevant for data transfer?
- Run 4791 - Try 1 kHz now with q = 1 and chunk = 10
- Early helicity error, try again later
- Run 4792 - Try again to avoid early helicity error
- Still has early error
- Fix the helicity deadtime checking code by making it skip over the first 30 events anyway, so it no longer matters whether the initial few events are good or bad
- Now, checking the helicity, I see no errors, but this is still the slow analyzer, which would avoid queue and chunk size problems anyway
- Run 4793 - Now try a fast analyzer with queue = 1 and chunk = 10 in ET bridge
- It works, no helicity errors
- Run 4794 - Try again, but with queue = 1 and chunk = 1 in ET bridge again
- No helicity errors, but there are lots of vqwk errors in the serial port readout terminal (even after killing the client). We need to keep track of these vqwk errors or squash them permanently during the experiment, because this would probably be a killer for collecting proper data
- Reboot to see if that fixes - it does
- Run 4795 - Retry 4794 after rebooting
- Works, no helicity errors. I don't remember whether the online analyzer was missing events (I didn't write it down). If it was, that would come from the bridge not transferring fast enough, which I would expect to fill up the 32 bit ET station's input buffer or back up the DAQ; that doesn't happen, or at least it doesn't cause helicity errors in the Tape or vqwk ROC readout (which doesn't mean it can't, just that I didn't see it). If the analyzer was seeing all events, then the blocking mode ET bridge works differently than non-blocking even with small queue and chunk sizes (see the next runs)
- If this mode doesn't skip events on the way to the analyzer, then we definitely don't need to worry about throttling through the ET bridge: it will happily send all the events to the analyzer, and even though it is in blocking mode it won't panic that it isn't keeping up
- The ET bridge not being fast enough at sending definitely appears to be a problem when the ET bridge is in non-blocking mode; see the next run...
- Having a large enough chunk and queue is probably still what determines whether the analyzer gets all the events (at 1 kHz at least). I still feel a bit funny about this conclusion, but it is safe for now to simply use a large enough queue and chunk, and we will probably not use the blocking bridge anyway, since it is possibly dangerous and shouldn't be messed around with
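A note on the chunk remark in Run 4783: as far as I can tell, at the ET API level a chunk only caps how many events a single et_events_get() call may return; in ET_SLEEP mode the call comes back once at least one event is available, carrying up to chunk events, rather than waiting to accumulate a full chunk. A short sketch of that behavior (placeholder names, and an expert should confirm this reading of the API):

  /* Sketch: "chunk" is a per-call maximum, not a minimum.
   * et_events_get() returns whatever is currently available, up to CHUNK.
   * Placeholder names; assumes the standard ET C API. */
  #include <stdio.h>
  #include <et.h>

  void read_loop(et_sys_id id, et_att_id att)
  {
      enum { CHUNK = 100 };          /* maximum events per request */
      et_event *evs[CHUNK];
      int nread = 0;

      /* In ET_SLEEP mode this blocks until at least one event exists,
       * then returns 1..CHUNK events; it does not wait for a full chunk. */
      while (et_events_get(id, att, evs, ET_SLEEP, NULL, CHUNK, &nread) == ET_OK) {
          printf("got %d of up to %d events\n", nread, CHUNK);
          et_events_put(id, att, evs, nread);
      }
  }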
Non-Blocking Bridge, Blocking Client
- Run 4797 - Check small ET bridge chunk and queue size first (still 1 kHz)
- This works, and there are no helicity errors, but the data flow into the 64 bit ET system is very slow (to be precise: there are significant chunks of events that are skipped over)
- This was explored in more detail before in runs 4736 and other related tests in prior testing
- Run 4798 - Set 30 Hz from injector helicity, set queue of ET bridge to 30, chunk size to 100, and test a slow analyzer vs. deadtimes again
- Now, with reasonable ET bridge queue lengths and such, but with a slow analyzer, we see skipped chunks of events in the analyzer: the bridge's input buffer fills up and the 32 bit ET system then bypasses it, so events are skipped
- There is no deadtime in the 32 bit ET system or Tape or ROC or DAQ
- This is the configuration that I suggest we run the entire experiment in (the bridge-side station settings are sketched below)
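For concreteness, the bridge-side piece of this recommended configuration is just a non-blocking station in the 32 bit ET with a reasonably long cue. Below is a minimal sketch of those settings with placeholder names; the real station is set up inside ~/et-12.0/bob/et32_2_et64.c (not copied here), and the cue of 500 is the value suggested in the Conclusions below.

  /* Sketch of the recommended bridge-side station settings (placeholder names;
   * not the actual et32_2_et64.c code). Non-blocking so the 32 bit ET is never
   * backed up, with a cue long enough to absorb bursts at 1 kHz. */
  #include <et.h>

  int make_bridge_station(et_sys_id id32, et_stat_id *stat, et_att_id *att)
  {
      et_statconfig sconfig;

      et_station_config_init(&sconfig);
      et_station_config_setblock(sconfig, ET_STATION_NONBLOCKING); /* skip, don't stall, when behind */
      et_station_config_setcue(sconfig, 500);  /* queue of 500 survives 1 kHz (see Conclusions) */

      if (et_station_create(id32, stat, "ET_BRIDGE", sconfig) != ET_OK) { /* placeholder name */
          et_station_config_destroy(sconfig);
          return -1;
      }
      et_station_config_destroy(sconfig);
      return (et_station_attach(id32, *stat, att) == ET_OK) ? 0 : -1;
  }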
Non-Blocking Bridge, Non-Blocking Client
- Run 4799 - same configuration of queues and such as 4798, same slow analyzer
- No deadtime in the DAQ anywhere
- The slow analyzer definitely skips events, this time due to the Analyzer client station getting bypassed in the 64 bit ET system
- This configuration will jumble the order of events in exactly the same way as the 4798 configuration
- Except the jumbling is happening in the 64 bit ET system, one step away from the purview of the RC gui and running CODA
- Which in my opinion makes it less controllable/diagnosable without using the et_monitor program during data collection runs
- That sounds tedious and unnecessary to me, so just use the 4798 config instead and make sure there are no sensitive components downstream of the ET bridge that would care about a jumbled data flow
- You could also use an "if the input buffer is 95% full, then release all events in this ET bridge or online analyzer input buffer so that the event sequence is not compromised" scheme, but that has not been implemented yet (a hypothetical sketch follows)
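To be explicit about that last idea, here is a hypothetical sketch. Nothing like this is implemented; it assumes the ET library's et_station_getinputcount() reports how many events sit in a station's input list, and all names and numbers are illustrative.

  /* Hypothetical, unimplemented: flush a station's input list when it is ~95% full
   * so the held events are passed along before newer events bypass them.
   * Assumes et_station_getinputcount() behaves as its name suggests. */
  #include <et.h>

  #define CUE_SIZE   500                       /* station cue length, as configured */
  #define FLUSH_MARK ((int)(0.95 * CUE_SIZE))  /* "95% full" threshold */

  void maybe_flush(et_sys_id id, et_stat_id stat, et_att_id att)
  {
      int cnt = 0;
      if (et_station_getinputcount(id, stat, &cnt) != ET_OK)
          return;

      if (cnt >= FLUSH_MARK) {
          et_event *evs[CUE_SIZE];
          int nread = 0;
          /* ET_ASYNC: return immediately with whatever is queued, up to CUE_SIZE. */
          if (et_events_get(id, att, evs, ET_ASYNC, NULL, CUE_SIZE, &nread) == ET_OK && nread > 0)
              et_events_put(id, att, evs, nread);  /* release them downstream untouched */
      }
  }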
ET Bridge Sleep Time Work Around Test
I don't like waiting 20 seconds each time CODA wants to start a new run just to allow the ET bridge to decide whether it can safely connect to the 64 bit ET system yet. The reason the wait is there is that if you initialize the ET bridge in the startcoda script, it races and sometimes beats the initialization of the 64 bit ET, and if it wins the race it fails to connect because it can't find /tmp/et_sys_par2. There are several ways to try to avoid this, but I am not an expert; here are my attempts, with the sleep time set to 0 in the ETbridge_bgr script, etc.:
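Before the run-by-run attempts below, here is one untested alternative I have in mind: instead of a fixed sleep, poll for the 64 bit ET system's memory-mapped file (/tmp/et_sys_par2) before calling et_open. This only proves the file exists, not that the ET system is fully initialized, so a short grace period is probably still wanted; an expert should also check whether the ET open configuration's wait mode (et_open_config_setwait with ET_OPEN_WAIT, if I have the name right) already handles this case.

  /* Untested sketch: wait for the 64 bit ET file to appear instead of sleeping
   * a fixed 20 s. Only checks file existence, not full ET initialization, so a
   * short grace period is kept after the file shows up. */
  #include <stdio.h>
  #include <unistd.h>
  #include <et.h>

  static int open_et64_when_ready(et_sys_id *id, int max_wait_sec)
  {
      const char *etfile = "/tmp/et_sys_par2";   /* 64 bit ET system file */
      et_openconfig oconfig;
      int waited = 0;
      int status;

      while (access(etfile, F_OK) != 0) {        /* poll until the ET file exists */
          if (waited++ >= max_wait_sec) {
              fprintf(stderr, "etbridge: 64 bit ET file never appeared\n");
              return -1;
          }
          sleep(1);
      }
      sleep(2);                                  /* small grace period after it appears */

      et_open_config_init(&oconfig);
      status = et_open(id, etfile, oconfig);
      et_open_config_destroy(oconfig);
      return (status == ET_OK) ? 0 : -1;
  }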
- Run 4800 - manually hit all buttons in sequence (the ETbridge_bgr script is executed during the go command, so it should really not have an issue finding and connecting to the 2nd ET system if you go slow enough)
- Works
- Run 4801 - now hit restart
- Works
- Run 4802 - hit start, skipping pre-start, from a fresh restarted CODA session
- Event Builder fails to connect to the 2nd ET (this is probably the problem the 20 second sleep was trying to avoid), but is it reproducible?
- Run 4803 - hit start, skipping download, from fresh CODA
- Works
- Run 4804 - repeat 4802
- Works this time, so it isn't guaranteed to fail; it may have been an unrelated CODA issue
- Run 4805 - test new code at line 345 of ET bridge code, now use "et_wait_for_alive(the id for 64 bit ET)"
- Forgot to compile...
- Prior test works though, repeating 4802 a 3rd time (CODA did die after 1k events though...)
- Run 4806 - Actually do "et_wait_for_alive()"
- Fails, "waiting for EB1" indefinitely, meaning the race condition thing probably still exists
- Run 4807 - redo 4806
- Works this time...
- Run 4808 - Add 60 s sleep to the start2ndET part of startcoda (instead of the ETbridge_bgr command as it was before)
- CODA succeeds, even when there is clearly no existing 64 bit ET system, but now the bridge hangs and doesn't attach to the 32 bit ET system, as expected
- Run 4809 - Change to a while loop (instead of "et_wait_for_alive()", now use "while !et_alive(64 bit ET id)" logic)
- Never quite works
- Run 4810 - same as 4809
- Still not working
- This should be looked into by experts; I think the network-connecting nature of the ET bridge defies the standard use of the "alive" commands for detecting ET connections (they probably only work for the initial 32 bit ET system connection?). There is probably a network-pinging type of command that could be employed instead; the file-polling idea sketched above, before the run list, is another option
- Run 4811 - Return everything to normal as it was before testing today, make HALOG
- Works
- Run 4812 - Check the injector config too, for completeness
- Works
Results
My expectations from earlier are confirmed. Exceptions:
- I see that the blocking mode ET bridge doesn't actually appear to care if it isn't able to transfer all of the data given to its input buffer at a fast enough rate; this may have something to do with its blocking vs. non-blocking status
- This insufficient-transfer-rate issue may be worth revisiting for a more precise understanding, but I would advocate never using a blocking ET bridge and just sticking to non-blocking
- The reason my previous non-blocking tests came back with deadtime was a bug in the code that made the bridge always blocking, so that confusion has now been resolved
- I now see that the ROC has its own memory buffer that can store a few thousand events entirely outside of the ET system, meaning that if the ET system gets bogged down and backed all the way up, the ROC will store a few thousand events for us, preventing helicity errors and missed events
- However it is not exactly clear how large this memory buffer is, and it sometimes skips bunches and stops when it is full
- And if you get to this point you are badly behind in time, so the online analyzer is way too late
- And if this happens and the CODA run ends before the ROC releases its stored events, then the next run will get them at the beginning, meaning your data will start already tainted! This is possibly even worse than helicity errors, deadtime, or missed online analyzer events (it explains some of the really bizarre behavior I noticed when I first started doing these tests, and it is a testament to the usefulness of the et_monitor program)
- The 20 second sleep seems to be necessary for DAQ resilience; I don't yet know how to hard-code a check-for-the-64-bit-ET logic in its place, but it should be done soon
Conclusions
We should use a non-blocking ET bridge and either a blocking or non-blocking ET client station for the online analyzer (we may as well go with blocking, so that any jumbling of the event sequence happens at the bridge station in the CODA-monitored 32 bit ET rather than in the relatively hidden 64 bit one).
There is no deadtime in the ROC/DAQ/ADCs as long as the online analyzer client is faster than the event rate (a condition that only matters in blocking ET bridge mode).
There are missed events in the online analyzer only if it is slower than the event rate or if the ET bridge's queue is too short to catch all of the events being thrown at it by the 32 bit ET system (a queue of 500 is big enough to survive 1 kHz, giving half a second of buffering, and should be the default setting).