December 5th, 2018 Testers: Cameron
- Test ET deadtime with et_monitor C++ executable
See also ET Bridge overview
Program: In the bin directory (or the CODA install directory of choice) you should find an already-compiled program called "et_monitor" which accomplishes the same goals as the Java "monitorGui" except without using a GUI - it prints the information directly to your monitor (meaning you will need to either pipe it to another readable output, pause the output using Ctrl+S and Ctrl+Q, or use a large enough screen to see the whole dump of information for each refresh).
Running: Execute the program with
et_monitor -f /path/to/et_sys_par_temp_file
and note that the temporary ET file is likely called /tmp/et_sys_par1 or 2. On the Hall A adaq computers I have placed the 32- and 64-bit versions of et_monitor (which attach to the 32- and 64-bit ET systems, respectively, when they are running) in the ~apar/bin/ folder, appropriately named et_monitor_32 and _64.
Outputs: The et_monitor will reveal the network of stations and buffers operating within the ET system described by the /tmp/et_sys_par* file you pass it. The output is simple and easy to interpret (and significantly easier to interpret if you read the ET Users Guide first). Some things that would be normal to see in standard Prex/Crex DAQ running are as follows:
- 32 Bit ET: connected directly to CODA and event builder, recorder, and rc-gui. Has several ET "stations" running:
- "Grand Central Station": This is where event memory is allocated and cleared for reuse. The Grand Central Station is the only mandatory part of the ET system and is configured automatically by the primary ET process. The user can edit properties such as the total number and size of events to be stored in memory and shuttled around. The C-based ET system utilizes a disk-resident file to keep track of all the components of the system and carries descriptors of each chunk of events (an array of events sent around between stations) so that more complicated parallelized ET station configurations can operate unambiguously and avoid race conditions (not relevant for simple PREX/CREX uses). It is absolutely necessary that there be enough events and long enough buffers that the Grand Central Station never runs out of available units of memory to assign new events for filling with data and writing to "Tape".
- "Tape": This is the "event recorder", where the data that has been filled by CODA is written to disk (and eventually sent to the MSS tape storage system by the user in their own time). It is absolutely imperative that this Tape station receive every single event generated and filled with data by the DAQ so that no events are lost or corrupted, so this station must be run in blocking mode to ensure this.
- "ETBridge": In the 32-bit-to-64-bit paradigm of Prex/Crex, where we will continue using a 32-bit CODA (2.6.2) but also want to analyze the data live with an online analysis program (stripped-down Japan running feedback software), it is necessary to utilize a 64-bit ET system so that Japan can interface directly with CODA and process data as quickly as possible. Because it is impossible to recompile old CODA and ET for 64-bit architecture in the 2.6.2 framework, we instead rely on the readily available ETBridge program, which interfaces separate ET systems as a client program attached to its own station. This station can be run in either blocking or non-blocking mode, depending on whether the user demands that all events be processed by the second ET system attached on the other end of the bridge (blocking mode), or prefers that the primary (32-bit) ET system operate unperturbed by any lag/delay/failure of the second ET system and its stations and client analyzers (non-blocking mode).
- 64 Bit ET:
- "Grand Central Station": Same idea as before, but this one receives already generated events from the ETBridge and puts them into the 64-bit ET cycle.
- "Client": This is a user-defined station that can have a client program attached to do online analysis - note that there will be a lag as large as the "to chunk buffer" in the ETBridge (and setting that chunk size to 1 potentially takes more than 1/1000 Hz = 1 ms per transfer, so be careful when running at high rates, but low rates like 240 Hz should be manageable - I will test this explicitly later).
Deadtime: The concept of DAQ deadtime is a bit difficult to map onto the ET system, as the use of long buffers and chunks of events allows for a substantial separation of processes between different stations and bridged ET systems. One very important distinction: whenever a station or client possesses an event, it holds onto that event until it is done with it or until the client or station is closed (fatally or intentionally - both return the claimed events to the ET cycle flow). The difference between blocking and non-blocking treatment of events is that a blocking station acts like a dam, preventing any events from passing until more space in its input buffer is available (as the result of successfully processing events and placing them into the output buffer, or immediately into the next station's input buffer if that is not also full), while a non-blocking station acts like a plunge pool (in the events-as-water-in-a-stream analogy): the waterfall of incoming events dumps enough into the hole at the bottom, and once that hole is full the water continues to flow over and past it, causing a mis-ordering of events (the initial water/events are now stuck or slowly moving out of the hole while the rest of the water/events flow past indifferently).
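The dam vs. plunge-pool behavior can be sketched with a toy model (purely illustrative - this is not the ET API, and the buffer size and arrival/drain rates are made-up numbers): a producer emits numbered events faster than a station with a fixed input buffer can drain them, and the station either holds the overflow upstream (blocking) or lets it skip past (non-blocking).

```python
def run(mode, n_events=8, buf_size=2):
    """Toy station: 2 events arrive per step, the station drains only 1."""
    buf, out, waiting = [], [], list(range(n_events))
    while waiting or buf:
        for _ in range(2):                    # events arriving this step
            if not waiting:
                break
            if len(buf) < buf_size:
                buf.append(waiting.pop(0))    # room in the input buffer
            elif mode == "nonblocking":
                out.append(waiting.pop(0))    # buffer full: event skips past
            # blocking: the event stays upstream (the DAQ backs up)
        if buf:
            out.append(buf.pop(0))            # station processes one event
    return out

print(run("blocking"))      # every event arrives downstream, in order
print(run("nonblocking"))   # every event arrives, but mis-ordered around the full buffer
```

The blocking run delivers everything in order at the cost of stalling the producer; the non-blocking run never stalls but scrambles the order, exactly the trade-off described above.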
Lag: Note that in both cases (blocking and non-blocking) - filling up the hole until the events overflow (non-blocking), or stopping up the dam until it is opened and more events can come in and do it again (blocking) - the size of the buffer (the hole or dam volume) determines how much lag there is between an event being produced and being passed to the station's client and on to the next station's input buffer. Consequently, adding an ETBridge (a lock system that lets a water treatment plant - the online analyzer - see all the events) necessarily introduces a delay of however much its input buffer is filled by. This is avoidable if the buffer simply never gets backed up, meaning the analyzer in the 64-bit ET is at least as fast as the data production rate, with no hiccups, so that the ETBridge never has to store anything in its buffer (it is fast enough at transferring that its buffer does not fill before it passes the data along, as long as the chunk size is small - but not so small that the chunk-sending rate is slower than the event rate; this is only potentially a problem if you use a high rate and a small chunk size with a non-blocking ETBridge).
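To put rough numbers on this (all values hypothetical, chosen only to illustrate the arithmetic): the lag behind a partially filled buffer is its occupancy divided by the event rate, and a bridge becomes a bottleneck when shipping one chunk takes longer than the DAQ takes to produce that many events.

```python
def lag_seconds(buffered_events, event_rate_hz):
    """Events parked in a buffer delay the analyzer by occupancy / rate."""
    return buffered_events / event_rate_hz

def chunk_is_bottleneck(chunk_size, transfer_time_s, event_rate_hz):
    """The bridge backs up if moving one chunk takes longer than
    producing chunk_size new events."""
    return transfer_time_s > chunk_size / event_rate_hz

rate = 1000.0                                  # 1 kHz helicity flip rate
print(lag_seconds(500, rate))                  # 0.5 s of lag behind a q = 500 buffer
print(chunk_is_bottleneck(1, 0.002, rate))     # True: 2 ms/chunk > 1 ms of data
print(chunk_is_bottleneck(100, 0.002, rate))   # False: 2 ms << 100 ms of data
```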
Possible Aborting: It is feasible that we could introduce a third mode of ETBridge. Instead of blocking (stopping new events from being filled with data at Grand Central Station once the total array of available events is stopped up in the ETBridge dam) or non-blocking (in which case the events stuck in the plunge pool/input buffer of the ETBridge are mixed up and returned - if ever - to the main stream of events out of order), you could have a non-blocking-abort mode: once the input buffer of the bridge is full (determined by querying the input buffer on its occupancy), it "gets" the full input buffer and "puts" it into the output buffer, bypassing analysis of those events entirely, unloading everything into the stream and then beginning on a new buffer (so this acts exactly like a dam that fills to the brim and then naturally releases its load without breaking, overflowing, or permanently stopping the flow). It is also possible to just recycle those events directly into the Grand Central Station with a "dump" so that they don't pollute any downstream stations that may be added (e.g. for a future experiment with multiple online analyzers). Although the ETBridge station is number 3, after the "Tape" process at number 2 - so it doesn't matter if the ETBridge jumbles up the order of events, because the only thing those events have left to do is get recycled by the Grand Central Station - it may still be nice to have such a non-blocking-abort mode enabled, or at least to think about it for future experiments with higher rates and potential for failure.
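The proposed abort behavior could be sketched as follows (a hypothetical mode, not part of the real ET API - the function and buffer names here are invented for illustration): on each pass, if the input buffer is at capacity, the station grabs the entire buffer and forwards it downstream unanalyzed (or discards it for recycling, in the "dump" variant); otherwise it analyzes one event normally.

```python
def abort_mode_step(input_buf, output_buf, capacity, analyze, dump=False):
    """One pass of the hypothetical non-blocking-abort station."""
    if len(input_buf) >= capacity:
        if not dump:
            output_buf.extend(input_buf)   # "get" the lot, "put" it downstream unanalyzed
        input_buf.clear()                  # dump=True: recycle straight to Grand Central
    elif input_buf:
        output_buf.append(analyze(input_buf.pop(0)))   # normal analysis path

# Normal flow: one event analyzed per pass.
inp, out = [1, 2], []
abort_mode_step(inp, out, capacity=3, analyze=lambda e: e)
print(out)          # [1]

# Full buffer: the whole load is released downstream at once, unanalyzed.
inp, out = [1, 2, 3], []
abort_mode_step(inp, out, capacity=3, analyze=lambda e: e)
print(out)          # [1, 2, 3]
```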
Notes on Testing (Cameron's Notebook)
There are 4 combinations of blocking/non-blocking mode across the 2 bridged ET operations: the ETBridge and the online analyzer client program in the 64-bit ET system.
- Runs 4706-4710 - testing etclient and the scalers plugged into it (they don't increment + 1 per pulse because the Tsettle time from the injector is not precisely right to accomplish that, so give up on using scalers for measurement and use et_monitor instead).
Goals for Tests: I want to see what is necessary to record all data collected (1 event per helicity flip) to "Tape" in the 32-bit ET, and try to analyze all/as much as possible in the online analyzer in the 64-bit ET using the ETBridge. Ideally I can plot the analyzer efficiency (analyzed rate/hel flip rate) vs. analysis time/event (a manipulated parameter) and vs. helicity flip rate for all 4 cases, though the interesting case would be a blocking or non-blocking analyzer with a non-blocking bridge (since we do not want the bridge to ever be able to interrupt the Grand Central Station data collection and "Tape" recording). I would also like to plot the DAQ deadtime (missed event rate/helicity flip rate) vs. the same quantities for all 4 options, but this should only ever be able to be >0 when both analyzer and bridge are in blocking mode.
Checking effect of analyzer "client" queue buffer size:
- Using a blocking ETBridge and a blocking analyzer client station (definitely dangerous)
- Run 4713 - client q = 100, 30 Hz injector testing the analyzer still works, set the 2nd ET to have 1k total events (in Start2ndET_bkg script)
- Run 4714 - client q = 100, 1kHz flip rate from helicity control board in counting house crate (manually set in vxworks telnet connection)
- I see that the 64-bit ET buffer will fill up sequentially and keeps feeding the client sequentially with no concern for the client's queue size (i.e. the client queue does not have to be full before the client will do something with the events - it will just churn through whatever it has received already - but if the queue gets full then the ET station input buffer will begin to fill up)
- Once the "filled" buffer of the 64-bit ET gets read by the client, the data stored in the bridge's buffer will show up in the ET station buffer sequentially (skipping no events)
- Once the bridge buffer gets filled, the ETBridge station prevents events from getting produced and further recycled in Grand Central Station
- This causes the read out list (ROL/CRL) to fail to cycle, fail to read out the VQWK ADCs, and misses physics DAQ triggered events
- I see definitively that an explicit 2 ms analysis time will cause the client and then the bridge buffers to fill up and then cause the 1 ms timescale DAQ to slow down to approximately match that analysis time - this is exactly the bad behavior we want to avoid, though the system behaves predictably with no surprises
- Using a blocking ETBridge and a non-blocking analyzer client station (probably safe, unless a process dies in an unexpected way)
- Run 4715 - made a mistake
- Run 4716 - client q = 100, 1kHz flip rate
- The et client analyzer's buffer will now fill up (instead of the station buffer)
- Then the 64-bit ET client station just keeps flowing data from input buffer to output buffer, bypassing the analyzer as long as its buffer is full, but giving it new events (now with missed events in between) when a chunk of events is analyzed and the buffer empties a bit
- The issue here comes from the bridge transfer speed getting maxed out (since it is in blocking mode), such that if data comes into it at a higher rate than it is able to transfer over to the 64-bit ET system then it will back up the DAQ (as before, except now it's not the analyzer's fault, but the user's for defining too fast a flip rate and too small a chunk of events to transfer at a time)
- There may have been a glitch that may have caused a false reading of non-trivial deadtime, so this test should be repeated with very small ETBridge chunks to see if the time-of-transfer is a potential source of generating DAQ deadtime at ~1kHz rates we care about
- Run 4717 - increase client buffer "queue" from 100 to 10,000 and also increase the 64 bit ET buffer to be 10,000 - appears to fix the previous issue, but I was unaware at this time that the bridge was blocking, and I think 4716 may have encountered a network or VQWK read/gate glitch to give it its deadtime, so these tests here need to be redone
- VQWK Errors:
- Runs 4718-4722, even without the 2nd ET client running the ADCs tripped up, so this means don't take Grand Central Station slowness as a guaranteed indicator of problems unless it is repeatable after resetting. Also, we need to determine if the ADCs are tripping because the 1kHz flip rate is too fast or something, which could be another indicator of problems (hopefully not).
- Checking if the 64 bit ET total buffer length or Client buffer size resolve the bridge backup-deadtime issue (still non-blocking client and blocking bridge):
- Run 4722 - set the client buffer back to 100 (from 10,000) - at ~500 Hz analyzer speed (i.e. 2x slower than the data production rate) it will eventually back up and simply miss events, not affecting the bridge - independent of the flip rate, actually
- Run 4723 - now setting the 2nd ET total buffer to 100 (and client to 99, since it needs to be smaller than the total or else it fails) - has no new effect (note that the ET manual says this is effectively putting the client into blocking mode, since basically no events will get past the client station without being stuck in it - just the 100-99 = 1 event per chunk passing)
- Run 4724 - recreate prior behavior with 1k total 64 bit buffer and 100 ET client station buffer - non-blocking client - has no effect on deadtime in the bridge or in the DAQ - this leads me to believe that the apparent DAQ deadtime in run 4716 was actually due to a vqwk readout error and not slowness of the analyzer/bridge
- Run 4277 - try again, but now in blocking client mode and with small buffers - now it fails as expected, and it means either I failed to recompile the ET client header's .so files or there was a vqwk error. I need to be more careful now when changing these settings around
- Likely Conclusions:
- Blocking client or bridge is safe iff the analyzer is faster than the Helicity flip rate/trigger.
- Blocking of the client, non-blocking of the bridge is always safe, but the analyzer will miss the data that the bridge skips over, and the chunk size in the bridge-to-2nd-ET transfer can be a bottleneck
- Non-blocking of the client and of the bridge is safe, analyzer can miss still
- Non-blocking of the client, but blocking of the bridge is potentially not safe if the chunk size for passing events into the analyzer takes more time to pass those events than it takes for the DAQ to produce that many events (i.e. if time/event * chunk size > time to pass a chunk)
- These conclusions need to be tested with small queues and buffers and for 30 Hz (injector) and 1kHz (helicity control board) again, using the helicity predictor algorithm as a master-missed-events deadtime indicator
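As a compact (and still tentative, pending the retests noted above) encoding of these four conclusions, the question "can the 32-bit DAQ itself back up?" depends only on the bridge mode, the client mode, and whether the analyzer and the chunk transfer keep up with the flip rate:

```python
def daq_can_back_up(bridge_blocking, client_blocking,
                    analyzer_keeps_up, transfer_keeps_up):
    """Tentative rules of thumb from the runs above (to be re-verified)."""
    if not bridge_blocking:
        return False   # a non-blocking bridge can never stall the 32-bit ET
    if client_blocking and not analyzer_keeps_up:
        return True    # blocking client + blocking bridge + slow analyzer
    if not transfer_keeps_up:
        return True    # blocking bridge, chunk transfer slower than the event rate
    return False       # otherwise the analyzer just misses events; the DAQ is safe

# The dangerous case from the blocking/blocking tests (2 ms analysis at 1 kHz):
print(daq_can_back_up(True, True, analyzer_keeps_up=False, transfer_keeps_up=True))    # True
# Non-blocking bridge: always safe for the DAQ, whatever the analyzer does:
print(daq_can_back_up(False, True, analyzer_keeps_up=False, transfer_keeps_up=False))  # False
```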
Testing the non-blocking bridge:
- Non-blocking ET client station:
- Run 4729 - using a queue size of 10 in the bridge misses 90/100 of the data (32 bit ET system has 100 length queue)
- Run 4730 - increase the queue of the bridge to 100, now it grabs all of the events, but the transfer to the 64 bit ET is ~1/2 the speed we need (1kHz)
- Run 4731 - set bridge q = 175, 64 bit ET is still slow, and misses chunks of ~100 events at a time
- Run 4732 - q = 500, works! See ET manual section 5.7 for details
- Run 4733 - q = 175 double check, yup same issues
- Run 4734 - q = 60,000, to test if the queue must be filled before passing events on - nope, it is merely an upper limit
- Run 4735 - q = 100, now set chunk size = 10 (was 100 default) - interestingly this still misses events, and it looks like the DAQ gets backed up - is this a vqwk error again, or is it real, meaning the non-blocking bridge can be a problem with chunk size even in non-blocking mode? Needs to be double checked
- Run 4736 - q = 100 double check, starts off with the same ~25% deadtime, but then halfway through the run it recovered and went to expected 0% deadtime - vqwk glitch confirmation?
- Blocking ET client station:
- Run 4737 - q = 10, c = 100 (default) - bridge is too slow, drops 90/100 events
- Run 4738 - q = 500, c = 100 - works for fast Analyzer
- Slow analyzer: once the analyzer client station has backed up, the whole 64-bit ET system slows down (as expected); this slowness is then picked up by the non-blocking bridge even though its buffer doesn't fill up, and the 32-bit ET doesn't care and continues happily (as expected)
- Run 4740 - set the ET buffers small to see if the ET bridge affects the 32 bit system or misses its data (as expected in non-blocking mode)
- Yes, once you fill up the client buffer and the entire 64-bit ET buffer, the bridge will simply throw data at the 2nd ET system and the excess will skip into the output buffer and back to the Grand Central Station of the 32-bit ET unperturbed
- if the buffers are big enough then there will be no issue with skipping events in the online analyzer, but at least the 32 bit ET system is unable to be affected in this setup - safe!
- Run 4742 - return DAQ to standard behavior
Some notes from the ET manual:
- For starting the bridge after the 2nd ET system is alive (to avoid crashes), instead of "sleep 20" we can use one of the following: et_open_config_setWait(), et_bridge_config_setMode(ET_TIMED), or most promisingly et_wait_for_alive(/tmp/et_sys_par2)
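If those ET calls prove awkward to wire into the startup scripts, a cruder stand-in (a sketch only; it assumes the 2nd ET system creates its memory-mapped /tmp/et_sys_par2 file once alive, and note that the file appearing does not strictly guarantee initialization has finished) is to poll for the file instead of a fixed "sleep 20":

```python
import os
import time

def wait_for_et_file(path, timeout_s=60.0, poll_s=0.5):
    """Poll until the ET system's memory-mapped file exists, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return False

# e.g. in the bridge startup script, before attaching to the 2nd ET:
# if not wait_for_et_file("/tmp/et_sys_par2"):
#     raise RuntimeError("2nd ET system never came alive")
```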
- We should ensure that the ETBridge is always the last element in the chain, so that any non-blocking event skipping will never cause the events to get jumbled up on their way to being recorded in "Tape" station
- Section 4.6 - serial event flow
- Event memory is finite, which means that a client (or bridge) can hold onto any given event forever with no consequence on the "Tape" recorder, as long as the recorder is upstream
- But, you should never set the input/output list size for the client (or bridge) to be ~ the total number of events in the ET system with the "Tape" recorder in it, since the memory would get used up (in blocking or non-blocking mode) and further event production would be prevented
- Does a blocked 32-bit ET Grand Central (i.e. the bridge filled up and slows it down) actually prevent the CRL from executing and reading out the vqwks, or does it just fail to store those events' information into memory? And if the Grand Central Station glitches or slows down due to external effects, does the CRL care? Is there a safe workaround? - No clear answers right now, but if we avoid slowness in the 32-bit ET system then these are inconsequential
- To solve the issue of large EPICS data reads crashing the ET system, just use the virtual memory allocation technique with -r or -s, whichever it is