Ticket #351 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

the jobq aborter is too risky

Reported by: jesus Assigned to: jesus
Priority: major Milestone:
Component: noit-core Severity: serious
Keywords: Cc:

Description

The current jobq aborter uses pthread_kill to send a signal to the thread; the signal is caught and a siglongjmp is performed.

The known problem with that approach is that you can jump out while holding a malloc mutex, or out of something else truly evil, and that wrecks everything.

That happens "too often", so this last-ditch, risky implementation is actually too risky.

We need something that can just signal into the thread and then check back later to see if we need to outright explode at some point to get restarted.
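A minimal sketch of that "ask, then check back later" idea, not the noit implementation. It swaps the real signal for an atomic flag that the job polls, and every name in it (demo_job_t, demo_worker, demo_abort_job, demo_run) is hypothetical. The worker only looks at the flag between slices of work, so it never unwinds while holding a lock. The aborter waits out a grace period and reports failure if the job is still running, which is the point where a daemon could choose to abort() and let a supervisor restart it.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical job record; these names are illustrative, not from noit. */
typedef struct {
  atomic_bool cancel_requested; /* set by the aborter */
  atomic_bool finished;         /* set by the worker when it returns */
  bool cooperative;             /* false simulates a job that ignores requests */
} demo_job_t;

static void demo_sleep_ms(long ms) {
  struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
  nanosleep(&ts, NULL);
}

static void *demo_worker(void *arg) {
  demo_job_t *job = arg;
  /* Work in small slices and poll the flag only between slices,
   * i.e. at points where no locks are held. */
  for (int i = 0; i < 200; i++) {
    if (job->cooperative && atomic_load(&job->cancel_requested)) break;
    demo_sleep_ms(5);
  }
  atomic_store(&job->finished, true);
  return NULL;
}

/* Request cancellation, then check back for up to grace_ms.
 * Returns 0 if the job stopped in time, -1 if it did not.
 * On -1 a real daemon might abort() so that it gets restarted. */
static int demo_abort_job(demo_job_t *job, long grace_ms) {
  atomic_store(&job->cancel_requested, true);
  for (long waited = 0; waited < grace_ms; waited += 5) {
    if (atomic_load(&job->finished)) return 0;
    demo_sleep_ms(5);
  }
  return atomic_load(&job->finished) ? 0 : -1;
}

/* Start one job, try to abort it, and return the aborter's verdict. */
static int demo_run(bool cooperative, long grace_ms) {
  demo_job_t job;
  pthread_t tid;
  atomic_init(&job.cancel_requested, false);
  atomic_init(&job.finished, false);
  job.cooperative = cooperative;
  if (pthread_create(&tid, NULL, demo_worker, &job) != 0) return -2;
  int rv = demo_abort_job(&job, grace_ms);
  pthread_join(tid, NULL); /* the demo worker always ends on its own */
  return rv;
}
```

Calling demo_run(true, 200) models a job that honors the request; demo_run(false, 50) models one that does not, which is the case that would need the "outright explode" fallback.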

NOTE: investigate how pthread_cancel would interact with this.

Change History

02/20/11 15:51:00 changed by jesus

(In [1557]) This patch does a lot, all refs #351

  • fix up the test harness to support noitd restarts and expected crashes
  • Add different cancellation methodologies to the jobq implementation
    • "evil_brutal" which is the old siglongjmp way.
    • "cancel_deferred" which uses pthread_cancel w/ PTHREAD_CANCEL_DEFERRED
    • "cancel_asynch" which uses pthread_cancel w/ PTHREAD_CANCEL_ASYNCHRONOUS
  • Add a game over scenario if the cooperative cancellation mechanisms don't work and end up exhausting all the threads in a pool.
  • Reduce the minimum check period set via REST to 1s to enable better testing. NOTE: maybe this should be much smaller even.
  • Change the thread pool system to spawn as new jobs are queued. This isn't automatic demand-driven sizing, but rather we don't start the (N) threads until (N) events arrive (not necessarily concurrently).
  • Added a test_abort module that runs different types of faux workloads to assist in testing the functional correctness of each method. Workloads vary in work time and cancellation method, and are either interruptible (nanosleep) or uninterruptible (compute).
  • Added fairly thorough tests for each method under each workload condition. Tested on darwin (where cancel_asynch did not work well). Needs testing on other platforms.
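
For reference, a minimal sketch of what deferred cancellation looks like with plain POSIX calls. This is not the code from [1557]; the names cancellable_job, release_buffer and cancel_deferred_demo are made up for illustration. With PTHREAD_CANCEL_DEFERRED the thread only exits at a cancellation point (nanosleep here), and the cleanup handler releases what the job held, which is what makes it safer than jumping out via siglongjmp.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

static int cleanup_ran = 0; /* written by the worker, read after join */

/* Cleanup handler: runs when the thread is cancelled (or on pop(1)). */
static void release_buffer(void *arg) {
  free(arg);
  cleanup_ran = 1;
}

/* Hypothetical interruptible job: sleeps in short slices. */
static void *cancellable_job(void *arg) {
  (void)arg;
  int old;
  pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
  pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);

  char *buf = malloc(4096); /* resource that must not leak */
  pthread_cleanup_push(release_buffer, buf);
  for (int i = 0; i < 1000; i++) {
    struct timespec ts = { 0, 10 * 1000000L };
    nanosleep(&ts, NULL); /* nanosleep is a cancellation point */
  }
  pthread_cleanup_pop(1); /* normal exit also releases the buffer */
  return NULL;
}

/* Start the job, cancel it, and confirm it exited via cancellation
 * with its cleanup handler run. Returns 0 on success, -1 otherwise. */
static int cancel_deferred_demo(void) {
  pthread_t tid;
  void *result = NULL;
  struct timespec ts = { 0, 50 * 1000000L };

  cleanup_ran = 0;
  if (pthread_create(&tid, NULL, cancellable_job, NULL) != 0) return -1;
  nanosleep(&ts, NULL);   /* let the job get going */
  pthread_cancel(tid);    /* request is acted on at the next cancellation point */
  pthread_join(tid, &result);
  return (result == PTHREAD_CANCELED && cleanup_ran) ? 0 : -1;
}
```

A compute-bound job that never reaches a cancellation point would not be stopped by this, which is the uninterruptible case the test workloads above are meant to exercise.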

02/21/11 14:56:08 changed by jesus

  • summary changed from the jobq aborter is to risky to the jobq aborter is too risky.

03/31/11 02:01:51 changed by jesus

  • status changed from new to closed.
  • resolution set to fixed.

This appears to be working better now.