FetchOpHandle Error after using control_flow.Switch() #10490

Closed
sefira opened this issue May 8, 2018 · 4 comments
Labels
User (used to label user issues)

Comments

@sefira
Contributor

sefira commented May 8, 2018

I want to implement an exponential decay learning rate policy with warmup in https://github.com/sefira/models/blob/ssd_coco/fluid/object_detection/utility.py#L119.

But I found some strange behavior in control_flow.Switch().
If I use this code:

        with control_flow.Switch() as switch:
            with switch.case(global_step < WARM_UP_ITERS):
                alpha = global_step / WARM_UP_ITERS
                warmup_factor = WARM_UP_FACTOR * (1 - alpha) + alpha
                warmup_val = (values[0] * warmup_factor)
                tensor.assign(warmup_val, lr)
            for i in range(len(boundaries)):
                boundary_val = tensor.fill_constant(
                    shape=[1], dtype='float32', value=float(boundaries[1]))
                value_var = tensor.fill_constant(
                    shape=[1], dtype='float32', value=float(values[1]))
                with switch.case(global_step < boundary_val):
                    tensor.assign(value_var, lr)
            with switch.default():
                last_value_var = tensor.fill_constant(
                    shape=[1],
                    dtype='float32',
                    value=float(values[len(values) - 1]))
                tensor.assign(last_value_var, lr)

The above code runs. But with the following code:

        with control_flow.Switch() as switch:
            with switch.case(global_step < WARM_UP_ITERS):
                alpha = global_step / WARM_UP_ITERS
                warmup_factor = WARM_UP_FACTOR * (1 - alpha) + alpha
                warmup_val = (values[0] * warmup_factor)
                tensor.assign(warmup_val, lr)
            boundary_val = tensor.fill_constant(
                shape=[1], dtype='float32', value=float(boundaries[1]))
            value_var = tensor.fill_constant(
                shape=[1], dtype='float32', value=float(values[1]))
            with switch.case(global_step < boundary_val):
                tensor.assign(value_var, lr)
            with switch.default():
                last_value_var = tensor.fill_constant(
                    shape=[1],
                    dtype='float32',
                    value=float(values[len(values) - 1]))
                tensor.assign(last_value_var, lr)

I get the following crash:

*** Aborted at 1525768737 (unix time) try "date -d @1525768737" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 40098 (TID 0x7f6e3fbd1700) from PID 0; stack trace: ***
    @       0x318b20f500 (unknown)
    @     0x7f738b7d52d2 paddle::framework::details::FetchOpHandle::RunImpl()
    @     0x7f738b7d8d9a paddle::framework::details::OpHandleBase::Run()
    @     0x7f738b7cef2c _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIvEES3_ESt12_Bind_simpleIFSt17reference_wrapperISt5_BindIFZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNSF_13BlockingQueueIPNSF_13VarHandleBaseEEEPNSF_12OpHandleBaseEEUlvE_vEEEvEEvEEE9_M_invokeERKSt9_Any_data
    @     0x7f738b6854af std::__future_base::_State_baseV2::_M_do_set()
    @       0x318b20cb23 (unknown)
    @     0x7f738b7cd3e8 _ZNSt17_Function_handlerIFvvEZN10ThreadPool7enqueueIRZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS5_13BlockingQueueIPNS5_13VarHandleBaseEEEPNS5_12OpHandleBaseEEUlvE_JEEESt6futureINSt9result_ofIFT_DpT0_EE4typeEEOSI_DpOSJ_EUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f738b7d2779 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN10ThreadPoolC4EmEUlvE_vEEE6_M_runEv
    @     0x7f742ee2b640 execute_native_thread_routine
    @       0x318b207851 (unknown)
    @       0x318aee767d (unknown)
    @                0x0 (unknown)

The only difference between the two snippets is that the latter removes the for loop.

@Xreki added the User (used to label user issues) label May 8, 2018
@jacquesqiao
Member

Can you test with Executor instead of ParallelExecutor?
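
For reference, a minimal sketch of such a test with the plain Executor (loss, lr, and feed_data are placeholders for the variables and feed dict built by the actual training program, not the exact script from this issue):

    import paddle.fluid as fluid

    # Assume loss and lr are variables from the training program, with lr
    # being the learning-rate tensor assigned inside the Switch block.
    place = fluid.CUDAPlace(0)  # or fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    # Single-device run; feed_data stands in for the real feed dict.
    loss_v, lr_v = exe.run(fluid.default_main_program(),
                           feed=feed_data,
                           fetch_list=[loss, lr])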

@sefira
Contributor Author

sefira commented May 9, 2018

Using Executor does not report this error.
Moreover, if I don't fetch learning_rate.name, there is no error even when using ParallelExecutor (found by Wang Haoshaung).
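
In other words, with ParallelExecutor the crash appears only when the learning-rate variable is in the fetch list. A minimal sketch of the difference (illustrative names; loss and lr stand in for the variables built by the training program, data feeding omitted):

    import paddle.fluid as fluid

    # Assume loss and lr are variables from the training program.
    pe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)

    # Fetching only the loss runs without error.
    results = pe.run(fetch_list=[loss.name])

    # Adding the learning-rate variable to the fetch list triggers the
    # SIGSEGV in FetchOpHandle::RunImpl shown in the stack trace above.
    results = pe.run(fetch_list=[loss.name, lr.name])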

@chengduoZH
Contributor

chengduoZH commented May 9, 2018

if I don't fetch learning_rate.name, there is no error even when using ParallelExecutor

It has been fixed by PR #10454. Please pull the latest code.

@sefira
Contributor Author

sefira commented May 21, 2018

Fixed.

@sefira closed this as completed May 21, 2018