FetchOpHandle Error after using control_flow.Switch() #10490

Closed
sefira opened this issue May 8, 2018 · 4 comments
Labels
User (used to label user issues)

Comments

@sefira
Contributor

sefira commented May 8, 2018

I want to implement an exponential decay learning rate policy with warmup in https://github.com/sefira/models/blob/ssd_coco/fluid/object_detection/utility.py#L119.

But I found some strange behavior in control_flow.Switch().
If I use this code:

        with control_flow.Switch() as switch:
            with switch.case(global_step < WARM_UP_ITERS):
                alpha = global_step / WARM_UP_ITERS
                warmup_factor = WARM_UP_FACTOR * (1 - alpha) + alpha
                warmup_val = (values[0] * warmup_factor)
                tensor.assign(warmup_val, lr)
            for i in range(len(boundaries)):
                boundary_val = tensor.fill_constant(
                    shape=[1], dtype='float32', value=float(boundaries[1]))
                value_var = tensor.fill_constant(
                    shape=[1], dtype='float32', value=float(values[1]))
                with switch.case(global_step < boundary_val):
                    tensor.assign(value_var, lr)
            with switch.default():
                last_value_var = tensor.fill_constant(
                    shape=[1],
                    dtype='float32',
                    value=float(values[len(values) - 1]))
                tensor.assign(last_value_var, lr)

The above code runs. But with the following code:

        with control_flow.Switch() as switch:
            with switch.case(global_step < WARM_UP_ITERS):
                alpha = global_step / WARM_UP_ITERS
                warmup_factor = WARM_UP_FACTOR * (1 - alpha) + alpha
                warmup_val = (values[0] * warmup_factor)
                tensor.assign(warmup_val, lr)
            boundary_val = tensor.fill_constant(
                shape=[1], dtype='float32', value=float(boundaries[1]))
            value_var = tensor.fill_constant(
                shape=[1], dtype='float32', value=float(values[1]))
            with switch.case(global_step < boundary_val):
                tensor.assign(value_var, lr)
            with switch.default():
                last_value_var = tensor.fill_constant(
                    shape=[1],
                    dtype='float32',
                    value=float(values[len(values) - 1]))
                tensor.assign(last_value_var, lr)

I get the following crash:

*** Aborted at 1525768737 (unix time) try "date -d @1525768737" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 40098 (TID 0x7f6e3fbd1700) from PID 0; stack trace: ***
    @       0x318b20f500 (unknown)
    @     0x7f738b7d52d2 paddle::framework::details::FetchOpHandle::RunImpl()
    @     0x7f738b7d8d9a paddle::framework::details::OpHandleBase::Run()
    @     0x7f738b7cef2c _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIvEES3_ESt12_Bind_simpleIFSt17reference_wrapperISt5_BindIFZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNSF_13BlockingQueueIPNSF_13VarHandleBaseEEEPNSF_12OpHandleBaseEEUlvE_vEEEvEEvEEE9_M_invokeERKSt9_Any_data
    @     0x7f738b6854af std::__future_base::_State_baseV2::_M_do_set()
    @       0x318b20cb23 (unknown)
    @     0x7f738b7cd3e8 _ZNSt17_Function_handlerIFvvEZN10ThreadPool7enqueueIRZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS5_13BlockingQueueIPNS5_13VarHandleBaseEEEPNS5_12OpHandleBaseEEUlvE_JEEESt6futureINSt9result_ofIFT_DpT0_EE4typeEEOSI_DpOSJ_EUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f738b7d2779 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN10ThreadPoolC4EmEUlvE_vEEE6_M_runEv
    @     0x7f742ee2b640 execute_native_thread_routine
    @       0x318b207851 (unknown)
    @       0x318aee767d (unknown)
    @                0x0 (unknown)

The only difference between the two snippets is that the latter removes the for loop.

@Xreki added the User (used to label user issues) label May 8, 2018
@jacquesqiao
Member

Can you test with Executor instead of ParallelExecutor?
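
For reference, a minimal sketch of such a test with the plain Executor (loss, lr, and feed_data are placeholders for the variables and feed dict built by the actual training program, not the exact script from this issue):

    import paddle.fluid as fluid

    # Assume loss and lr are variables from the training program, with lr
    # being the learning-rate tensor assigned inside the Switch block.
    place = fluid.CUDAPlace(0)  # or fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    # Single-device run; feed_data stands in for the real feed dict.
    loss_v, lr_v = exe.run(fluid.default_main_program(),
                           feed=feed_data,
                           fetch_list=[loss, lr])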

@sefira
Contributor Author

sefira commented May 9, 2018

Using Executor does not report this error.
Moreover, if I don't fetch learning_rate.name, there is no error even when using ParallelExecutor (found by Wang Haoshaung).
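
In other words, with ParallelExecutor the crash appears only when the learning-rate variable is in the fetch list. A minimal sketch of the difference (illustrative names; loss and lr stand in for the variables built by the training program, data feeding omitted):

    import paddle.fluid as fluid

    # Assume loss and lr are variables from the training program.
    pe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)

    # Fetching only the loss runs without error.
    results = pe.run(fetch_list=[loss.name])

    # Adding the learning-rate variable to the fetch list triggers the
    # SIGSEGV in FetchOpHandle::RunImpl shown in the stack trace above.
    results = pe.run(fetch_list=[loss.name, lr.name])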

@chengduoZH
Contributor

chengduoZH commented May 9, 2018

if I don't fetch learning_rate.name, there is no error even when using ParallelExecutor

It has been fixed by PR #10454. Please pull the latest code.

@sefira
Contributor Author

sefira commented May 21, 2018

Fixed.

@sefira closed this as completed May 21, 2018