Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new platform management API. Currently with support for chassis, PSUs, fans and watchdog #13

Merged
merged 17 commits into from
Dec 4, 2018
Merged

Conversation

jleveque
Copy link
Contributor

@jleveque jleveque commented Sep 13, 2018

  • New platform API is installed as a Python package, "sonic_platform_base" which consists of abstract base classes for each type of platform peripheral device

  • New API is object oriented and is structured in a hierarchy based on the relationship of devices. Root class is "platform" from which one can interact with all available chassis on the platform. From each chassis object, one can interact with all available fan modules, PSUs, SFP transceivers, etc. available on that chassis, etc.

  • Platform vendors will create a concrete implementation of this package, resulting in a platform-specific "sonic_platform" package, one per platform. The method of implementation will be very similar to the current "plugin" based model, but instead of creating individual plugins, the vendor will create an entire Python package which will be loaded by applications.

  • This commit provides abstract base classes for platform, chassis, fan modules and PSUs. Classes for other devices will be added incrementally. This package will get installed alongside the old platform API packages until it is complete and all platforms have implemented it. At that point in time, we can remove the old packages.

@jleveque jleveque self-assigned this Sep 13, 2018
Copy link
Contributor

@Staphylo Staphylo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall.
There will be some overlap with existing drivers and errors codes / exceptions could be defined.
For example if a power supply has its fans in read only mode, what should be the result of set_fan_speed.

sonic_platform_base/chassis_base.py Outdated Show resolved Hide resolved
sonic_platform_base/platform_base.py Show resolved Hide resolved
return None

@abc.abstractmethod
def get_reboot_cause(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.
Some platform might also have limitation on the number of reboot causes that can be stored.

I would suggest either a default parameter clear=False to this method or to add a new one called clear_reboot_cause.
Also I think this function should only be called once with clear=True and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.

Copy link
Contributor Author

@jleveque jleveque Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.

I would suggest either a default parameter clear=False to this method or to add a new one called clear_reboot_cause.
Also I think this function should only be called once with clear=True and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.

Since the need to clear the reboot cause after reading is platform-specific functionality, I believe this logic should be left to the platform-specific implementation of the function. If the platform requires clearing the reboot cause, it should be done in the implementation and a file saved to tmpfs. The function can check for the presence of the file first before attempting to read/clear the register. I prefer this to adding a clear=True parameter to the base method, and I think adding a clear_reboot_cause() method is unnecessary, because there should be no need to explicitly clear the reboot cause.

Some platform might also have limitation on the number of reboot causes that can be stored.

If you're concerned that not all platforms will support all reboot causes, that is expected. The base class will define all possible reboot causes that SONiC understands; the platform-specific implementation will simply return the subset which is applicable to that platform. If this doesn't address your concern, please elaborate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.
I would suggest either a default parameter clear=False to this method or to add a new one called clear_reboot_cause.
Also I think this function should only be called once with clear=True and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.

Since the need to clear the reboot cause after reading is platform-specific functionality, I believe this logic should be left to the platform-specific implementation of the function. If the platform requires clearing the reboot cause, it should be done in the implementation and a file saved to tmpfs. The function can check for the presence of the file first before attempting to read/clear the register. I prefer this to adding a clear=True parameter to the base method, and I think adding a clear_reboot_cause() method is unnecessary, because there should be no need to explicitly clear the reboot cause.

I would assume all the platforms to behave the same way with regard to the reboot causes. However this information would have to be crowd-sourced from other vendors as well for more relevant data.
I'm fine with letting the responsibility of clearing and caching the reboot cause to the plugin implementation.
Is there a place in SONiC where such cache files should go? Currently almost everything is backed by the flash including /tmp. The only place that is a tmpfs and would feel "suitable" seems to be /run. Is there some guidelines somewhere regarding this? I would be interested to know the reason why /tmp is not a tmpfs.

Some platform might also have limitation on the number of reboot causes that can be stored.

If you're concerned that not all platforms will support all reboot causes, that is expected. The base class will define all possible reboot causes that SONiC understands; the platform-specific implementation will simply return the subset which is applicable to that platform. If this doesn't address your concern, please elaborate.

My concern was a bit different. Your previous answer ended up answered my question but I'll elaborate a bit more.
Basically the concept of a reboot cause is to be stored on an external device until being read.
These devices usually have space restriction and can only save a given number of reboot cause after which they won't store anything until being cleared. For example if your device can store 2 reboot cause and you trigger 2 watchdogs and then 1 powerloss, you might not see the powerloss if you haven't cleared the reboot cause after each reboot.

sonic_platform_base/chassis_base.py Outdated Show resolved Hide resolved
sonic_platform_base/fan_base.py Outdated Show resolved Hide resolved
return 0

@abc.abstractmethod
def get_speed_tolerance(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is only for the fan daemon logic usage?

Copy link
Contributor Author

@jleveque jleveque Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the initial intention. We may find other uses in the future, however. Is there a concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concerns. I'm was just curious if the fand was in the roadmap.

"""
return 0

@abc.abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this method have to be abstract or a default implementation that is always overridable would do?

Copy link
Contributor Author

@jleveque jleveque Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming tolerance is most commonly implemented as a static attribute, we could define a tolerance member and have that returned here as the default implementation. Do you think that would suffice for most implementations?

Sets the fan speed

Args:
speed: An integer, the percentage of full fan speed to set fan to,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see everywhere that the fan speed is an int from 0 to 100.
Some device dealing with fans can have finer grained control (like 0 to 255) which means loosing a bit of precision.
Was this taken into consideration and integer deemed better than floats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer a 0.0 ... 1.0 float scale? This should allow for finer-grained control if necessary, while maintaining a more universal scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind either way, integer percentages 0 to 100 sounds good to me.
I just wanted to bring this topic up as the sysfs interface defines a pwm value to be ranging from 0 to 255.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all platforms used sysfs, the 0-255 scale would be fine. However, we'd like a universal scale that will work with other interfaces (IPMI, etc.).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any strong preference. I just wanted to bring that topic to make sure it wasn't overlooked.
A percentage sounds perfect to me.

sonic_platform_base/psu_base.py Outdated Show resolved Hide resolved
@keboliu
Copy link
Collaborator

keboliu commented Oct 16, 2018

one general question, previously when add a new API to the base class, need to implement a stub to all sub class for each platform (as sonic-net/sonic-buildimage#2017 described), can this be avoided in the new implementation?

…otImplementedError

 - We plan to implement a build-time or run-time check for all methods
lacking platform-specific implemention and will output a list of them.
@jleveque
Copy link
Contributor Author

jleveque commented Nov 5, 2018

@keboliu:

one general question, previously when add a new API to the base class, need to implement a stub to all sub class for each platform (as Azure/sonic-buildimage#2017 described), can this be avoided in the new implementation?

This has been address in: bd6b6ed. In lieu of an abstract base class, we will implement either a build-time or run-time check that will output all methods which are not implemented in order to alert vendors to methods they may have missed.

# watchdog_base.py
#
# Abstract base class for implementing a platform-specific class with which
# to interact with a hardware watchdog module in SONiC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know if the platform has such capability or not? do we have query API for this?

Copy link
Contributor Author

@jleveque jleveque Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the need for additional functions with which to query device/capability availability, the current design is such that the querying is done by getting the device object. If the object is null (None), the device/capability is not available on the platform. For example:

#!/usr/bin/env python

import sonic_platform

platform = sonic_platform.platform.Platform()

chassis = platform.get_chassis()
if not chassis:
    print("Error getting chassis!")
    sys.exit(1)

watchdog = chassis.get_watchdog()
if not watchdog:
    print("Watchdog not present!")
else:
    watchdog.arm(10)

status='1' represents device inserted,
status='0' represents device removed. Ex. {'0': '1', '1': '0'}
"""
raise NotImplementedError
Copy link

@padmanarayana padmanarayana Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to get confirmation that if a platform takes a Baseboard Management Controller (BMC) based approach (to handle thermals, events etc), these (and the other such APIs) can remain unimplemented..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention here is for daemons to receive notifications of changes in platform hardware state in order to update SONiC's view of device state. For instance, if a tranceiver, PSU or fan tray is added or removed. Can you provide an example of an event handled by the BMC which SONiC would not need to be made aware of?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the events referred to is specifically to track the presence/absence i..e OIR of the modules (essentially this API seems to be a generalization of the get_transceiver_change_event()).

If the BMC implements the thermals, an addition/removal of FAN tray would be completely handled by the BMC itself - but atleast for generating a syslog this API needs to be implemented for SONiC..

@jleveque jleveque changed the title Add new platform management API. Currently with support for chassis, PSUs and fans Add new platform management API. Currently with support for chassis, PSUs, fans and watchdog Nov 28, 2018
@jleveque jleveque merged commit d4bf78c into sonic-net:master Dec 4, 2018
@jleveque jleveque deleted the new_platform_api branch December 4, 2018 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants