[discussion]sched: a rough proposal to enable power saving in scheduler

Posted by Alex Shi 
On Mon, Aug 20, 2012 at 03:47:54PM +0000, Christoph Lameter wrote:

> So please make sure that there are obvious and easy ways to switch this
> stuff off or provide a "low latency" knob that keeps the system from
> assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
already do what you want?

--
Matthew Garrett | mjg59@srcf.ucam.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On 17 August 2012 10:43, Paul Turner wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in the CFS scheduler, I have a
>>> very rough idea for enabling a new power saving scheme in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It is based on the following assumptions:
>>> 1, If many tasks are crowded in the system, letting only a few domains' cpus
>>> run while the other cpus idle cannot save power. Letting all cpus take the
>>> load, finish the tasks early, and then go idle will save more power
>>> and give a better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>>
>>> 2, schedule domains and schedule groups perfectly match the hardware and
>>> the power consumption units. So pulling tasks out of a domain means
>>> this power consumption unit can potentially go idle.
>>
>> I'm not sure I understand what you're saying, sorry.
>>
>>> So, following what Peter mentioned in commit 8e7fbcbc22c (sched: Remove stale
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kinds of policy: performance, power.
>>
>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>> between the two based on AC/BAT, UPS status and simple things like that.
>> But this seems like a later concern, you have to have something to pick
>> between before you can pick :-)
>>
>>> And in scheduling, 2 places will care about the policy: load_balance() and,
>>> on task fork/exec, select_task_rq_fair().
>>
>> ack
>>
>>> Here is some pseudo code that tries to explain the proposed behaviour in
>>> load_balance() and select_task_rq_fair();
>>
>> Oh man.. A few words outlining the general idea would've been nice.
>>
>>> load_balance() {
>>>     update_sd_lb_stats(); //get busiest group, idlest group data.
>>>
>>>     if (sd->nr_running > sd's capacity) {
>>>         //power saving policy is not suitable for
>>>         //this scenario, it runs like performance policy
>>>         mv tasks from busiest cpu in busiest group to
>>>         idlest cpu in idlest group;
>>
>> Once upon a time we talked about adding a factor to the capacity for
>> this. So say you'd allow 2*capacity before overflowing and waking
>> another power group.
>>
>> But I think we should not go on nr_running here, PJTs per-entity load
>> tracking stuff gives us much better measures -- also, repost that series
>> already Paul! :-)
>
> Yes -- I just got back from Africa this week. It's updated for almost
> all the previous comments but I ran out of time before I left to
> re-post. I'm just about caught up enough that I should be able to get
> this done over the upcoming weekend. Monday at the latest.
>
>>
>> Also, I'm not sure this is entirely correct, the thing you want to do
>> for power aware stuff is to minimize the number of active power domains,
>> this means you don't want idlest, you want least busy non-idle.
>>
>>>     } else {// the sd has enough capacity to hold all tasks.
>>>         if (sg->nr_running > sg's capacity) {
>>>             //imbalanced between groups
>>>             if (schedule policy == performance) {
>>>                 //when 2 busiest group at same busy
>>>                 //degree, need to prefer the one has
>>>                 // softest group??
>>>                 move tasks from busiest group to
>>>                 idlest group;
>>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>>             } else if (schedule policy == power)
>>>                 move tasks from busiest group to
>>>                 idlest group until busiest is just full
>>>                 of capacity.
>>>                 //the busiest group can balance
>>>                 //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains. The old code (that got deleted) did
>> something like that, you can revive some of that code if needed -- I
>> just killed everything to be able to start with a clean slate.
>>
>>
>>>         } else {
>>>             //all groups has enough capacity for its tasks.
>>>             if (schedule policy == performance)
>>>                 //all tasks may has enough cpu
>>>                 //resources to run,
>>>                 //mv tasks from busiest to idlest group?
>>>                 //no, at this time, it's better to keep
>>>                 //the task on current cpu.
>>>                 //so, it is maybe better to do balance
>>>                 //in each of groups
>>>                 for_each_imbalance_groups()
>>>                     move tasks from busiest cpu to
>>>                     idlest cpu in each of groups;
>>>             else if (schedule policy == power) {
>>>                 if (no hard pin in idlest group)
>>>                     mv tasks from idlest group to
>>>                     busiest until busiest full.
>>>                 else
>>>                     mv unpin tasks to the biggest
>>>                     hard pin group.
>>>             }
>>>         }
>>>     }
>>> }
>>
>> OK, so you only start to group later.. I think we can do better than
>> that.
>>
>>>
>>> sub proposal:
>>> 1, Check whether it is possible to balance tasks onto the idlest cpu rather
>>> than the appointed 'balance cpu'. If so, it may save one more round of
>>> balancing. The idlest cpu can prefer a newly idle cpu, and is the least
>>> loaded cpu;
>>> 2, se or task load is good for setting running time,
>>> but it should be the second basis in load balancing. The first basis of LB
>>> is the number of running tasks in the group/cpu. Whatever the weight of
>>> the groups is, if the number of tasks is less than the number of cpus, the
>>> group still has capacity to take more tasks. (will consider the SMT cpu
>>> power or other big/little cpu capacity on ARM.)
>>
>> Ah, no we shouldn't balance on nr_running, but on the amount of time
>> consumed. Imagine two tasks being woken at the same time, both tasks
>> will only run a fraction of the available time, you don't want this to
>> exceed your capacity because ran back to back the one cpu will still be
>> mostly idle.
>>
>> What you want is to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>
> Observations of the runnable average also have the nice property that
> it quickly converges to 100% when over-scheduled.
>
> Since we also have the usage average for a single task the ratio of
> used avg:runnable avg is likely a useful pointwise estimate.

Yes, that's clearly a good input from your per-task load tracking. You
can have a core which is 100% used by several tasks. In one case the
used avg and the runnable avg are quite similar, which means that the
tasks aren't waiting for the core too much; in the other case the runnable
avg can be at its max value, which means that tasks are waiting for the
core and it's worth using 2 cores in the same cluster.

Vincent
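
[Editorial sketch, not from the thread.] The used avg vs. runnable avg comparison above can be illustrated roughly as follows. This is a minimal sketch under stated assumptions: the struct, its field names, the function names, and the example threshold are all hypothetical, and the values are plain fractions rather than the fixed-point sums a real load-tracking implementation would use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-core summary; names and units are assumptions. */
struct core_avg {
	double used_avg;     /* fraction of time tasks actually ran on the core */
	double runnable_avg; /* summed fraction of time tasks were runnable
	                      * (running or waiting for the core) */
};

/* Ratio of used to runnable time; 1.0 means no task had to wait. */
static double used_runnable_ratio(const struct core_avg *c)
{
	if (c->runnable_avg <= 0.0)
		return 1.0; /* idle core: nothing is waiting */
	return c->used_avg / c->runnable_avg;
}

/*
 * Illustrative policy only: consider a second core in the same cluster
 * when tasks spend too much of their runnable time waiting.
 */
static bool worth_second_core(const struct core_avg *c, double min_ratio)
{
	assert(min_ratio > 0.0 && min_ratio <= 1.0);
	return used_runnable_ratio(c) < min_ratio;
}
```

With a hypothetical threshold of 0.8, a core with used 1.0 and runnable 1.05 stays packed, while a core with used 1.0 and runnable 2.0 would be a candidate for spilling onto a second core.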
>
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
>>
>>> unsolved issues:
>>> 1, like the current scheduler, it doesn't handle cpu affinity well in
>>> load_balance.
>>
>> cpu affinity is always 'fun'.. while there's still a few fun sites in
>> the current load-balancer we do better than we did a while ago.
>>
>>> 2, task groups aren't considered well in this rough proposal.
>>
>> You mean the cgroup mess?
>>
>>> It isn't well thought out and may have mistakes. So I am just sharing my
>>> ideas and hope they become better and workable with your comments and
>>> discussion.
>>
>> Very simplistically the current scheme is a 'spread' the load scheme
>> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
>> cache and cpu power.
>>
>> The power scheme should be a 'pack' scheme, where we minimize the active
>> power domains.
>>
>> One way to implement this is to keep track of an active and
>> under-utilized power domain (the target) and fail the regular (pull)
>> load-balance for all cpus not in that domain. For the cpus that are in
>> that domain we'll have find_busiest select from all other under-utilized
>> domains pulling tasks to fill our target, once full, we pick a new
>> target, goto 1.
>>
>>
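
[Editorial sketch, not from the thread.] The 'pack' loop described above (keep a target domain, fill it from other under-utilized domains, then pick a new target) can be modelled in a few lines. This is a toy model under stated assumptions: load and capacity are plain integers, the struct and function names are hypothetical, and choosing the fullest under-utilized domain as the target is my assumption (the thread only says the target is active and under-utilized). The source follows the "least busy non-idle" hint given earlier in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a power domain: integer load units and capacity. */
struct power_domain {
	int load;
	int capacity;
};

/* Target: an active, under-utilized domain (fullest one; an assumption). */
static int pick_target(const struct power_domain *d, size_t n)
{
	int best = -1;
	for (size_t i = 0; i < n; i++) {
		if (d[i].load == 0 || d[i].load >= d[i].capacity)
			continue; /* idle or already full */
		if (best < 0 || d[i].load > d[best].load)
			best = (int)i;
	}
	return best;
}

/* Source: the least busy non-idle, under-utilized domain other than target. */
static int pick_source(const struct power_domain *d, size_t n, int target)
{
	int best = -1;
	for (size_t i = 0; i < n; i++) {
		if ((int)i == target || d[i].load == 0 ||
		    d[i].load >= d[i].capacity)
			continue;
		if (best < 0 || d[i].load < d[best].load)
			best = (int)i;
	}
	return best;
}

/* Fill the target; once it is full (or sources run out), pick a new one. */
static void pack(struct power_domain *d, size_t n)
{
	for (;;) {
		int t = pick_target(d, n);
		int s = t < 0 ? -1 : pick_source(d, n, t);
		if (s < 0)
			return; /* nothing left to consolidate */
		int room = d[t].capacity - d[t].load;
		int moved = d[s].load < room ? d[s].load : room;
		d[t].load += moved;
		d[s].load -= moved;
	}
}

/* Number of domains that still hold any load. */
static int active_domains(const struct power_domain *d, size_t n)
{
	int count = 0;
	for (size_t i = 0; i < n; i++)
		count += d[i].load > 0;
	return count;
}
```

Each pass either fills the target or empties the source, so the loop terminates. A real implementation would move tasks rather than load units and would have to respect affinity, which this model ignores.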
--
On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:

> If the answer is 'yes' then there's clear cases where the kernel
> (should) automatically know the events where we switch from
> balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner
cases. Putting this kind of policy in the kernel is an awful idea. It
should never be altering policy itself, because it'll get it wrong and
people will file bugs complaining that it got it wrong and the biggest
case where you *need* to be able to handle switching between performance
and power optimisations (your rack management unit just told you that
you're going to have to drop power consumption by 20W) is one where the
kernel doesn't have all the information it needs to do this. So why
bother at all?

--
Matthew Garrett | mjg59@srcf.ucam.org
--
On Mon, 20 Aug 2012, Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 03:47:54PM +0000, Christoph Lameter wrote:
>
> > So please make sure that there are obvious and easy ways to switch this
> > stuff off or provide a "low latency" knob that keeps the system from
> > assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
> already do what you want?

Don't know. A simple solution is not to compile power management into the
kernel.

--
On 08/20/2012 11:36 PM, Vincent Guittot wrote:

>> > What you want is to keep track of a per-cpu utilization level (inverse
>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>> > new task on will exceed the utilization limit.
>> >
>> > I think some of the Linaro people actually played around with this,
>> > Vincent?
> Sorry for the late reply but I had almost no network access during last weeks.
>
> So Linaro also works on a power aware scheduler as Peter mentioned.
>
> Based on previous tests, we have concluded that the main drawback of the
> (now removed) old power scheduler was that we had no way to
> differentiate between short and long running tasks, whereas that is a key
> input (at least for phones) for deciding to pack tasks and for
> selecting the core on an asymmetric system.


It is hard to estimate the future from a general viewpoint, but as a hack,
maybe you can add something to task_struct to hint at this. :)

> One additional key information is the power distribution in the system
> which can have a finer granularity than current sched_domain
> description. Peter's proposal was to use a SHARE_POWERLINE flag
> similarly to flags that already describe if a sched_domain share
> resources or cpu capacity.


Seems I missed this. What's the difference from the current
SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

>
> With these 2 new pieces of information, we can have a 1st power saving
> scheduler which spreads or packs tasks across cores and packages


Fine, I'd like to test them on X86, plus SMT and NUMA :)

>
> Vincent


--
On 08/20/2012 11:47 PM, Vincent Guittot wrote:

> On 16 August 2012 07:03, Alex Shi wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>
>>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>>
>>>> power aware scheduling), this proposal will adopt the
>>>> sched_balance_policy concept and use 2 kinds of policy: performance, power.
>>>
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>>
>>
>>
>> By design, the power scheme should have no chance of better performance.
>
> A side effect of packing small tasks on one core is that you always
> use the core with the lowest C-state, which will minimize the wake up
> latency, so you can sometimes get better results than performance mode,
> which will try to use another core in another cluster that will take
> more time to wake up than waiting for the end of the current task.
>


Sure. In some scenarios, packing tasks into a smaller domain will bring a
performance benefit.
--
* Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
>
> > If the answer is 'yes' then there's clear cases where the kernel
> > (should) automatically know the events where we switch from
> > balancing for performance to balancing for power:
>
> No. We can't identify all of these cases and we can't identify
> corner cases. [...]

There's no need to identify 'all' of these cases - but if the
kernel knows then it can have intelligent default behavior.

> [...] Putting this kind of policy in the kernel is an awful
> idea. [...]

A modern kernel better know what state the system is in: on
battery or on AC power.

> [...] It should never be altering policy itself, [...]

The kernel/scheduler simply offers sensible defaults where it
can. User-space can augment/modify/override that in any which
way it wishes to.

This stuff has not been properly sorted out in the last 10+
years since we have battery driven devices, so we might as well
start with the kernel offering sane default behavior where it
can ...

> [...] because it'll get it wrong and people will file bugs
> complaining that it got it wrong and the biggest case where
> you *need* to be able to handle switching between performance
> and power optimisations (your rack management unit just told
> you that you're going to have to drop power consumption by
> 20W) is one where the kernel doesn't have all the information
> it needs to do this. So why bother at all?

The point is to have a working default mechanism.

Thanks,

Ingo
--
On 21 August 2012 02:58, Alex Shi wrote:
> On 08/20/2012 11:36 PM, Vincent Guittot wrote:
>
>>> > What you want is to keep track of a per-cpu utilization level (inverse
>>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>>> > new task on will exceed the utilization limit.
>>> >
>>> > I think some of the Linaro people actually played around with this,
>>> > Vincent?
>> Sorry for the late reply but I had almost no network access during last weeks.
>>
>> So Linaro also works on a power aware scheduler as Peter mentioned.
>>
>> Based on previous tests, we have concluded that the main drawback of the
>> (now removed) old power scheduler was that we had no way to
>> differentiate between short and long running tasks, whereas that is a key
>> input (at least for phones) for deciding to pack tasks and for
>> selecting the core on an asymmetric system.
>
>
> It is hard to estimate the future from a general viewpoint, but as a hack,
> maybe you can add something to task_struct to hint at this. :)
>

The per-task load tracking patchset gives you a good view of the last
dozens of ms.
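
[Editorial sketch, not from the thread.] Why a tracked average reflects roughly the last few dozen milliseconds can be shown with a geometrically decayed sum. As I understand the per-entity load tracking series, time is split into periods of about 1 ms and older periods are decayed so that a contribution halves every 32 periods; treat those parameters as an assumption here. The struct and function names are hypothetical, and this sketch uses floating point where the kernel uses fixed-point arithmetic.

```c
#include <assert.h>

/* Per-period decay factor y with y^32 ~= 0.5 (assumed half-life: 32 periods). */
#define DECAY_Y 0.97857206

/* Hypothetical tracker: decayed sum of per-period runnable fractions. */
struct tracked_avg {
	double sum;
};

/* Account one period in which the task was runnable for `fraction` of it. */
static void track_period(struct tracked_avg *t, double fraction)
{
	assert(fraction >= 0.0 && fraction <= 1.0);
	t->sum = t->sum * DECAY_Y + fraction;
}

/*
 * Normalize against the maximum possible sum, 1 / (1 - y), which is the
 * limit of the geometric series for an always-runnable task.
 */
static double tracked_ratio(const struct tracked_avg *t)
{
	return t->sum * (1.0 - DECAY_Y);
}
```

Starting from zero, 32 fully runnable periods bring the ratio to about 0.5, and 32 idle periods halve it again, so history older than a few half-lives contributes very little.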

>> One additional key information is the power distribution in the system
>> which can have a finer granularity than current sched_domain
>> description. Peter's proposal was to use a SHARE_POWERLINE flag
>> similarly to flags that already describe if a sched_domain share
>> resources or cpu capacity.
>
>
> Seems I missed this. What's the difference from the current
> SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

SD_SHARE_CPUPOWER is set in a sched domain at SMT level (sharing some
part of the physical core)
SD_SHARE_PKG_RESOURCES is set at MC level (sharing some resources like
cache and memory access)
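
[Editorial sketch, not from the thread.] One way to picture how per-level flags describe sharing: each sched domain level carries a bitmask, and the widest level that still carries a flag marks the boundary of that shared resource. The flag names come from this thread, but the bit values, the three-level table, and the placement of the proposed SHARE_POWERLINE flag (spelled SD_SHARE_POWERLINE here) are illustrative assumptions, not the kernel's definitions or any particular platform's topology.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names from the discussion; the bit values are illustrative only. */
#define SD_SHARE_CPUPOWER      0x1u /* SMT siblings share a physical core */
#define SD_SHARE_PKG_RESOURCES 0x2u /* cores share cache/memory access */
#define SD_SHARE_POWERLINE     0x4u /* proposed flag; hypothetical here */

struct sd_level {
	const char *name;
	unsigned int flags;
};

/* Hypothetical topology, innermost level first. */
static const struct sd_level levels[] = {
	{ "SMT", SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERLINE },
	{ "MC",  SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERLINE },
	{ "CPU", 0u },
};

/* Index of the widest level whose cpus still share `flag`, or -1. */
static int widest_level_with(unsigned int flag)
{
	int widest = -1;

	assert(flag != 0u);
	for (size_t i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
		if (levels[i].flags & flag)
			widest = (int)i;
	return widest;
}
```

In this made-up table, cpus share a power line up to the MC level, so a packing policy would treat each MC-level group as one power domain.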

>
>>
>> With these 2 new pieces of information, we can have a 1st power saving
>> scheduler which spreads or packs tasks across cores and packages
>
>
> Fine, I'd like to test them on X86, plus SMT and NUMA :)
>
>>
>> Vincent
>
>
--
On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > [...] Putting this kind of policy in the kernel is an awful
> > idea. [...]
>
> A modern kernel better know what state the system is in: on
> battery or on AC power.

That's a fundamentally uninteresting thing for the kernel to know about.
AC/battery is just not an important power management policy input when
compared to various other things.

> > [...] It should never be altering policy itself, [...]
>
> The kernel/scheduler simply offers sensible defaults where it
> can. User-space can augment/modify/override that in any which
> way it wishes to.
>
> This stuff has not been properly sorted out in the last 10+
> years since we have battery driven devices, so we might as well
> start with the kernel offering sane default behavior where it
> can ...

Userspace has been doing a perfectly reasonable job of determining
policy here.

> > [...] because it'll get it wrong and people will file bugs
> > complaining that it got it wrong and the biggest case where
> > you *need* to be able to handle switching between performance
> > and power optimisations (your rack management unit just told
> > you that you're going to have to drop power consumption by
> > 20W) is one where the kernel doesn't have all the information
> > it needs to do this. So why bother at all?
>
> The point is to have a working default mechanism.

Your suggestions aren't a working default mechanism.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > [...] Putting this kind of policy in the kernel is an awful
> > > idea. [...]
> >
> > A modern kernel better know what state the system is in: on
> > battery or on AC power.
>
> That's a fundamentally uninteresting thing for the kernel to
> know about. [...]

I disagree.

> [...] AC/battery is just not an important power management
> policy input when compared to various other things.

Such as?

The thing is, when I use Linux on a laptop then AC/battery is
*the* main policy input.

> > > [...] It should never be altering policy itself, [...]
> >
> > The kernel/scheduler simply offers sensible defaults where
> > it can. User-space can augment/modify/override that in any
> > which way it wishes to.
> >
> > This stuff has not been properly sorted out in the last 10+
> > years since we have battery driven devices, so we might as
> > well start with the kernel offering sane default behavior
> > where it can ...
>
> Userspace has been doing a perfectly reasonable job of
> determining policy here.

Has it properly switched the scheduler's balancing between
power-efficient and performance-maximizing strategies when for
example a laptop's AC got unplugged/replugged?

> > > [...] because it'll get it wrong and people will file bugs
> > > complaining that it got it wrong and the biggest case
> > > where you *need* to be able to handle switching between
> > > performance and power optimisations (your rack management
> > > unit just told you that you're going to have to drop power
> > > consumption by 20W) is one where the kernel doesn't have
> > > all the information it needs to do this. So why bother at
> > > all?
> >
> > The point is to have a working default mechanism.
>
> Your suggestions aren't a working default mechanism.

In what way?

Thanks,

Ingo
--
>>> A modern kernel better know what state the system is in: on
>>> battery or on AC power.
>>
>> That's a fundamentally uninteresting thing for the kernel to
>> know about. [...]
>
> I disagree.

and I'll agree with Matthew and disagree with you ;-)

>
>> [...] AC/battery is just not an important power management
>> policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

I think you're wrong there.
First of all, not the whole world is a laptop.
Phones and servers are very different than laptops in this sense.
In a phone, when you're charging, you want to be EXTRA power efficient in many ways
(since charging creates heat, and that heat will take away your thermal budget).
In a datacenter, you're either on AC or DC all the time, and power efficiency still matters.

And even on a laptop.. heat production matters even when on AC... laptops are more and more like phones
that way.

--
On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > [...] AC/battery is just not an important power management
> > policy input when compared to various other things.
>
> Such as?

The scheduler's behaviour is going to have a minimal impact on power
consumption on laptops. Other things are much more important - backlight
level, ASPM state, that kind of thing. So why special case the
scheduler? This is going to be hugely more important on multi-socket
systems, where your policy is usually going to be dictated by the
specific workload that you're running at the time. The exception is in
cases where your rack is overcommitted for power and your rack
management unit is telling you to reduce power consumption since
otherwise it's going to have to cut the power to one of the machines in
the rack in the next few seconds.

> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

And it's already well handled from userspace, as it has to be.

> > Userspace has been doing a perfectly reasonable job of
> > determining policy here.
>
> Has it properly switched the scheduler's balancing between
> power-efficient and performance-maximizing strategies when for
> example a laptop's AC got unplugged/replugged?

No, because sched_mt_powersave usually crippled performance more than it
saved power and nobody makes multi-socket laptops.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
> > That's a fundamentally uninteresting thing for the kernel to
> > know about. [...]
>
> I disagree.

The kernel has no idea of the power architecture leading up to the plug
socket. The kernel has no idea of the policy concerns of the user.

> > [...] AC/battery is just not an important power management
> > policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

Along with distance likely to be travelled without a socket being
available, whether you remembered the charger, and a pile of other things
('can I get this built before Linus wakes up').

The kernel isn't capable of computing these other factors. Userspace
can at least make an educated guess.

In the business space it's even more complicated because battery/mains may
well only be visible via SNMP queries to the power systems and the bigger
concern may well be heat efficiency. If you are running a cloud your
policy considerations also include things like your current spot
electricity price, outside temperature and your current spot compute price
chargeable.

> > Userspace has been doing a perfectly reasonable job of
> > determining policy here.
>
> Has it properly switched the scheduler's balancing between
> power-efficient and performance-maximizing strategies when for
> example a laptop's AC got unplugged/replugged?

You work for Red Hat, maybe you should ask your distro people if they do.
While you are at it, perhaps some of the ATA power management, which
will probably be an order of magnitude more significant, could also get
included ;)

Seriously. On a typical laptop the things you can do about power are
dominated by the backlight, by disk power (eg idle SATA links), by USB
device power downs where possible, by turning off any unused phys and by
not having the CPU wake up, which means fixing the desktop apps to behave
sensibly.

I'd like to see actual numbers and evidence on a wide range of workloads
that the spread/don't spread thing is even measurable, given that you've
also got to factor in effects like completing faster and turning everything
off. I'd *really* like to see such evidence on a laptop, which is your
one cited case where it might work.

> > Your suggestions aren't a working default mechanism.
>
> In what way?

For one if the default behaviour is that when I get on the train and am
on battery my video playback begins to stutter due to some kernel
magic then I shall be unamused and file it as a regression.....

Policy is userspace - the desktop can figure out I'm watching movies and
what this means, the kernel can't.

I'd also note there have been repeated attempts on various OSes to handle
power management by putting the power management policy
- in the hardware
- in SMM handlers
- in the kernel

and every single one has been a failure because those parts of the system
never have enough information nor do they have enough variety of control
to manage the complexity of input state.

It's a single policy file for a distro to do scheduler configuration
based upon power events. One trivial 'drop it here' shell script. The
difference then being the desktop can be taught to do overrides and
policy properly.

It might be that the kernel has important knowledge about what "schedule
for efficiency" means, and it may even be useful to be able to ask the
kernel to do that - but it has no idea what the right policy is at any
given moment.

ie even if there is a /sys/mumble/schedule_for_efficiency

the echo "1" > and echo "0" > belong in a script

Alan
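
[Editorial sketch, not from the thread.] The split described above, where the kernel exposes a knob and userspace decides when to flip it, could look like this from a userspace helper. The message suggests a shell script; C is used here only for consistency with the rest of the thread. The function name is hypothetical, and the path is a parameter because /sys/mumble/schedule_for_efficiency is a placeholder in the message, not a real file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" or "0" to a policy knob. The caller supplies the path and
 * decides when to call this (for example on a power event); the kernel
 * side only needs to act on the value.
 * Returns 0 on success, -1 on failure.
 */
static int set_schedule_for_efficiency(const char *path, int enable)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A desktop power daemon could call this from its AC-unplugged handler and skip the call while the user is, say, watching a video, which is exactly the kind of override the kernel cannot decide on its own.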

--
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > [...] AC/battery is just not an important power management
> > > policy input when compared to various other things.
> >
> > Such as?
>
> The scheduler's behaviour is going to have a minimal impact on
> power consumption on laptops. Other things are much more
> important - backlight level, ASPM state, that kind of thing.
> So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about
scheduler policies here, so *if* it makes sense to handle this
dynamically then obviously the scheduler wants to use system
state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side
view mirrors is not important to its top speed, because the
overall shape of the chassis and engine power are much more
important.

But we are designing side view mirrors here, so we might as well
do a good job there.

> [...] This is going to be hugely more important on
> multi-socket systems, where your policy is usually going to be
> dictated by the specific workload that you're running at the
> time. [...]

If only we had some kernel subsystem that is intimately familiar
with the workloads running on the system and could act
accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it
switches and schedules workloads, so how about calling it the
'scheduler'?

> [...] The exception is in cases where your rack is
> overcommitted for power and your rack management unit is
> telling you to reduce power consumption since otherwise it's
> going to have to cut the power to one of the machines in the
> rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not
really the Linux kernel's problem. )

> > The thing is, when I use Linux on a laptop then AC/battery
> > is *the* main policy input.
>
> And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux
desktop distros in droves, because MacOSX or Win7 has 30%+
better battery efficiency.

The scheduler might be a small part of the picture, but it's
certainly a part of it.

> > > Userspace has been doing a perfectly reasonable job of
> > > determining policy here.
> >
> > Has it properly switched the scheduler's balancing between
> > power-efficient and performance-maximizing strategies when for
> > example a laptop's AC got unplugged/replugged?
>
> No, because sched_mt_powersave usually crippled performance
> more than it saved power and nobody makes multi-socket
> laptops.

That's a user-space policy management fail right there: why
wasn't this fixed? If the default policy is in the kernel we can
at least fix it in one place for the most common cases. If it's
spread out amongst multiple projects then progress only happens
at glacial speed ...

Thanks,

Ingo
--
On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > The scheduler's behaviour is going to have a minimal impact on
> > power consumption on laptops. Other things are much more
> > important - backlight level, ASPM state, that kind of thing.
> > So why special case the scheduler? [...]
>
> I'm not special casing the scheduler - but we are talking about
> scheduler policies here, so *if* it makes sense to handle this
> dynamically then obviously the scheduler wants to use system
> state information when/if the kernel can get it.
>
> Your argument is as if you said that the shape of a car's side
> view mirrors is not important to its top speed, because the
> overall shape of the chassis and engine power are much more
> important.
>
> But we are designing side view mirrors here, so we might as well
> do a good job there.

If the kernel is going to make power choices automatically then it
should do it everywhere, not piecemeal.

> > [...] This is going to be hugely more important on
> > multi-socket systems, where your policy is usually going to be
> > dictated by the specific workload that you're running at the
> > time. [...]
>
> If only we had some kernel subsystem that is intimately familiar
> with the workloads running on the system and could act
> accordingly and with low latency.
>
> We could name that subsystem in some intuitive fashion: it
> switches and schedules workloads, so how about calling it the
> 'scheduler'?

The scheduler is unaware of whether I care about a process finishing
quickly or whether I care about it consuming less power.

> > [...] The exception is in cases where your rack is
> > overcommitted for power and your rack management unit is
> > telling you to reduce power consumption since otherwise it's
> > going to have to cut the power to one of the machines in the
> > rack in the next few seconds.
>
> ( That must be some ACPI middleware driven crap, right? Not
> really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are -
ie, it's not.

> > > The thing is, when I use Linux on a laptop then AC/battery
> > > is *the* main policy input.
> >
> > And it's already well handled from userspace, as it has to be.
>
> Not according to the developers switching away from Linux
> desktop distros in droves, because MacOSX or Win7 has 30%+
> better battery efficiency.

Ok so what you're actually telling me here is that you don't understand
anything about power management and where our problems are.

> The scheduler might be a small part of the picture, but it's
> certainly a part of it.

It's in the drivers, which is where it has been since we went tickless.

> > No, because sched_mt_powersave usually crippled performance
> > more than it saved power and nobody makes multi-socket
> > laptops.
>
> That's a user-space policy management fail right there: why
> wasn't this fixed? If the default policy is in the kernel we can
> at least fix it in one place for the most common cases. If it's
> spread out amongst multiple projects then progress only happens
> at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on
performance, and with modern chips that would result in the platform
consuming more power. It was a feature that was useful for a small
number of generations of desktop CPUs - I don't think it would ever skew
the power/performance ratio in a useful direction on mobile hardware.
But feel free to blame userspace for hardware design.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > The scheduler's behaviour is going to have a minimal impact on
> > > power consumption on laptops. Other things are much more
> > > important - backlight level, ASPM state, that kind of thing.
> > > So why special case the scheduler? [...]
> >
> > I'm not special casing the scheduler - but we are talking about
> > scheduler policies here, so *if* it makes sense to handle this
> > dynamically then obviously the scheduler wants to use system
> > state information when/if the kernel can get it.
> >
> > Your argument is as if you said that the shape of a car's side
> > view mirrors is not important to its top speed, because the
> > overall shape of the chassis and engine power are much more
> > important.
> >
> > But we are designing side view mirrors here, so we might as well
> > do a good job there.
>
> If the kernel is going to make power choices automatically
> then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

> The scheduler is unaware of whether I care about a process
> finishing quickly or whether I care about it consuming less
> power.

You are posing them as if the two were mutually exclusive, while
in reality they are not necessarily exclusive: it's quite
possible that the highest (non-turbo) CPU frequency happens to
be the most energy efficient one for a CPU with a particular
workload ...

You also missed the bit of my mail where I suggested that such
user preferences and tolerances can be communicated to the
scheduler via a policy toggle - which the scheduler would take
into account.

I suggest using sane defaults, such as being energy efficient
on battery power (within a sane threshold) and maximizing
throughput on AC power (within a sane threshold).
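
As a rough illustration of such a default, the sketch below reads the AC
adapter state from the power_supply class in sysfs and maps it to a policy
name. The policy names, the function names and the choice to treat "no
adapter information" like AC power are assumptions made for this example;
none of this is an existing scheduler interface.

```python
from pathlib import Path
from typing import Optional


def on_ac_power(root: str = "/sys/class/power_supply") -> Optional[bool]:
    """Return True on AC, False on battery, None if no mains supply is listed."""
    base = Path(root)
    if not base.is_dir():
        return None
    found_mains = False
    for supply in sorted(base.iterdir()):
        try:
            # Each supply directory exposes a "type" attribute; adapters
            # report "Mains" and also expose an "online" attribute.
            if (supply / "type").read_text().strip() != "Mains":
                continue
            found_mains = True
            if (supply / "online").read_text().strip() == "1":
                return True
        except OSError:
            # Entry is unreadable or not a supply directory; skip it.
            continue
    return False if found_mains else None


def default_policy(on_ac: Optional[bool]) -> str:
    """Hypothetical policy names: efficiency on battery, throughput otherwise."""
    # Assumption: a machine with no adapter information (desktop, server)
    # is treated like one running on AC power.
    return "powersave" if on_ac is False else "performance"
```

A caller would use default_policy(on_ac_power()) and re-evaluate it whenever
the adapter is plugged or unplugged.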

That would go a *long* way toward improving the current mess. If Linux
power efficiency was so good today then I'd not ask for kernel
driven defaults - but the reality is that in terms of process
scheduling we suck today (and have sucked for the last 10 years)
so pretty much any approach will improve things.

> > > > The thing is, when I use Linux on a laptop then
> > > > AC/battery is *the* main policy input.
> > >
> > > And it's already well handled from userspace, as it has to
> > > be.
> >
> > Not according to the developers switching away from Linux
> > desktop distros in droves, because MacOSX or Win7 has 30%+
> > better battery efficiency.
>
> Ok so what you're actually telling me here is that you don't
> understand anything about power management and where our
> problems are.

Huh? In practice we suck today in terms of energy efficiency.
That covers both scheduling and other areas.

Saying this out loud does not say anything about my
understanding of power management...

So please outline a technical point.

> > The scheduler might be a small part of the picture, but it's
> > certainly a part of it.
>
> It's in the drivers, which is where it has been since we went
> tickless.

You mean the code is in drivers? Or the problem is in drivers?

Both are true currently - this discussion is about the future, to
make the scheduler aware of power concerns, as the scheduler
(and the timer subsystem) already calculates various interesting
metrics that matter to energy efficient scheduling.

> > > No, because sched_mt_powersave usually crippled performance
> > > more than it saved power and nobody makes multi-socket
> > > laptops.
> >
> > That's a user-space policy management fail right there: why
> > wasn't this fixed? If the default policy is in the kernel we can
> > at least fix it in one place for the most common cases. If it's
> > spread out amongst multiple projects then progress only happens
> > at glacial speed ...
>
> sched_mt_powersave was inherently going to have a huge impact
> on performance, and with modern chips that would result in the
> platform consuming more power. It was a feature that was
> useful for a small number of generations of desktop CPUs - I
> don't think it would ever skew the power/performance ratio in
> a useful direction on mobile hardware. But feel free to blame
> userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it
basically never worked. This thread is about designing and
implementing something that actually works.

Thanks,

Ingo
--
On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > The scheduler is unaware of whether I care about a process
> > finishing quickly or whether I care about it consuming less
> > power.
>
> You are posing them as if the two were mutually exclusive, while
> in reality they are not necessarily exclusive: it's quite
> possible that the highest (non-turbo) CPU frequency happens to
> be the most energy efficient one for a CPU with a particular
> workload ...

You just put in a proviso that makes them mutually exclusive. If I want
it done fast, I want it done in the highest turbo CPU frequency. If I
don't want it done fast, I want it done in the most efficient CPU
frequency. They're probably not the same thing.

> You also missed the bit of my mail where I suggested that such
> user preferences and tolerances can be communicated to the
> scheduler via a policy toggle - which the scheduler would take
> into account.

Yes. And that toggle should be the thing that defines the policy under
all circumstances.

> > Ok so what you're actually telling me here is that you don't
> > understand anything about power management and where our
> > problems are.
>
> Huh? In practice we suck today in terms of energy efficiency.
> That covers both scheduling and other areas.
>
> Saying this out loud does not say anything about my
> understanding of power management...
>
> So please outline a technical point.

Our power consumption is worse than under other operating systems
almost entirely because only one of our three GPU drivers implements any
kind of useful power management. The power saving functionality that we
expose to userspace is already used when it's safe to do so. So blaming
our userspace policy management for our higher power consumption means
that you can't possibly understand where the problems actually are,
which indicates that you probably shouldn't be trying to tell me about
optimal approaches to power management.

> You mean the code is in drivers? Or the problem is in drivers?

The problem is in the drivers.

> > sched_mt_powersave was inherently going to have a huge impact
> > on performance, and with modern chips that would result in the
> > platform consuming more power. It was a feature that was
> > useful for a small number of generations of desktop CPUs - I
> > don't think it would ever skew the power/performance ratio in
> > a useful direction on mobile hardware. But feel free to blame
> > userspace for hardware design.
>
> FYI, sched_mt_powersave is *GONE* in recent kernels, because it
> basically never worked. This thread is about designing and
> implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel
exposed. I said no, because the only knob relevant for laptops would
never improve energy efficiency on laptops. It is therefore impossible
to use this as an example of userspace policy management not doing the
right thing.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
> Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big
machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised
your desktop for wakeups and worked down the big wakeup causes one by one
it's turd polishing.

PM means fixing the stack top to bottom, and it's a whack-a-mole game: each
one you fix, you find the next. You have to sort the entire stack from
desktop apps to kernel.

However benchmarks talk - so let's have some benchmarks ... on a laptop.

Alan
--
On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:

> I'd like to see actual numbers and evidence on a wide range of workloads
> that the spread/don't spread thing is even measurable, given that you've
> also got to factor in effects like completing faster and turning everything
> off. I'd *really* like to see such evidence on a laptop, which is your
> one cited case where it might work.

For my dinky dual core laptop, I suspect you're right, but for a more
powerful laptop, I'd expect spread/don't to be noticeable.

Yeah, hard numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic
load balancing, and expect to see a positive result. Dunno what fickle
electron gods would _really_ do with those prayers though.

-Mike

--
* Alan Cox wrote:

> > Why? Good scheduling is useful even in isolation.
>
> For power - I suspect it's damn near irrelevant except on a
> big big machine.

With deep enough C states it's rather relevant whether we
continue to burn +50W for a couple of more milliseconds or not,
and whether we have the right information from the scheduler and
timer subsystem about how long the next idle period is expected
to be and how bursty a given task is.

'Balance for energy efficiency' obviously ties into the C state
and frequency selection logic, which is rather detached right
now, running its own (imperfect) scheduling metrics logic and
doing pretty much the worst possible C state and frequency
decisions in typical everyday desktop workloads.
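
To make the dependency on the expected idle period concrete, here is a
minimal sketch of the kind of selection rule involved: pick the deepest idle
state whose break-even time fits the predicted idle period and whose wakeup
latency is tolerable. The state table and its numbers are hypothetical, and
the function is only an illustration of the idea, not the kernel's cpuidle
code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int       # time needed to wake up from this state
    target_residency_us: int   # minimum stay for the state to pay off


# Hypothetical table, ordered from shallowest to deepest state.
STATES = (
    IdleState("C1", exit_latency_us=2, target_residency_us=2),
    IdleState("C3", exit_latency_us=80, target_residency_us=200),
    IdleState("C6", exit_latency_us=150, target_residency_us=600),
)


def pick_idle_state(predicted_idle_us: int,
                    latency_limit_us: int,
                    states: Sequence[IdleState] = STATES) -> IdleState:
    """Pick the deepest state that fits the prediction and latency limit."""
    chosen = states[0]  # fall back to the shallowest state
    for state in states:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen
```

The quality of the result depends entirely on predicted_idle_us: an
underestimate keeps the CPU in a shallow state, and an overestimate pays the
entry and exit cost of a deep state without staying there long enough.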

> Unless you've sorted out your SATA, fixed your phy handling,
> optimised your desktop for wakeups and worked down the big
> wakeup causes one by one it's turd polishing.
>
> PM means fixing the stack top to bottom, and it's a whack-a-mole
> game: each one you fix, you find the next. You have to sort the
> entire stack from desktop apps to kernel.

Moving 'policy' into user-space has been an utter failure,
mostly because there's not a single project/subsystem
responsible for getting a good result to users. This is why
I resist the "policy should not be in the kernel" meme here.

> However benchmarks talk - so let's have some benchmarks ... on
> a laptop.

Agreed.

Thanks,

Ingo
--
* Matthew Garrett wrote:

> [...]
>
> Our power consumption is worse than under other operating
> systems almost entirely because only one of our three GPU
> drivers implements any kind of useful power management. [...]

... and because our CPU frequency and C state selection logic is
making pretty much the worst possible decisions (on x86 at
least).

Regardless, you cannot possibly seriously suggest that because
there's even greater suckage elsewhere for some workloads we
should not even bother with improving the situation here.

Anyway, I agree with Alan that actual numbers matter.

Thanks,

Ingo
--
> With deep enough C states it's rather relevant whether we
> continue to burn +50W for a couple of more milliseconds or not,
> and whether we have the right information from the scheduler and
> timer subsystem about how long the next idle period is expected
> to be and how bursty a given task is.

50W for 2ms here and there is an irrelevance compared with burning a
continual half a watt due to the upstream tree lacking some of the SATA
power patches, for example.

It's the classic "standby mode" problem - energy efficiency has time as a
factor and there are a lot of milliseconds in 5 hours. That means
anything continually on rapidly dominates the problem space.
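
The arithmetic behind that comparison can be written down directly. The
sketch below uses only the figures quoted above (50 W, 2 ms, half a watt,
5 hours); the function names are made up for the example, and how often such
bursts really happen is left open.

```python
# Back-of-envelope comparison of a short high-power burst with a small
# continuous drain over a battery session. Figures are from the message.

BURST_POWER_W = 50.0        # extra power drawn while a burst runs
BURST_DURATION_S = 0.002    # 2 ms per burst
CONTINUOUS_POWER_W = 0.5    # always-on drain
SESSION_S = 5 * 3600        # 5 hours on battery


def burst_energy_j() -> float:
    """Energy of a single burst in joules."""
    return BURST_POWER_W * BURST_DURATION_S


def continuous_energy_j() -> float:
    """Energy of the continuous drain over the whole session in joules."""
    return CONTINUOUS_POWER_W * SESSION_S


def break_even_bursts_per_second() -> float:
    """Burst rate at which the bursts cost as much as the continuous drain."""
    return continuous_energy_j() / burst_energy_j() / SESSION_S
```

With these figures one burst costs 0.1 J, the continuous drain costs 9000 J
over the session, and the two are equal at about five bursts per second
sustained for the whole five hours.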

> > PM means fixing the stack top to bottom, and it's a whack-a-mole
> > game: each one you fix, you find the next. You have to sort the
> > entire stack from desktop apps to kernel.
>
> Moving 'policy' into user-space has been an utter failure,
> mostly because there's not a single project/subsystem
> responsible for getting a good result to users. This is why
> I resist the "policy should not be in the kernel" meme here.

You *can't* fix PM in one place. Power management is a top to bottom
thing. It starts in the hardware and propagates right to the top of the
user space stack.

A single stupid behaviour in a desktop app is all it needs to knock the
odd hour or two off your battery life. Something as mundane as refreshing
a bit of the display all the time keeps the GPU and CPU from sleeping
well.

Most distros haven't managed to do power management properly because it
is this entire integration problem. Every single piece of the puzzle has
to be in place before you get any serious gain.

It's not a kernel v user thing. The kernel can't fix it, random bits of
userspace can't fix it. This is effectively a "product level" integration
problem.

Alan
--
* Alan Cox wrote:

> > With deep enough C states it's rather relevant whether we
> > continue to burn +50W for a couple of more milliseconds or
> > not, and whether we have the right information from the
> > scheduler and timer subsystem about how long the next idle
> > period is expected to be and how bursty a given task is.
>
> 50W for 2ms here and there is an irrelevance compared with
> burning a continual half a watt due to the upstream tree lacking
> some of the SATA power patches, for example.

It can be more than an irrelevance if the CPU is saturated - say
a game running on a mobile device very commonly saturates the
CPU. A third of the energy is spent in the CPU, sometimes more.

> It's the classic "standby mode" problem - energy efficiency
> has time as a factor and there are a lot of milliseconds in 5
> hours. That means anything continually on rapidly dominates
> the problem space.
>
> > > PM means fixing the stack top to bottom, and it's a whack-a-mole
> > > game: each one you fix, you find the next. You have to sort the
> > > entire stack from desktop apps to kernel.
> >
> > Moving 'policy' into user-space has been an utter failure,
> > mostly because there's not a single project/subsystem
> > responsible for getting a good result to users. This is why
> > I resist the "policy should not be in the kernel" meme here.
>
> You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go
down the false path of:

" Policy always belongs into user-space so the kernel can
continue to do a shitty job even for pieces it could
understand better ..."

My opinion is that it depends, and I also think that we are so
bad currently (on x86) that we can do little harm by trying to
do things better.

> [...] Power management is a top to bottom thing. It starts in
> the hardware and propagates right to the top of the user space
> stack.

Partly because it's misdesigned: in practice there's very little
true user policy about power saving:

- On mobile devices I almost never tweak policy as a user -
sometimes I override screen brightness but that's all (and
it's trivial compared to all the many other things that go
on).

- On a laptop I'd love to never have to tweak it either -
running fast when on AC and running efficient when on battery
is a perfectly fine life-time default for me.

90% of the "policy" comes with the *form factor* - i.e. it's
something the hardware and thus the kernel could intimately
know about.

Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*,
for a given amount of real utility provided by the device. The
"user policy" fetish here is a serious misunderstanding of how
it should all work. There aren't millions of people out there
wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they
want at most a single, intuitive, actionable control to override
the automation in 1% of the usecases, such as screen brightness.

> A single stupid behaviour in a desktop app is all it needs to
> knock the odd hour or two off your battery life. Something as
> mundane as refreshing a bit of the display all the time
> keeps the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such
app and have very low wakeup rates we still lag behind the
competition.

> Most distros haven't managed to do power management properly
> because it is this entire integration problem. Every single
> piece of the puzzle has to be in place before you get any
> serious gain.

Most certainly.

So why not move most pieces into one well-informed code domain
(the kernel) and only expose high level controls, instead of
expecting user-space to get it all right?

Then the 'only' job of user-space would be to not be silly when
implementing their functionality. (and there's nothing
intimately PM about that.)

> It's not a kernel v user thing. The kernel can't fix it,
> random bits of userspace can't fix it. This is effectively a
> "product level" integration problem.

Of course the kernel can fix many parts by offering automation
like automatically shutting down unused interfaces (and offering
better ABIs if that is not possible due to some poor historic
choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example how moving X lowlevel drivers from user-space
into kernel-space enabled GPU level power management to begin
with. With the old X method it was essentially impossible. Now
it's at least possible.

Or look at how Android adding a high-level interface like
suspend blockers materially improved the power saving situation
for them.

This learned helplessness that "the kernel can do nothing about
PM" is somewhat annoying :-)

Thanks,

Ingo
--
On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
>
> * Matthew Garrett wrote:
>
> > [...]
> >
> > Our power consumption is worse than under other operating
> > systems almost entirely because only one of our three GPU
> > drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> making pretty much the worst possible decisions (on x86 at
> least).

You have figures showing that our C state residence is worse than, say,
Windows? Because my own testing says that we're way better at that.
Could we be better? Sure. Is it why we're worse? No.

> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm
unenthusiastic about putting in automatic hacks related to AC state.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
> It can be more than an irrelevance if the CPU is saturated - say
> a game running on a mobile device very commonly saturates the
> CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you already lost. What are you going to do - the CPU
is saturated - slow it down, then it'll use more power.

> > You *can't* fix PM in one place. [...]
>
> Preferably one project, not one place - but at least don't go
> down the false path of:
>
> " Policy always belongs into user-space so the kernel can
> continue to do a shitty job even for pieces it could
> understand better ..."
>
> My opinion is that it depends, and I also think that we are so
> bad currently (on x86) that we can do little harm by trying to
> do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

>
> > [...] Power management is a top to bottom thing. It starts in
> > the hardware and propagates right to the top of the user space
> > stack.
>
> Partly because it's misdesigned: in practice there's very little
> true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every
single piece of code.

> - On mobile devices I almost never tweak policy as a user -
> sometimes I override screen brightness but that's all (and
> it's trivial compared to all the many other things that go
> on).

Put a single badly broken app on an Android device and your battery life
will plough. That's despite Android having some highly active management
policies to minimise the effect. It works out of the box because someone
spent a huge amount of time with a power meter and monitoring tools
beating up whoever was top of the wakeup lists.

> it should all work. There aren't millions of people out there
> wanting to tweak the heck out of PM.

Don't confuse policy managed by the userspace and buttons for users to
tweak. Userspace understands things like "would it be better to drop
video quality or burn more power" and has access to info the kernel can't
even begin to evaluate.

> People prefer no knobs at all - they want good defaults and they
> want at most a single, intuitive, actionable control to override
> the automation in 1% of the usecases, such as screen brightness.

That's a different discussion.

> > A single stupid behaviour in a desktop app is all it needs to
> > knock the odd hour or two off your battery life. Something as
> > mundane as refreshing a bit of the display all the time
> > keeps the GPU and CPU from sleeping well.
>
> Even with highly powertop-optimized systems that have no such
> app and have very low wakeup rates we still lag behind the
> competition.

Actually we don't. Well not if your distro is put together properly,
and has the relevant SATA patches and the like merged. Stock Fedora may
be pants but if so that's a distro problem.

> So why not move most pieces into one well-informed code domain
> (the kernel) and only expose high level controls, instead of
> expecting user-space to get it all right?

Because the kernel doesn't have the information needed. You'd have to add
megabytes of code to the kernel - including things like video playback
engines.

> Then the 'only' job of user-space would be to not be silly when
> implementing their functionality. (and there's nothing
> intimately PM about that.)

That sounds like ignorance.

> Kernel design decisions *matter*:

Of course they do but it's a tiny part of the story. The power management
function mathematically has a large number of important inputs for which
the kernel cannot deduce the values without massive layering violations.

Also inconveniently for your worldview but as demonstrated in every case
and by everyone who has dug into it, you also have to fix all the wakeup
sources on each level. That's the reality. From the moment you wake for
an event that was not strictly needed you are essentially attempting to
mitigate a failure not trying to deal with the actual problem.

> Look for example how moving X lowlevel drivers from user-space
> into kernel-space enabled GPU level power management to begin
> with. With the old X method it was essentially impossible. Now
> it's at least possible.

Actually it was perfectly possible before for what the cards of the time
could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to
be a good place to do PM.

> Or look at how Android adding a high-level interface like
> suspend blockers materially improved the power saving situation
> for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They
are a tool for freezing stuff that is being rude.

> This learned helplessness that "the kernel can do nothing about
> PM" is somewhat annoying :-)

Sorry, was that a different thread I didn't read?

The inability to learn from both the past and basic systems theory is
what I find rather more irritating. Plus your mistaken belief that we are
worse than the other OS's on this. We are not. If your system sucks then
instrument it, get the SATA patches into your kernel, run powertweak over
it and ask your distro folks why you had to change any of the settings
and why they hadn't shipped it that way.

Alan
--
On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
>
>> I'd like to see actual numbers and evidence on a wide range of workloads
>> that the spread/don't spread thing is even measurable, given that you've
>> also got to factor in effects like completing faster and turning everything
>> off. I'd *really* like to see such evidence on a laptop, which is your
>> one cited case where it might work.
>
> For my dinky dual core laptop, I suspect you're right, but for a more
> powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power.
but.. current linux behavior is to spread.
so we can only make it worse.


>
> Yeah, hard numbers would be nice to see.
>
> If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> load balancing, and expect to see a positive result.

I'd expect to see a negative result ;-)

--
On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> >
> >> I'd like to see actual numbers and evidence on a wide range of workloads
> >> that the spread/don't spread thing is even measurable, given that you've
> >> also got to factor in effects like completing faster and turning everything
> >> off. I'd *really* like to see such evidence on a laptop, which is your
> >> one cited case where it might work.
> >
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling(). Good.

> > Yeah, hard numbers would be nice to see.
> >
> > If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> > load balancing, and expect to see a positive result.
>
> I'd expect to see a negative result ;-)

Ok, so I have my head on backward. Gives a different perspective :)

-Mike

--
On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Right. For a single socket system the only thing you can do is use two
threads in preference to using two cores. That'll keep an extra core in
a deep C state for longer, at the cost of keeping the package out of a
deep C state for longer. There might be a win if the two processes
benefit from improved L1 cache locality, or if you're talking about
short periodic work, but for the majority of cases I'd expect Arjan to
be completely correct here. Things get more interesting with
multi-socket systems, but that's beyond the laptop use case.
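
To make the threads-versus-cores choice above concrete, here is a minimal
Python sketch. It is an illustration only, not scheduler code and not
anything posted in this thread. It assumes the standard sysfs topology
files (cpuN/topology/thread_siblings_list); the helper names
parse_cpu_list, group_by_core, pick_two and read_siblings are
hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list string such as "0-1,4" into [0, 1, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def group_by_core(siblings):
    """Collapse a {cpu: [sibling cpus]} map into a sorted list of cores.

    Each core is a sorted tuple of the logical CPUs (hardware threads)
    that share it.
    """
    return sorted({tuple(sorted(s)) for s in siblings.values()})


def pick_two(cores, pack):
    """Pick two logical CPUs for two runnable tasks.

    pack=True  -> two threads of one core, if any core has SMT siblings
    pack=False -> one thread on each of two different cores
    Falls back to spreading when packing is impossible; returns None if
    fewer than two logical CPUs are usable.
    """
    if pack:
        for core in cores:
            if len(core) >= 2:
                return core[:2]
    if len(cores) >= 2:
        return (cores[0][0], cores[1][0])
    return None


def read_siblings(root="/sys/devices/system/cpu"):
    """Read the sibling map from sysfs (Linux only; path is an assumption)."""
    siblings = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(path.parent.parent.name[3:])  # "cpu12" -> 12
        siblings[cpu] = parse_cpu_list(path.read_text())
    return siblings
```

On a Linux box, pick_two(group_by_core(read_siblings()), pack=True) names
the two hardware threads that would leave the other cores idle. Whether
that placement actually saves power is exactly what the thread is asking
to have measured; the sketch only shows which CPUs each policy would use.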

--
Matthew Garrett | mjg59@srcf.ucam.org
--
On 8/22/2012 6:21 AM, Matthew Garrett wrote:
> On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
>> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
>>> For my dinky dual core laptop, I suspect you're right, but for a more
>>> powerful laptop, I'd expect spread/don't to be noticeable.
>>
>> yeah if you don't spread, you will waste some power.
>> but.. current linux behavior is to spread.
>> so we can only make it worse.
>
> Right. For a single socket system the only thing you can do is use two
> threads in preference to using two cores. That'll keep an extra core in
> a deep C state for longer, at the cost of keeping the package out of a
> deep C state for longer. There might be a win if the two processes
> benefit from improved L1 cache locality, or if you're talking about

basically "if HT sharing would be good for performance" ;-)

(btw this is good news, it means this is not an actual power/performance tradeoff, but a "get it right" tradeoff)

--
On 08/22/2012 05:10 PM, Ingo Molnar wrote:

>
> * Matthew Garrett <[email protected]> wrote:
>
>> [...]
>>
>> Our power consumption is worse than under other operating
>> systems almost entirely because only one of our three GPU
>> drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> making pretty much the worst possible decisions (on x86 at
> least).
>
> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.
>
> Anyway, I agree with Alan that actual numbers matter.


Sure. We'd better turn ideas into code, and then let benchmarks and data
speak.

>
> Thanks,
>
> Ingo


--