[discussion]sched: a rough proposal to enable power saving in scheduler

Posted by Alex Shi 
On Thu, Aug 16, 2012 at 01:03:32PM +0800, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
> > Are there workloads in which "power" might provide more performance than
> > "performance"? If so, don't use these terms.
>
> Power scheme should no chance has better performance in design.

Power will tend to concentrate processes on packages, while performance
will tend to split them across packages? What if two cooperating
processes gain from being on the same package and sharing cache
locality?

--
Matthew Garrett | mjg59@srcf.ucam.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On 08/16/2012 01:31 PM, Matthew Garrett wrote:

> On Thu, Aug 16, 2012 at 01:03:32PM +0800, Alex Shi wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>
>> Power scheme should no chance has better performance in design.
>
> Power will tend to concentrate processes on packages,


yes.

while performance
> will tend to split them across packages?


No, there is still the idea of balance in this rough proposal. If a domain
is not overloaded, it is better to leave existing tasks where they are. I
should say that the current scheduler is already the 'performance' oriented scheme.

What if two cooperating
> processes gain from being on the same package and sharing cache
> locality?
>

--
On Thu, Aug 16, 2012 at 01:39:36PM +0800, Alex Shi wrote:
> On 08/16/2012 01:31 PM, Matthew Garrett wrote:
> > will tend to split them across packages?
>
>
> No, there is still has balance idea in this rough proposal. If a domain
> is not overload, it is better to left old tasks unchanged. I should say,
> current scheduler is the 'performance' trend scheme.

The current process isn't necessarily ideal for all workloads - that's
one of the reasons for letting userspace modify process affinity. I
agree that the "performance" mode will tend to provide better
performance than the "power" mode for an arbitrary workload, but if
there are workloads that would perform better in "power" then it's a
poor naming scheme.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
Hi everyone,

From what I have understood so far, here is my attempt to summarise the
main differences between the performance and power policies, as
relevant to the scheduler's load balancing mechanism. Any thoughts?

*Performance policy*:

Q1. Who triggers load_balance?
Load balance is triggered when a cpu is found to be idle. (Pull mechanism)

Q2. How is load_balance handled?
When triggered, the idle cpu looks for load to pull from within its sched
domain. First the sched groups in the domain the cpu belongs to are queried,
then the runqueues in the busiest group, and then the tasks are moved.

This course of action matches the performance policy because:

1. The idle cpu initiates the pull action.
2. The busiest cpu hands over load to this cpu. In other words, someone who
can take on work is asking who cannot handle any more work.

*Power policy*:

So how is the power policy different? As Peter says, 'pack more than spread
more'.

Q1. Who triggers load balance?
It is the cpu which cannot handle more work. Idle cpus are left
idle. (Push mechanism)

Q2. How is load_balance handled?
First the least busy runqueue within the sched_group that the busy
cpu belongs to is queried. If none exists, i.e. all the runqueues are equally
busy, then move on to the other sched groups.

Here again the 'least busy' policy should be applied, first at
group level and then at the runqueue level.

This course of action matches the power policy because, as
far as possible, busy and capable cpus within a small range try to
handle the existing load.
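
A minimal sketch of the pull/push distinction summarised above, restricted
to a single group of cpus. The struct cpu_stat type and both function names
are hypothetical and are not kernel interfaces; "load" and "capacity" are
abstract units chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu snapshot; not a kernel structure. */
struct cpu_stat {
	unsigned int load;     /* current load, 0 means idle */
	unsigned int capacity; /* load the cpu can hold before it is full */
};

/*
 * Performance (pull): an idle cpu asks which other cpu is busiest.
 * Returns the index of the busiest cpu, or -1 if there is nothing to pull.
 */
int pick_pull_source(const struct cpu_stat *cpus, size_t n, size_t idle_cpu)
{
	int busiest = -1;
	unsigned int max_load = 0;

	assert(idle_cpu < n);
	for (size_t i = 0; i < n; i++) {
		if (i == idle_cpu)
			continue;
		if (cpus[i].load > max_load) {
			max_load = cpus[i].load;
			busiest = (int)i;
		}
	}
	return busiest;
}

/*
 * Power (push): an overloaded cpu asks which non-idle cpu is least busy
 * and still has room. Idle cpus are skipped so they can stay idle.
 * Returns the target index, or -1 if the caller should try another group.
 */
int pick_push_target(const struct cpu_stat *cpus, size_t n, size_t busy_cpu)
{
	int target = -1;
	unsigned int min_load = 0;

	assert(busy_cpu < n);
	for (size_t i = 0; i < n; i++) {
		if (i == busy_cpu || cpus[i].load == 0)
			continue; /* skip ourselves and idle cpus */
		if (cpus[i].load >= cpus[i].capacity)
			continue; /* no room left on this cpu */
		if (target < 0 || cpus[i].load < min_load) {
			min_load = cpus[i].load;
			target = (int)i;
		}
	}
	return target;
}
```

A return value of -1 from pick_push_target() corresponds to the "move on to
the other sched groups" step; that outer loop is left out of the sketch.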

Regards
Preeti


--
On 8/16/12, Alex Shi <[email protected]> wrote:
> On 08/15/2012 10:24 PM, Rakib Mullick wrote:
>
>> On 8/13/12, Alex Shi <[email protected]> wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>>
>> This assumption indirectly point towards the scheme when performance
>> is enabled, isn't it? Cause you're trying to spread the load equally
>> amongst all the CPUs.
>
>
> It is.
>
Okay, then what would be the default mechanism? Performance or
power savings? Your proposal deals with performance and power saving,
but there should be a default mechanism too. What would that default
mechanism be? Shouldn't performance be the default, so that the check
for performance can be dropped?

>>
>>>
>>> select_task_rq_fair()
>>> {
>
> int powersaving = 0;
>
>>> for_each_domain(cpu, tmp) {
>>> if (policy == power && tmp_has_capacity &&
>>> tmp->flags & sd_flag) {
>>> sd = tmp;
>>> //It is fine to got cpu in the domain
>
> powersaving = 1;
>
>>> break;
>>> }
>>> }
>>>
>>> while(sd) {
> if (policy == power && powersaving == 1)
>>> find_busiest_and_capable_group()
>>
>> I'm not sure what find_busiest_and_capable_group() would really be, it
>> seems it'll find the busiest and capable group, but isn't it a
>> conflict with the first assumption you proposed on your proposal?
>
>
> This pseudo code missed a power saving workable flag , adding it into
> above code should solved your concern.
>
I think I should take a look at this one when it is ready for RFC.

Thanks,
Rakib.
--
On 08/16/2012 02:53 PM, preeti wrote:

>
> Hi everyone,
>
> From what I have understood so far,I try to summarise pin pointed
> differences between the performance and power policies as found
> relevant to the scheduler-load balancing mechanism.Any thoughts?


Currently, load_balance is triggered from the timer, on either the periodic
tick or the dynamic tick.
On the periodic tick the cpu is already awake, so doing load_balance does
not cost much. On the dynamic tick, though, we had better check in
nohz_kick_needed() whether the scenario suits the power policy, and then do
nohz_balancer_kick on the least loaded but non-idle cpu if possible. That
reduces the chance of waking an idle cpu.
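
A rough sketch of that selection, for illustration only. The function
pick_nohz_balancer() and enum balance_policy are hypothetical names, not
existing kernel code, and the "performance" branch (kick the first idle
cpu) is only an assumption used here for contrast.

```c
#include <assert.h>
#include <stddef.h>

enum balance_policy { POLICY_PERFORMANCE, POLICY_POWER };

/*
 * Choose which cpu should be kicked to run the nohz balance.
 * load[i] == 0 means cpu i is idle. Returns a cpu index, or -1 if
 * there is no other cpu to kick.
 */
int pick_nohz_balancer(const unsigned int *load, size_t n, size_t self,
		       enum balance_policy policy)
{
	int first_idle = -1;
	int least_busy = -1;

	assert(self < n);
	for (size_t i = 0; i < n; i++) {
		if (i == self)
			continue;
		if (load[i] == 0) {
			if (first_idle < 0)
				first_idle = (int)i;
			continue;
		}
		if (least_busy < 0 || load[i] < load[(size_t)least_busy])
			least_busy = (int)i;
	}

	/* Power policy: prefer a cpu that is already awake. */
	if (policy == POLICY_POWER && least_busy >= 0)
		return least_busy;

	/* Otherwise (or if every other cpu is idle) wake an idle cpu. */
	return first_idle;
}
```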

Any comments?

--
On Thu, Aug 16, 2012 at 12:23 PM, preeti <[email protected]> wrote:
>
> Hi everyone,
>
> From what I have understood so far,I try to summarise pin pointed
> differences between the performance and power policies as found
> relevant to the scheduler-load balancing mechanism.Any thoughts?
>
> *Performance policy*:
>
> Q1.Who triggers load_balance?
> Load balance is triggered when a cpu is found to be idle.(Pull mechanism)
>
> Q2.How is load_balance handled?
> When triggered,the load is looked to be pulled from its sched domain.
> First the sched groups in the domain the cpu belongs to is queried
> followed by the runqueues in the busiest group.then the tasks are moved.
>
> This course of action is found analogous to the performance policy because:
>
> 1.First the idle cpu initiates the pull action
> 2.The busiest cpu hands over the load to this cpu.A person who can
> handle any work is querying as to who cannot handle more work.
>
> *Power policy*:
>
> So how is power policy different? As Peter says,'pack more than spread
> more'.
>
> Q1.Who triggers load balance?
> It is the cpu which cannot handle more work.Idle cpu is left to remain
> idle.(Push mechanism)
>
> Q2.How is load_balance handled?
> First the least busy runqueue,from within the sched_group that the busy
> cpu belongs to is queried.if none exist,ie all the runqueues are equally
> busy then move on to the other sched groups.
>
> Here again the 'least busy' policy should be applied,first at
> group level then at the runqueue level.
>
> This course of action is found analogous to the power policy because as
> much as possible busy and capable cpus within a small range try to
> handle the existing load.
>
Not to complicate the power policy scheme, but always *packing* may
not be the best approach for all CPU packages. As mentioned, packing
ensures that the least number of power domains are in use and, on paper,
reduces the active power consumption, but there are a few
considerations which might conflict with this assumption.

-- Many architectures get the best power saving when the entire CPU cluster
or SD is idle. Intel folks already mentioned this and also extended the
concept to the memory attached to the CPU domain, from a self-refresh point
of view. This is true for CPUs which have very little active leakage, and
hence "race to idle" would be better so that the cluster can hit a deeper
C-state to save more power.

-- Spreading vs packing can actually be made OPP (CPU operating point)
dependent. Some mobile workload and power numbers measured in
the past have shown that when the CPU is operating at a lower OPP (assuming
the load is low), packing would be the best option, giving the cluster a
better opportunity to idle, whereas when operating at a higher operating
point (assuming higher CPU load and possibly more threads), a spread with
race to idle in mind might be beneficial.
Of course this is going to be a bit messy, since CPUFreq and the scheduler
need to be linked.

-- Maybe this is already possible, but for architectures like big.LITTLE,
the power consumption and active leakage can be significantly different
across big and little CPU packages.
Meaning the big CPU cluster or SD might be more power efficient
with packing, whereas the little CPU cluster would be more power efficient
with spreading. Hence the possible need for per-SD configurability.
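
To make the last two points concrete, here is a small sketch of a per-SD
knob that switches between packing and spreading based on the current
operating point. Everything in it is hypothetical: struct sd_power_cfg,
spread_from_khz and choose_placement() do not exist in the kernel, and the
frequencies used with it are arbitrary.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum placement { PLACE_PACK, PLACE_SPREAD };

/* Hypothetical per-sched-domain setting. */
struct sd_power_cfg {
	/* Spread once the current frequency reaches this value (kHz). */
	unsigned int spread_from_khz;
};

/* Convenience values for domains that never change behaviour. */
#define SD_ALWAYS_SPREAD 0u
#define SD_ALWAYS_PACK   UINT_MAX

/* Low OPP: pack so the rest of the cluster can idle. High OPP: spread. */
enum placement choose_placement(const struct sd_power_cfg *cfg,
				unsigned int cur_freq_khz)
{
	assert(cfg != NULL);
	return cur_freq_khz >= cfg->spread_from_khz ? PLACE_SPREAD
						    : PLACE_PACK;
}
```

With this shape, a big cluster could be configured with SD_ALWAYS_PACK and
a little cluster with SD_ALWAYS_SPREAD, while other domains pick a
threshold in between. How the scheduler would learn the current frequency
from CPUFreq is exactly the messy part and is not addressed here.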

Of course all of this can be done step by step, starting with the simplest
power policy as stated by Peter.

Regards
Santosh

been used whenever possible in sd.
--
On 8/15/2012 10:03 PM, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
>
> Power scheme should no chance has better performance in design.

ehm.....

so in reality, the very first thing that helps power, is to run software efficiently.

anything else is completely secondary.

if placement policy leads to a placement that's different from the most efficient placement,
you're already burning extra power...

--
> *Power policy*:
>
> So how is power policy different? As Peter says,'pack more than spread
> more'.

this is ... a dubiously general statement.

for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.

the only thing you do not want to do, is wake cpus up for
tasks that only run extremely briefly (think "100 usec" or less).

so maybe the balance interval is slightly different, or more, you don't balance tasks that
historically ran only for brief periods


--
Hi all,

On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
> >
> > sub proposal:
> > 1, If it's possible to balance task on idlest cpu not appointed 'balance
> > cpu'. If so, it may can reduce one more time balancing.
> > The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> > 2, se or task load is good for running time setting.
> > but it should the second basis in load balancing. The first basis of LB
> > is running tasks' number in group/cpu. Since whatever of the weight of
> > groups is, if the tasks number is less than cpu number, the group is
> > still has capacity to take more tasks. (will consider the SMT cpu power
> > or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?
>

I agree. A better measure of cpu load and task weight than nr_running
and the current task load weight are necessary to do proper task
packing.

I have used PJTs per-task load-tracking for scheduling experiments on
heterogeneous systems and my experience is that it works quite well for
determining the load of a specific task. Something like PJTs work
would be a good starting point for power aware scheduling and better
support for heterogeneous systems.

One of the biggest challenges here for load-balancing is translating
task load from one cpu to another, as the task load is influenced by the
total load of its cpu. So a task that appears to be heavy on an
oversubscribed cpu might not be so heavy after all when it is moved to a
cpu with plenty of cpu time to spare. This issue is likely to be more
pronounced on heterogeneous systems and systems with aggressive frequency
scaling. It might be possible to avoid having to translate load, or it may
turn out not to matter much, but I haven't completely convinced myself yet.

My point is that getting the task load right or at least better is a
fundamental requirement for improving power aware scheduling.
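
As an illustration of what "translating" a load could look like, the
sketch below rescales a tracked task load by the ratio of cpu capacities
and then checks it against a utilization limit on the destination. The
helper names, the capacity units and the limit are all assumptions made
for this example. It covers only the capacity difference between cpus; it
does not correct for a load that was distorted by contention on an
oversubscribed source cpu.

```c
#include <assert.h>

/*
 * Rescale a load measured on a cpu with capacity cap_src to what it
 * would roughly be on a cpu with capacity cap_dst (same work, different
 * speed). Capacities are in arbitrary but consistent units.
 */
unsigned long scale_load(unsigned long load_on_src, unsigned long cap_src,
			 unsigned long cap_dst)
{
	assert(cap_dst > 0);
	return load_on_src * cap_src / cap_dst;
}

/* Would placing the task keep the destination within its limit? */
int task_fits(unsigned long dst_util, unsigned long task_load,
	      unsigned long util_limit)
{
	return dst_util + task_load <= util_limit;
}
```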

Best regards,
Morten

--
On 08/16/2012 10:01 AM, Arjan van de Ven wrote:
>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).
>
> so maybe the balance interval is slightly different, or more, you don't balance tasks that
> historically ran only for brief periods

This makes me think that maybe, in addition to tracking
the idle residency time in the c-state governor, we may
also want to track the average run times in the scheduler.

The c-state governor can call the scheduler code before
putting a CPU to sleep, to indicate (1) the wakeup latency
of the CPU, and (2) whether TLB and/or cache get invalidated.

At wakeup time, the scheduler can check whether the CPU
the to-be-woken process ran on is in a deeper sleep state,
and whether the typical run time for the process significantly
exceeds the wakeup latency of the CPU it last ran on.

If the process typically runs for a short interval, and/or
the process's CPU lost its cached state, it may be better
to run the just-woken task on the CPU that is doing the
waking up, instead of on the CPU where it used to run.
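
A minimal sketch of that wakeup decision, assuming the idle governor has
already recorded the sleep depth and wakeup latency of the previous cpu.
All names here (struct cpu_idle_info, update_avg_run_ns(),
select_wake_cpu(), RUN_VS_LATENCY_FACTOR) are hypothetical, the factor of
4 is an arbitrary stand-in for "significantly exceeds", and the cache/TLB
state is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record written by the c-state governor before sleeping. */
struct cpu_idle_info {
	int deep_sleep;                       /* non-zero if in a deep state */
	unsigned long long wakeup_latency_ns; /* cost of leaving that state */
};

/* Arbitrary margin: run time must be at least 4x the wakeup latency. */
#define RUN_VS_LATENCY_FACTOR 4ULL

/* Running average of a task's run time, new samples weighted by 1/8. */
unsigned long long update_avg_run_ns(unsigned long long avg,
				     unsigned long long sample)
{
	return avg - avg / 8 + sample / 8;
}

/*
 * Pick the cpu for a task being woken: stay on the previous cpu unless
 * it is in deep sleep and the task typically runs too briefly to be
 * worth the wakeup latency, in which case run on the waking cpu.
 */
size_t select_wake_cpu(size_t prev_cpu, size_t waker_cpu,
		       const struct cpu_idle_info *prev_info,
		       unsigned long long avg_run_ns)
{
	assert(prev_info != NULL);
	if (!prev_info->deep_sleep)
		return prev_cpu;
	if (avg_run_ns < RUN_VS_LATENCY_FACTOR * prev_info->wakeup_latency_ns)
		return waker_cpu;
	return prev_cpu;
}
```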

Does that make sense?

Am I overlooking any factors?

--
On 8/16/2012 11:45 AM, Rik van Riel wrote:
>
> The c-state governor can call the scheduler code before
> putting a CPU to sleep, to indicate (1) the wakeup latency
> of the CPU, and (2) whether TLB and/or cache get invalidated.

I don't think (2) is useful really; that basically always happens ;-)

--
On 08/16/2012 10:01 PM, Arjan van de Ven wrote:

>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).


This is very important and valuable information!
How did you arrive at this number? From the context-switch cost, or from
the cache/TLB refill cost?

>
> so maybe the balance interval is slightly different, or more, you don't balance tasks that
> historically ran only for brief periods
>
>


--
On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>> if (sd->nr_running > sd's capacity) {
>> //power saving policy is not suitable for
>> //this scenario, it runs like performance policy
>> mv tasks from busiest cpu in busiest group to
>> idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)

Yes -- I just got back from Africa this week. It's updated for almost
all the previous comments but I ran out of time before I left to
re-post. I'm just about caught up enough that I should be able to get
this done over the upcoming weekend. Monday at the latest.

>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>> } else {// the sd has enough capacity to hold all tasks.
>> if (sg->nr_running > sg's capacity) {
>> //imbalanced between groups
>> if (schedule policy == performance) {
>> //when 2 busiest group at same busy
>> //degree, need to prefer the one has
>> // softest group??
>> move tasks from busiest group to
>> idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>> } else if (schedule policy == power)
>> move tasks from busiest group to
>> idlest group until busiest is just full
>> of capacity.
>> //the busiest group can balance
>> //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>> } else {
>> //all groups has enough capacity for its tasks.
>> if (schedule policy == performance)
>> //all tasks may has enough cpu
>> //resources to run,
>> //mv tasks from busiest to idlest group?
>> //no, at this time, it's better to keep
>> //the task on current cpu.
>> //so, it is maybe better to do balance
>> //in each of groups
>> for_each_imbalance_groups()
>> move tasks from busiest cpu to
>> idlest cpu in each of groups;
>> else if (schedule policy == power) {
>> if (no hard pin in idlest group)
>> mv tasks from idlest group to
>> busiest until busiest full.
>> else
>> mv unpin tasks to the biggest
>> hard pin group.
>> }
>> }
>> }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing. The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.

The runnable average also has the nice property that it quickly
converges to 100% when over-scheduled.

Since we also have the usage average for a single task the ratio of
used avg:runnable avg is likely a useful pointwise estimate.
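
As a sketch of that ratio: the helper below is hypothetical (it is not
part of the load-tracking series) and assumes both averages are expressed
in the same units, with usage never exceeding runnable time and values
small enough that multiplying by 1000 cannot overflow.

```c
#include <assert.h>

/*
 * Ratio of time actually spent running to time spent runnable, in
 * parts per thousand. A task that never had to wait scores 1000; a low
 * score means most of its runnable time was spent waiting for a cpu.
 * A task with no runnable time yet is treated as uncontended.
 */
unsigned int usage_ratio_permille(unsigned long long usage_avg,
				  unsigned long long runnable_avg)
{
	assert(usage_avg <= runnable_avg);
	if (runnable_avg == 0)
		return 1000;
	return (unsigned int)(usage_avg * 1000 / runnable_avg);
}
```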

>
> I think some of the Linaro people actually played around with this,
> Vincent?
>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?
>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
>
>
--
On Wed, Aug 15, 2012 at 11:02 AM, Arjan van de Ven
<[email protected]> wrote:
> On 8/15/2012 9:34 AM, Matthew Garrett wrote:
>> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>>> It bases on the following assumption:
>>>> 1, If there are many task crowd in system, just let few domain cpus
>>>> running and let other cpus idle can not save power. Let all cpu take the
>>>> load, finish tasks early, and then get into idle. will save more power
>>>> and have better user experience.
>>>
>>> I'm not sure this is a valid assumption. I've had it explained to me by
>>> various people that race-to-idle isn't always the best thing. It has to
>>> do with the cost of switching power states and the duration of execution
>>> and other such things.
>>
>> This is affected by Intel's implementation - if there's a single active
>
> not just intel.. also AMD
> basically everyone who has the memory controller in the cpu package will end up with
> a restriction very similar to this.
>

I think this is circular to discussion previously held on this topic.
This preference is arch specific; we need to reduce the set of inputs
to a sensible, actionable set, and plumb that so that the architecture
and not the scheduler can supply this preference.

That you believe 100-300us is actually the tipping point vs power
migration cost is probably in itself one of the most useful replies
I've seen on this topic in all of the last few rounds of discussion
it's been through. It suggests we could actually parameterize this in
a manner similar to wake-up migration cost, with a minimum usage
average for which it's worth spilling to an idle sibling.
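
A sketch of what such a parameter might look like if the architecture
supplied it. struct arch_power_hints, min_spill_usage_ns and
worth_spilling() are hypothetical names, and any concrete threshold value
would have to come from the architecture rather than from this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-architecture tunable, supplied by arch code. */
struct arch_power_hints {
	/* Minimum usage average for which waking an idle sibling pays off. */
	unsigned long long min_spill_usage_ns;
};

/*
 * Decide whether a task is heavy enough to be spilled to an idle
 * sibling instead of being packed onto an already active cpu.
 */
int worth_spilling(const struct arch_power_hints *hints,
		   unsigned long long task_usage_avg_ns,
		   int idle_sibling_available)
{
	assert(hints != NULL);
	if (!idle_sibling_available)
		return 0;
	return task_usage_avg_ns >= hints->min_spill_usage_ns;
}
```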

- Paul

> (this is because the exit-from-self-refresh latency is pretty high.. at least in DDR2/3)
>
>
--
On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> > *Power policy*:
> >
> > So how is power policy different? As Peter says,'pack more than spread
> > more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.

Is this really true? In a two-socket system I'd have thought the benefit
of keeping socket 1 in package C3 outweighed the cost of keeping socket
0 awake for slightly longer.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>> *Power policy*:
>>>
>>> So how is power policy different? As Peter says,'pack more than spread
>>> more'.
>>
>> this is ... a dubiously general statement.
>>
>> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>
> Is this really true? In a two-socket system I'd have thought the benefit
> of keeping socket 1 in package C3 outweighed the cost of keeping socket
> 0 awake for slightly longer.

not on Intel

you can't enter package c3 either until every one is down.
(e.g. memory controller must stay on etc etc)

--
On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> > On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> >> this is ... a dubiously general statement.
> >>
> >> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
> >
> > Is this really true? In a two-socket system I'd have thought the benefit
> > of keeping socket 1 in package C3 outweighed the cost of keeping socket
> > 0 awake for slightly longer.
>
> not on Intel
>
> you can't enter package c3 either until every one is down.
> (e.g. memory controller must stay on etc etc)

I thought that was only PC6 - is there any reason why the package cache
can't be entirely powered down?

--
Matthew Garrett | mjg59@srcf.ucam.org
--
On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
>> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
>>> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>>> this is ... a dubiously general statement.
>>>>
>>>> for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.
>>> Is this really true? In a two-socket system I'd have thought the benefit
>>> of keeping socket 1 in package C3 outweighed the cost of keeping socket
>>> 0 awake for slightly longer.
>> not on Intel
>>
>> you can't enter package c3 either until every one is down.
>> (e.g. memory controller must stay on etc etc)
> I thought that was only PC6 - is there any reason why the package cache
> can't be entirely powered down?

According to
http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
once you're in package C6 then you can go to package C7.

The datasheet for the Xeon E5 (my variant at least) says it doesn't do
C7, so it never powers down the LLC. However, as you said earlier, once you
can put the socket into C6, that saves about 30W compared to C1E.

So as far as I can see with this CPU at least you would benefit from
shutting down a whole socket when possible.

Chris
--
On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
> On 08/17/2012 12:47 PM, Matthew Garrett wrote:

> The datasheet for the Xeon E5 (my variant at least) says it doesn't
> do C7 so never powers down the LLC. However, as you said earlier
> once you can put the socket into C6 which saves about 30W compared
> to C1E.
>
> So as far as I can see with this CPU at least you would benefit from
> shutting down a whole socket when possible.

Having any active cores on the system prevents all packages from going
into PC6 or deeper. What I'm not clear on is whether less deep package C
states are also blocked.

--
Matthew Garrett | mjg59@srcf.ucam.org
--
On 08/17/2012 01:50 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>
>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>> do C7 so never powers down the LLC. However, as you said earlier
>> once you can put the socket into C6 which saves about 30W compared
>> to C1E.
>>
>> So as far as I can see with this CPU at least you would benefit from
>> shutting down a whole socket when possible.
>
> Having any active cores on the system prevents all packages from going
> into PC6 or deeper. What I'm not clear on is whether less deep package C
> states are also blocked.
>

Right, we need the memory controller.

The E5 datasheet is a bit ambiguous, it reads:


A processor enters the package C3 low power state when:
-At least one core is in the C3 state.
-The other cores are in a C3 or lower power state, and the processor
has been granted permission by the platform.


Unfortunately it doesn't specify whether that is the other cores in the
package, or the other cores on the whole system.

Chris
--
On Sat, Aug 18, 2012 at 4:16 AM, Chris Friesen
<[email protected]> wrote:
> On 08/17/2012 01:50 PM, Matthew Garrett wrote:
>>
>> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>>>
>>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>>
>>
>>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>>> do C7 so never powers down the LLC. However, as you said earlier
>>> once you can put the socket into C6 which saves about 30W compared
>>> to C1E.
>>>
>>> So as far as I can see with this CPU at least you would benefit from
>>> shutting down a whole socket when possible.
>>
>>
>> Having any active cores on the system prevents all packages from going
>> into PC6 or deeper. What I'm not clear on is whether less deep package C
>> states are also blocked.
>>
>
> Right, we need the memory controller.
>
> The E5 datasheet is a bit ambiguous, it reads:
>
>
> A processor enters the package C3 low power state when:
> -At least one core is in the C3 state.
> -The other cores are in a C3 or lower power state, and the processor has
> been granted permission by the platform.
>
>
> Unfortunately it doesn't specify whether that is the other cores in the
> package, or the other cores on the whole system.
>

Hardware limitations are just part of the problem. We can find them
out from various white papers or data sheets, or by testing. To me, the
key problem in balancing power and performance still lies in the
CPU and memory allocation method. For example, take a system where we can
benefit from shutting down a whole socket when possible. If a workload
uses 50% of the CPU cycles and 50% of the memory bandwidth and space on a
(modern) two-socket system, an ideal allocation method (I assume that is
the goal of this discussion) should leave the CPU, cache, memory controller
and memory on one socket (node) completely idle and in the deepest power
saving mode. But obviously, we then need to spread as much as possible
across all cores in the other socket (to race to idle). So from the
example above, we see a threshold that we need to consult before
selecting one of two completely different policies: spread or not
spread... As long as there are hardware limitations, we will always
need a knob like that threshold to adapt to different
hardware in one kernel....

/l
--
On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle). So from the
> example above, we see a threshold that we need to reference before
> selecting one from two complete different policy: spread or not
> spread... As long as there is hardware limitation, we could always
> need knob like that referenced threshold to adapt on different
> hardware in one kernel....

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are in the deep power state,
that have to be on if even one of them is alive. On "big core" Intel, that's uncore and memory controller,
on small core (atom/phone) Intel that is the chipset fabric only. On ARM it might be something else. On all of
them it's some clocks, PLLs, voltage regulators etc etc.

not all chips are advanced enough to aggressively turn these things off when they could, but most are nowadays.

so in abstract, there's a power offset that gets you from 0 to 1 awake cores. Let's call this P0
there is also a power offset to go from 1 to 2, but that's smaller than 0->1. Let's call this Pc

or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset.. so P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to sleep afterwards)... call it Ewake

so the abstract question is
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if

Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's not the case, where T is even core specific)


there is no hardware policy *switch* in such formula, only parameters.
If Pbase = 0 (e.g. your hardware has no extra power savings), then the formula very naturally leads to one extreme of the behavior
if Ewake is very high, then it leads to the other extreme.

The only other variable is the user preference between power and performance balance.. but that's a pure preference, not hardware
specific anymore.
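
A minimal sketch of the comparison above, in Python. It assumes task A keeps cpu 0 busy for at least T, so running B in parallel does not extend the time the shared hardware stays awake. Running B serially on cpu 0 then costs T * (Pbase + Pc) extra, running it on a second cpu costs Ewake + T * Pc, and the difference reduces to Ewake < T * Pbase. The class name, function names and numbers are illustrative only, not measurements of any real platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerModel:
    """Abstract platform parameters (hypothetical values, not measured)."""
    p_base: float  # watts: shared overhead paid while any core is awake (Pbase)
    p_core: float  # watts: incremental cost of one running core (Pc)
    e_wake: float  # joules: waking a cpu and letting it go back to sleep (Ewake)


def extra_energy_serial(m: PowerModel, t: float) -> float:
    """B runs on cpu 0 after A: the system stays awake t longer at P0 = Pbase + Pc."""
    return t * (m.p_base + m.p_core)


def extra_energy_parallel(m: PowerModel, t: float) -> float:
    """B runs on a second cpu next to A: pay the wakeup plus Pc for t seconds."""
    return m.e_wake + t * m.p_core


def should_wake_second_cpu(m: PowerModel, t: float) -> bool:
    """The condition from the text: Ewake < T * Pbase."""
    return m.e_wake < t * m.p_base


if __name__ == "__main__":
    model = PowerModel(p_base=2.0, p_core=1.0, e_wake=0.01)
    for t in (0.001, 0.1):
        print(t, should_wake_second_cpu(model, t))
```

With p_base = 0 the condition can never hold, so this sketch never wakes the second cpu. Only the parameters change between platforms, not the formula.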




--
Hi all,
I can probably add some bits to the discussion; after all, I'm preparing
a talk for Plumbers that is closely related :-). My points are not CFS
related (so feel free to ignore me), but they would probably be
interesting if we talk about power aware scheduling in Linux in general.

On 08/16/2012 04:31 PM, Morten Rasmussen wrote:
> Hi all,
>
> On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
>>>
>>> sub proposal:
>>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>>> cpu'. If so, it may can reduce one more time balancing.
>>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>>> 2, se or task load is good for running time setting.
>>> but it should the second basis in load balancing. The first basis of LB
>>> is running tasks' number in group/cpu. Since whatever of the weight of
>>> groups is, if the tasks number is less than cpu number, the group is
>>> still has capacity to take more tasks. (will consider the SMT cpu power
>>> or other big/little cpu capacity on ARM.)
>>
>> Ah, no we shouldn't balance on nr_running, but on the amount of time
>> consumed. Imagine two tasks being woken at the same time, both tasks
>> will only run a fraction of the available time, you don't want this to
>> exceed your capacity because ran back to back the one cpu will still be
>> mostly idle.
>>
>> What you want it to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
>>
>
> I agree. A better measure of cpu load and task weight than nr_running
> and the current task load weight are necessary to do proper task
> packing.
>
> I have used PJTs per-task load-tracking for scheduling experiments on
> heterogeneous systems and my experience is that it works quite well for
> determining the load of a specific task. Something like PJTs work
> would be a good starting point for power aware scheduling and better
> support for heterogeneous systems.
>

I haven't tried PJT's work myself (it's on my todo list), but with
SCHED_DEADLINE you can see the picture from the other side: instead
of tracking per-task load, you can enforce that a task does not exceed its
allowed "load".
This is done by reserving some fraction of CPU time (runtime or budget)
every predefined interval of time (period). Then this allocated
bandwidth is enforced with proper scheduling mechanisms (BTW, I have
another talk at Plumbers explaining the SCHED_DEADLINE patchset in more
detail).
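
To make the runtime/period idea concrete, here is a small Python sketch. It is not code from the SCHED_DEADLINE patchset; the class and function names are made up for illustration, and the admission check is the simplest utilization-based one (total reserved bandwidth must not exceed the number of CPUs), which is an assumption on my side.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Iterable


@dataclass(frozen=True)
class Reservation:
    """A CPU-time reservation: runtime_us of budget every period_us."""
    runtime_us: int
    period_us: int

    def __post_init__(self) -> None:
        if not 0 < self.runtime_us <= self.period_us:
            raise ValueError("need 0 < runtime <= period")

    @property
    def bandwidth(self) -> Fraction:
        """Fraction of one CPU reserved for the task (its allowed 'load')."""
        return Fraction(self.runtime_us, self.period_us)


def admit(existing: Iterable[Reservation], new: Reservation, nr_cpus: int = 1) -> bool:
    """Accept the new reservation only if total bandwidth fits on nr_cpus."""
    total = sum((r.bandwidth for r in existing), Fraction(0)) + new.bandwidth
    return total <= nr_cpus


def remaining_budget(res: Reservation, consumed_us: int) -> int:
    """Budget left in the current period; 0 means throttled until the next period."""
    return max(res.runtime_us - consumed_us, 0)


if __name__ == "__main__":
    video = Reservation(runtime_us=10_000, period_us=100_000)
    print(video.bandwidth)                       # fraction of one CPU
    print(admit([video], Reservation(50, 100)))  # does a second task still fit?
```

Because the bandwidth is fixed at reservation time, it is a known quantity for placement decisions, independent of what else runs on the CPU.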

> One of the biggest challenges here for load-balancing is translating
> task load from one cpu to another as the task load is influenced by the
> total load of its cpu. So a task that appears to be heavy on an
> oversubscribed cpu might not be so heavy after all when it is moved to a
> cpu with plenty cpu time to spare. This issue is likely to be more
> pronounced on heterogeneous systems and system with aggressive frequency
> scaling. It might be possible to avoid having to translate load or that
> it doesn't really matter, but I haven't completely convinced myself yet.
>

This is probably a key point where deadline scheduling could be helpful.
A task's load in this case cannot be influenced by other tasks in the
system, and it is one of the known variables. Actually, this is only
half true. Isolation is achieved only with respect to CPU time between
concurrently executing tasks; other effects like cache interference etc.
cannot be controlled. The nice fact is that a misbehaving task, one that
attempts or experiences deviations from its allowed CPU fraction, is
throttled and cannot influence other tasks' behavior.
As I will show during my talk (power aware deadline scheduling), other
techniques are required when a task's execution time is not strictly
known beforehand, whether this is due to interference or to intrinsic
variability in the performed activity. They fall in the domain of
adaptive/feedback scheduling.

> My point is that getting the task load right or at least better is a
> fundamental requirement for improving power aware scheduling.
>

Fully agree :-).

As I said, I just wanted to add something, sorry if I misinterpreted the
purpose of this discussion.

Best Regards,

- Juri Lelli
--
* Arjan van de Ven <[email protected]> wrote:

> On 8/15/2012 8:04 AM, Peter Zijlstra wrote:
>
> > This all sounds far too complicated.. we're talking about
> > simple spreading and packing balancers without deep arch
> > knowledge and knobs, we couldn't possibly evaluate anything
> > like that.
> >
> > I was really more thinking of something useful for the
> > laptops out there, when they pull the power cord it makes
> > sense to try and keep CPUs asleep until the one that's awake
> > is saturated.

s/CPU/core ?

> as long as you don't do that on machines with an Intel CPU..
> since that'd be the worst case behavior for tasks that run for
> more than 100 usec. (e.g. not interrupts, but almost
> everything else)

The question is, do we need to balance for 'power saving', on
systems that care more about power use than they care about peak
performance/throughput, at all?

If the answer is 'no' then things get rather simple.

If the answer is 'yes' then there's clear cases where the kernel
(should) automatically know the events where we switch from
balancing for performance to balancing for power:

- the system boots up on battery

- the system was on AC but the cord has been pulled and the
system is now on battery

- the administrator configures the system on AC to be
power-conscious.

( and the events in the opposite direction want the scheduler to
switch from 'balancing for power' to 'balancing for
performance'. )

There's also cases where the kernel has insufficient information
from the hardware and from the admin about the preferred
characteristics/policy of the system - a tweakable fallback knob
might be provided for that sad case.

The point is, that knob is not the policy setting and it's not
the main mechanism. It's a fallback.
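
As a rough illustration of that ordering (automatic events first, knob last), here is a Python sketch. Every name in it is hypothetical; it only mirrors the three cases listed above and is not an existing kernel interface.

```python
from enum import Enum
from typing import Optional


class BalancePolicy(Enum):
    PERFORMANCE = "performance"
    POWER = "power"


def select_policy(
    on_battery: Optional[bool] = None,
    admin_power_conscious: bool = False,
    fallback: BalancePolicy = BalancePolicy.PERFORMANCE,
) -> BalancePolicy:
    """Pick a balancing policy from what the system knows.

    on_battery is None when the hardware gives no power-source information;
    only in that case is the fallback knob consulted.
    """
    # The administrator asked for power-conscious behaviour, even on AC.
    if admin_power_conscious:
        return BalancePolicy.POWER
    # No information from hardware or admin: the "sad case" fallback knob.
    if on_battery is None:
        return fallback
    # Booted on battery, or the cord was pulled: balance for power.
    # Back on AC: balance for performance.
    return BalancePolicy.POWER if on_battery else BalancePolicy.PERFORMANCE


if __name__ == "__main__":
    print(select_policy(on_battery=True))
    print(select_policy(on_battery=None, fallback=BalancePolicy.POWER))
```

The knob only shows up as a default argument here, which matches its role as a fallback rather than the main mechanism.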

Thanks,

Ingo

--
On Mon, 2012-08-20 at 10:06 +0200, Ingo Molnar wrote:
> > > I was really more thinking of something useful for the
> > > laptops out there, when they pull the power cord it makes
> > > sense to try and keep CPUs asleep until the one that's awake
> > > is saturated.
>
> s/CPU/core ?

I was thinking logical cpus, but whatever really.
--
On 8/20/2012 1:06 AM, Ingo Molnar wrote:
>
>
> There's also cases where the kernel has insufficient information
> from the hardware and from the admin about the preferred
> characteristics/policy of the system - a tweakable fallback knob
> might be provided for that sad case.
>
> The point is, that knob is not the policy setting and it's not
> the main mechanism. It's a fallback.

if we call the knob "powersave", it better save power...
if we call it "group together" or "spread out".. no problem with that.




--
On 15 August 2012 13:05, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>> update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>> if (sd->nr_running > sd's capacity) {
>> //power saving policy is not suitable for
>> //this scenario, it runs like performance policy
>> mv tasks from busiest cpu in busiest group to
>> idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>> } else {// the sd has enough capacity to hold all tasks.
>> if (sg->nr_running > sg's capacity) {
>> //imbalanced between groups
>> if (schedule policy == performance) {
>> //when 2 busiest group at same busy
>> //degree, need to prefer the one has
>> // softest group??
>> move tasks from busiest group to
>> idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>> } else if (schedule policy == power)
>> move tasks from busiest group to
>> idlest group until busiest is just full
>> of capacity.
>> //the busiest group can balance
>> //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>> } else {
>> //all groups has enough capacity for its tasks.
>> if (schedule policy == performance)
>> //all tasks may has enough cpu
>> //resources to run,
>> //mv tasks from busiest to idlest group?
>> //no, at this time, it's better to keep
>> //the task on current cpu.
>> //so, it is maybe better to do balance
>> //in each of groups
>> for_each_imbalance_groups()
>> move tasks from busiest cpu to
>> idlest cpu in each of groups;
>> else if (schedule policy == power) {
>> if (no hard pin in idlest group)
>> mv tasks from idlest group to
>> busiest until busiest full.
>> else
>> mv unpin tasks to the biggest
>> hard pin group.
>> }
>> }
>> }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing. The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?

Sorry for the late reply, but I had almost no network access during the last few weeks.

So Linaro is also working on a power-aware scheduler, as Peter mentioned.

Based on previous tests, we concluded that the main drawback of the
(now removed) old power scheduler was that we had no way to
distinguish between short and long running tasks, whereas that is a key
input (at least for phones) for deciding whether to pack tasks and for
selecting the core on an asymmetric system.
One additional key piece of information is the power distribution in the
system, which can have a finer granularity than the current sched_domain
description. Peter's proposal was to use a SHARE_POWERLINE flag,
similar to the flags that already describe whether a sched_domain shares
resources or cpu capacity.

With these two new pieces of information, we can have a first power saving
scheduler which spreads or packs tasks across cores and packages.
Vincent
>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?
>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
>
>
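
The 'pack' scheme quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the removed kernel code or any posted patch: a power domain is modelled as a list of integer task utilizations (percent), the target is taken to be the most loaded under-utilized domain, and sources are drained emptiest first. All names are hypothetical.

```python
from typing import Dict, List


def pack_tasks(domains: Dict[str, List[int]], capacity: int = 100) -> Dict[str, List[int]]:
    """Pull tasks into as few power domains as possible.

    domains maps a domain name to the utilization (percent) of each task on it.
    Returns a new mapping; the input is left untouched.
    """
    state = {name: list(tasks) for name, tasks in domains.items()}
    closed = set()  # domains that have already served as the target

    def load(name: str) -> int:
        return sum(state[name])

    while True:
        # Step 1: pick an active, under-utilized domain as the target.
        candidates = [
            n for n in state
            if n not in closed and state[n] and load(n) < capacity
        ]
        if not candidates:
            return state
        target = max(candidates, key=load)

        # Only the target pulls; every other domain's balance attempt "fails".
        sources = sorted((n for n in candidates if n != target), key=load)
        for src in sources:
            for task in sorted(state[src]):
                if load(target) + task <= capacity:
                    state[src].remove(task)
                    state[target].append(task)

        # Target is as full as it can get: pick a new one (the "goto 1").
        closed.add(target)


if __name__ == "__main__":
    print(pack_tasks({"pkg0": [30, 20], "pkg1": [20], "pkg2": [10]}))
```

Using per-task utilization instead of a task count follows the point made earlier in the thread that nr_running is not a good enough measure for packing.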
--
On 16 August 2012 07:03, Alex Shi <[email protected]> wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
>
> Power scheme should no chance has better performance in design.

A side effect of packing small tasks on one core is that you always
use the core with the lowest C-state, which minimizes the wake-up
latency. So you can sometimes get better results than in performance mode,
which will try to use another core in another cluster that takes
more time to wake up than waiting for the end of the current task.

Vincent
--
One issue that is often forgotten is that there are users who want lowest
latency and not highest performance. Our systems sit idle for most of the
time but when a specific event occurs (typically a packet is received)
they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into
various mechanisms that automatically power down when idle for a long time
(to save power...). And it's pretty hard to figure these things out given
the complexity of modern hardware. E.g. for the Sandybridges we found that
the memory channel powers down after 2 milliseconds of idle time and that was
unaffected by any of the BIOS config options. Similar mechanisms exist in
the kernel but those are easier to discover since there is source.

So please make sure that there are obvious and easy ways to switch this
stuff off, or provide a "low latency" knob that keeps the system from
assuming that idle time means that full performance is not needed.
