Priority based queuing

Patrick Hemmer
Priority based queuing
May 02, 2018 03:40AM
Would it be possible to add priority based queuing to haproxy? By this I
mean that when a server/backend is full (maxconn), that incoming
requests would be added to the queue in a custom order. The idea here is
that when the system is under stress, to make sure the important
requests get handled first.

In our exact use case, we're looking to use this to help mitigate DOS
attacks. The idea is that if a layer 7 attack is saturating the backend
servers, we can add logic to prioritize the requests. This logic might
be things like requests that have a valid application cookie go to the
front of the queue, or requests that come from a cloud provider (e.g.
EC2) go to the back of the queue.
DOS mitigation is hard because while you can write rules to identify
requests that are suspicious, you don't want to block them outright as
it is possible they might be legitimate. With prioritization, the
requests still get through, and are only affected when the backend is
saturated. If maxconn is not reached, the prioritization has no effect
at all (since queue is empty).

I made the change to haproxy and simulated the conditions in a lab, and
the strategy appears to work.
The change to haproxy was very minor, ~10 lines in queue.c, using
`task->nice` as the prioritization key. However my change is a very
rough PoC, and not worthy of submission.
So before continuing any further down this route, I wanted to see if
this is something that could make it into HAProxy, and what any thoughts
on it might be.
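
For illustration, a minimal sketch of what such a queue.c change could look like. This is not the actual patch: the `pendconn` struct, its fields, and the helper functions here are invented stand-ins, not HAProxy's real internals.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for HAProxy's pending-connection entry. */
struct pendconn {
    int prio;               /* lower value = served first, like task->nice */
    struct pendconn *next;
};

/* Insert in ascending prio order, keeping FIFO among equal priorities,
 * so that popping from the head serves the most important request first. */
void pendconn_add(struct pendconn **head, struct pendconn *p)
{
    while (*head && (*head)->prio <= p->prio)
        head = &(*head)->next;
    p->next = *head;
    *head = p;
}

struct pendconn *pendconn_pop(struct pendconn **head)
{
    struct pendconn *p = *head;
    if (p)
        *head = p->next;
    return p;
}
```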

-Patrick
Willy Tarreau
Re: Priority based queuing
May 02, 2018 05:10PM
On Tue, May 01, 2018 at 09:34:14PM -0400, Patrick Hemmer wrote:
> Would it be possible to add priority based queuing to haproxy? By this I
> mean that when a server/backend is full (maxconn), that incoming
> requests would be added to the queue in a custom order. The idea here is
> that when the system is under stress, to make sure the important
> requests get handled first.

Hehe that's fun that you mention this, as this has been postponed since
around 1.2 or 1.3! Back then we didn't have the equivalent of HTTP rules
to add/subtract some priority. Now we have everything to do it; we "just"
need to replace the lists with priority trees in the queues and that's
all. It's not a big job if someone is interested in working on this.

> In our exact use case, we're looking to use this to help mitigate DOS
> attacks. The idea is that if a layer 7 attack is saturating the backend
> servers, we can add logic to prioritize the requests. This logic might
> be things like requests that have a valid application cookie go to the
> front of the queue, or requests that come from a cloud provider (e.g.
> EC2) go to the back of the queue.

That's exactly why I wanted them to be manipulated via http-request rules,
so that everyone can construct their own conditions. Also I found that for
most shopping sites, having time-based priority is more important than
position-based: you often want this type of request to be processed 100ms
faster than another type of request. With HTTP/2 it will be even more
interesting because it will allow to send the important objects used for
rendering before the other ones, which is very similar to the H2 priority
but more fine-grained if you can adjust it on the fly.

> DOS mitigation is hard because while you can write rules to identify
> requests that are suspicious, you don't want to block them outright as
> it is possible they might be legitimate. With prioritization, the
> requests still get through, and are only affected when the backend is
> saturated. If maxconn is not reached, the prioritization has no effect
> at all (since queue is empty).

I wholeheartedly agree with you :-)

> I made the change to haproxy and simulated the conditions in a lab, and
> the strategy appears to work.
> The change to haproxy was very minor, ~10 lines in queue.c, using
> `task->nice` as the prioritization key. However my change is a very
> rough PoC, and not worthy of submission.

For a rough PoC it's indeed perfectly fine. But for a final design we
really need a separate offset. I've really been convinced in the field about
using time rather than position; if you want to experiment with this I
can give you some insights, it's the same in fact.

> So before continuing any further down this route, I wanted to see if
> this is something that could make it into HAProxy, and what any thoughts
> on it might be.

Absolutely! I've dreamed of it for over a decade, so I'm glad someone
is willing to take care of it! Just checked, the item was added 12
years ago to the roadmap file in 1.2.13 on 2006/05/13 by commit 814cbc6
("[DOC] added (and updated) the ROADMAP file"). The lines were:

- wait queues replaced for priority-based trees
- ability to assign a prio based on L7 matching

The goal has not changed since, I'm patient :-)

Willy
Patrick Hemmer
Re: Priority based queuing
May 02, 2018 06:50PM
On 2018/5/2 11:04, Willy Tarreau wrote:
> On Tue, May 01, 2018 at 09:34:14PM -0400, Patrick Hemmer wrote:
>> Would it be possible to add priority based queuing to haproxy? By this I
>> mean that when a server/backend is full (maxconn), that incoming
>> requests would be added to the queue in a custom order. The idea here is
>> that when the system is under stress, to make sure the important
>> requests get handled first.
> Hehe that's fun that you mention this, as this has been postponed since
> around 1.2 or 1.3! Back then we didn't have the equivalent of HTTP rules
> to add/subtract some priority. Now we have everything to do it; we "just"
> need to replace the lists with priority trees in the queues and that's
> all. It's not a big job if someone is interested in working on this.
>
>> In our exact use case, we're looking to use this to help mitigate DOS
>> attacks. The idea is that if a layer 7 attack is saturating the backend
>> servers, we can add logic to prioritize the requests. This logic might
>> be things like requests that have a valid application cookie go to the
>> front of the queue, or requests that come from a cloud provider (e.g.
>> EC2) go to the back of the queue.
> That's exactly why I wanted them to be manipulated via http-request rules,
> so that everyone can construct their own conditions. Also I found that for
> most shopping sites, having time-based priority is more important than
> position-based: you often want this type of request to be processed 100ms
> faster than another type of request. With HTTP/2 it will be even more
> interesting because it will allow to send the important objects used for
> rendering before the other ones, which is very similar to the H2 priority
> but more fine-grained if you can adjust it on the fly.
>
>> DOS mitigation is hard because while you can write rules to identify
>> requests that are suspicious, you don't want to block them outright as
>> it is possible they might be legitimate. With prioritization, the
>> requests still get through, and are only affected when the backend is
>> saturated. If maxconn is not reached, the prioritization has no effect
>> at all (since queue is empty).
> I wholeheartedly agree with you :-)
>
>> I made the change to haproxy and simulated the conditions in a lab, and
>> the strategy appears to work.
>> The change to haproxy was very minor, ~10 lines in queue.c, using
>> `task->nice` as the prioritization key. However my change is a very
>> rough PoC, and not worthy of submission.
> For a rough PoC it's indeed perfectly fine. But for a final design we
> really need a separate offset. I've really been convinced in the field about
> using time rather than position; if you want to experiment with this I
> can give you some insights, it's the same in fact.
Can you elaborate on what you're thinking of for a time-based queue?

What I'm imagining you mean is that you would write a rule to set the
max queue time, and haproxy would insert it into the queue sorting on
TIME_NOW() + MAX_QUEUE_TIME. The main difference I see between this approach
and scoring is that you ensure that an item doesn't sit in the queue
forever (or whatever `timeout queue` is set to) if higher priority stuff
keeps getting inserted before it.
I don't think this is necessarily a good thing. If you're under a DOS
attack, the goal is to get the good requests processed before any
possible malicious requests. With a time based queue, those malicious
requests will still get processed and starve out the good requests. For
example let's say you're under attack, a bad request comes in with
max_queue_time=1000ms, and then after 999ms elapse, a good request comes
in with max_queue_time=10ms. You have a good request, and a bad request
on the queue, but HAProxy is going to process the bad request first
because its timer is expiring first. Essentially if haproxy is receiving
X good requests per second, and Y bad requests per second, it's still
going to forward X good per second, and Y bad per second, to the backend
server. The only difference is that they're time shifted.

The other thing I could think you mean by time-based is to insert into
the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
requests that don't get processed by their deadline. Essentially a
per-request version of the `timeout queue` setting. But I don't see any
real value in this.

Or do you mean something else?

>
>> So before continuing any further down this route, I wanted to see if
>> this is something that could make it into HAProxy, and what any thoughts
>> on it might be.
> Absolutely! I've dreamed of it for over a decade, so I'm glad someone
> is willing to take care of it! Just checked, the item was added 12
> years ago to the roadmap file in 1.2.13 on 2006/05/13 by commit 814cbc6
> ("[DOC] added (and updated) the ROADMAP file"). The lines were :
>
> - wait queues replaced for priority-based trees
> - ability to assign a prio based on L7 matching
>
> The goal has not changed since, I'm patient :-)
>
> Willy
Willy Tarreau
Re: Priority based queuing
May 02, 2018 07:30PM
On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
> Can you elaborate on what you're thinking of for a time-based queue?
>
> What I'm imagining you mean is that you would write a rule to set the
> max queue time, and haproxy would insert it into the queue sorting on
> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see between this approach
> and scoring is that you ensure that an item doesn't sit in the queue
> forever (or whatever `timeout queue` is set to) if higher priority stuff
> keeps getting inserted before it.

In part, but mostly that under contention you can still maintain a
good quality of service. I remember having had a discussion a very
long time ago with a gaming site operator explaining that he wanted
to send profile management requests to a dedicated slow server so
that people filling in their profile do not disturb the ones playing.
For me it's a good example: "how long do you think these people are
willing to wait for the server to respond to each click to update
their profile?". Let's say 2 seconds is OK, you just add a 2s offset
to the position in the queue for such requests. If in the meantime
other higher priority requests come in, the delayed one may face the
2 seconds delay. But it will not face more. And obviously if no other
request is pending before it in the queue it will be served earlier.

A similar example is for certain shopping sites which consider that
once you have something in your shopping cart, you're much more likely
to finish the process and to pay because you've found the item you
were looking for, so you're more willing to tolerate a slightly higher
response time, and as such provide a better response time to newcomers.
But while this small delay can easily be expressed in milliseconds
(probably 50/100), it's almost impossible to define it using positions.
What if 50 new visitors suddenly come in? You don't want the guy with
his item to suddenly experience a timeout. The time-based queue however
grants you control over the service degradation you're willing to
inflict on your users.

Another interesting example is when you want to provide strong service
guarantees to some top-level visitors while running under intense load.
By using only a position, it's easy to say "I want the Gold class to
advance by 1000 positions". But if the site is highly loaded and you
have 10k pending requests, these 1000 positions remain irrelevant,
because the request sits there, waiting for 90% of the pollution to
be flushed. If you say "I want these requests to advance by 10 seconds"
then no matter how many other requests you have in the queue, it will
really advance by 10 seconds and effectively shrink the response time
by 10 seconds.

Overall for user experience it's important to think in time and not
positions. The rationale behind this is simple: the user will never
accept as a justification for varying quality of service the number of
other visitors. And positions just depend on this, it's an average number
of people you're allowing to pass over. If I'm granted a pass ticket
allowing me to advance 10 places in the cinema queue and I find a
crowded street, I will go back home. If I'm told I won't wait more
than 5 minutes, I'll come regardless of the number of waiters.

> I don't think this is necessarily a good thing.

For having considered the two options several times in the field over the
last decade when sites started to crawl, I became very well convinced ;-)

> If you're under a DOS
> attack, the goal is to get the good requests processed before any
> possible malicious requests. With a time based queue, those malicious
> requests will still get processed and starve out the good requests.

Precisely not that much because the good ones will pass before, and
the malicious ones will then be subject to the queue timeout if there
are too many.

> For
> example let's say you're under attack, a bad request comes in with
> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
> in with max_queue_time=10ms. You have a good request, and a bad request
> on the queue, but HAProxy is going to process the bad request first
> because its timer is expiring first.

Absolutely but you fell into the trap of not considering that the queue
is supposed to be full, so you're in fact comparing only two requests,
while in practice you have many of them and there the effect is much
more visible.

> Essentially if haproxy is receiving
> X good requests per second, and Y bad requests per second, it's still
> going to forward X good per second, and Y bad per second, to the backend
> server. The only difference is that they're time shifted.

Except once subject to the queue timeout. It's a very important aspect we
must not dismiss.

> The other thing I could think you mean by time-based is to insert into
> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
> requests that don't get processed by their deadline. Essentially a
> per-request version of the `timeout queue` setting. But I don't see any
> real value in this.
>
> Or do you mean something else?

What I mean is that the queue is indexed on the computed service time
which is (current_time + offset), where offset can be either positive or
negative, null by default. The queue is sorted by service time, just like
any timer in fact. Then the queue collection simply picks the next event
in the queue.
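
A toy sketch of that indexing, where a sorted list stands in for the tree HAProxy would actually use. The names `queued_req`, `enqueue`, and `dequeue` are invented for illustration, and one plausible sign convention is assumed: a negative offset advances a request, a positive one delays it.

```c
#include <assert.h>
#include <stddef.h>

struct queued_req {
    long long key;            /* service time = enqueue time + offset (ms) */
    struct queued_req *next;
};

/* Index the queue on (now + offset); offset is null by default, negative
 * to advance a request, positive to delay it. */
void enqueue(struct queued_req **head, struct queued_req *r,
             long long now_ms, long long offset_ms)
{
    r->key = now_ms + offset_ms;
    while (*head && (*head)->key <= r->key)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}

/* Collection simply picks the smallest service time, like any timer. */
struct queued_req *dequeue(struct queued_req **head)
{
    struct queued_req *r = *head;
    if (r)
        *head = r->next;
    return r;
}
```

With this, a "pay" request enqueued with a -10s offset is served before ordinary requests that arrived earlier, but a delayed request can never be pushed back by more than its own offset.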

Please also note that right now the queue already considers the time to
maintain fairness between requests targeting a specific server and those
who just want any server. This is what ensures no request starves in either
the server's queue or the backend's queue. With a time-sorted queue, instead
of picking the tasks according to their time of arrival, we'd simply pick
them based on the service time, and we could maintain an excellent fairness
between servers and backends.

Regarding the DOS situations, there are certain aspects which slightly
differ from the points around the quality of service. DOS fighting involves
sacrifice, and basic quality of service hardly provides this by default. In
our case, sacrifice happens on a queue size or time. And the decision should
depend on the request's importance, which is often very different from the
desired response time. The well-known example is the "Pay" button on many
sites: you click on it, and if it doesn't respond fast, you will not press
Escape. You're keeping your fingers crossed hoping it will not return an error,
even if it takes 30 seconds. And moreover, you're happy once it responds!
That's what makes the difference with the response time: in fact you'd
like to say that certain requests are highly sensitive but not urgent, and
that you'd like to be able to increase their timeout and ultimately get
served.

To fight DOS it's the same. Commonly, many sites consider that requests
holding a valid cookie correspond to authenticated users and score much
better in terms of trustworthiness. Thus by having adjustable timeouts, it
would make it very easy to adjust the queue timeout for a request based
on the presence of a cookie or not.

Now when you think about it, a reasonable but simple strategy for some of
the examples above could be summarized like this:

timeout queue 10s
# pay button not urgent but needs to work
http-request set-timeout queue 60s if PAY_BUTTON
http-request set-priority -10s if PAY_BUTTON

# unauthenticated less urgent, can be a DOS
http-request set-priority -1s if NO_COOKIE || FAVICON

# some elements that need to be delivered quickly
http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS

# auto-completion needs to be fast but no problem if it doesn't work
http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
http-request set-priority 10s if AUTO_COMPLETION_REQUEST

# inconsistent user-agents are suspicious, delay them just in case
http-request set-priority -20s if SUSPICIOUS_USER_AGENT

# too fast users need to slow down a little bit
http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }

With a proper API we can even imagine having access to these request
properties from Lua to implement far more advanced policies.

Regards,
Willy
Patrick Hemmer
Re: Priority based queuing
May 02, 2018 10:40PM
On 2018/5/2 13:22, Willy Tarreau wrote:
> On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
>> Can you elaborate on what you're thinking of for a time-based queue?
>>
>> What I'm imagining you mean is that you would write a rule to set the
>> max queue time, and haproxy would insert it into the queue sorting on
>> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see between this approach
>> and scoring is that you ensure that an item doesn't sit in the queue
>> forever (or whatever `timeout queue` is set to) if higher priority stuff
>> keeps getting inserted before it.
> In part, but mostly that under contention you can still maintain a
> good quality of service. I remember having had a discussion a very
> long time ago with a gaming site operator explaining that he wanted
> to send profile management requests to a dedicated slow server so
> that people filling in their profile do not disturb the ones playing.
> For me it's a good example: "how long do you think these people are
> willing to wait for the server to respond to each click to update
> their profile?". Let's say 2 seconds is OK, you just add a 2s offset
> to the position in the queue for such requests. If in the meantime
> other higher priority requests come in, the delayed one may face the
> 2 seconds delay. But it will not face more. And obviously if no other
> request is pending before it in the queue it will be served earlier.
>
> A similar example is for certain shopping sites which consider that
> once you have something in your shopping cart, you're much more likely
> to finish the process and to pay because you've found the item you
> were looking for, so you're more willing to tolerate a slightly higher
> response time, and as such provide a better response time to newcomers.
> But while this small delay can easily be expressed in milliseconds
> (probably 50/100), it's almost impossible to define it using positions.
> What if 50 new visitors suddenly come in? You don't want the guy with
> his item to suddenly experience a timeout. The time-based queue however
> grants you control over the service degradation you're willing to
> inflict on your users.
There's a key difference in our designs. You're trying to classify
traffic based on destination. I'm trying to classify traffic based on
source.
In my design, the maxconn is set to a level such that the server is
healthy and every request can be processed fast. With this, and a score
based queue, and under attack, the good requests will be drained fast,
leaving only the bad requests. Thus no normal user should be getting a
timeout, no matter what resource they're going to.
Your design works when the backend system can't keep up with good
legitimate traffic, and you need to prioritize one good request over
another good request. My design is when the backend system can't keep up
with bad traffic, and so we want to process the good traffic first.

>
> Another interesting example is when you want to provide strong service
> guarantees to some top-level visitors while running under intense load.
> By using only a position, it's easy to say "I want the Gold class to
> advance by 1000 positions". But if the site is highly loaded and you
> have 10k pending requests, these 1000 positions remain irrelevant,
> because the request sits there, waiting for 90% of the pollution to
> be flushed.
I think you're misunderstanding my design, as scoring wouldn't work like
this at all. If you give the gold class a score of 1000 (where higher
number means higher priority), then the only thing that would get
processed before gold class would be another class that has a score >
1000. If nothing does, and a gold request comes in, it gets processed
first, no matter how big the queue is.
Think of scoring as instead having multiple queues. Each queue is only
processed if the queue before it is empty.
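That multiple-queue view can be sketched as an array of FIFO lists scanned from the highest class down. The class count and the `push`/`pop` helpers are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NCLASSES 3            /* e.g. 0 = suspicious, 2 = gold */

struct req { struct req *next; };

struct prio_queues {
    struct req *head[NCLASSES];
    struct req *tail[NCLASSES];
};

/* FIFO append within a class. */
void push(struct prio_queues *q, int cls, struct req *r)
{
    r->next = NULL;
    if (q->tail[cls])
        q->tail[cls]->next = r;
    else
        q->head[cls] = r;
    q->tail[cls] = r;
}

/* Scan classes from highest to lowest: a lower class is only served
 * once every higher class is empty, no matter how deep the queues are. */
struct req *pop(struct prio_queues *q)
{
    for (int c = NCLASSES - 1; c >= 0; c--) {
        struct req *r = q->head[c];
        if (r) {
            q->head[c] = r->next;
            if (!q->head[c])
                q->tail[c] = NULL;
            return r;
        }
    }
    return NULL;
}
```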
> If you say "I want these requests to advance by 10 seconds"
> then no matter how many other requests you have in the queue, it will
> really advance by 10 seconds and effectively shrink the response time
> by 10 seconds.
It may shrink the response time by 10 seconds, but if the normal
response time is 60 seconds, that's still 50 seconds of waiting. If your
users are only willing to put up with 30 seconds of wait, this means you
end up failing *everyone*.

>
> Overall for user experience it's important to think in time and not
> positions. The rationale behind this is simple : the user will never
> accept as a justification for varying quality of service the number of
> other visitors. And positions just depend on this, it's an average number
> of people you're allowing to pass over. If I'm granted a pass ticket
> allowing me to advance 10 places in the cinema queue and I find a
> crowded street, I will go back home. If I'm told I won't wait more
> than 5 minutes, I'll come regardless of the number of waiters.
But a time based queue wouldn't work like this. You can't guarantee
service within 5 minutes. What happens if a ton of people arrive at the
same time who were told "you will never wait more than 5 minutes", and
the cinema can't handle them all? Or if you're the only person in the
line and nobody leaves the cinema for 6 minutes? Or there are already
100 people in line who were told "you will never wait more than 1 hour",
and that hour is almost up?

>
>> I don't think this is necessarily a good thing.
> For having considered the two options several times in the field over the
> last decade when sites started to crawl, I became very well convinced ;-)
>
>> If you're under a DOS
>> attack, the goal is to get the good requests processed before any
>> possible malicious requests. With a time based queue, those malicious
>> requests will still get processed and starve out the good requests.
> Precisely not that much because the good ones will pass before, and
> the malicious ones will then be subject to the queue timeout if there
> are too many.
>
>> For
>> example let's say you're under attack, a bad request comes in with
>> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
>> in with max_queue_time=10ms. You have a good request, and a bad request
>> on the queue, but HAProxy is going to process the bad request first
>> because its timer is expiring first.
> Absolutely but you fell into the trap of not considering that the queue
> is supposed to be full, so you're in fact comparing only two requests,
> while in practice you have many of them and there the effect is much
> more visible.
I don't think so. Server connections are full, not the queue. The point
I was trying to make is that you have a request you are pretty confident
is bad, and you waste resources processing it while the request you
think is good gets to wait. No matter how big the queue is, as long as a
bad request sits in front of a good request in the queue (regardless of
whether good request chronologically came after bad, or bad request
chronologically came after good), this applies.
>
>> Essentially if haproxy is receiving
>> X good requests per second, and Y bad requests per second, it's still
>> going to forward X good per second, and Y bad per second, to the backend
>> server. The only difference is that they're time shifted.
> Except once subject to the queue timeout. It's a very important aspect we
> must not dismiss.
>
>> The other thing I could think you mean by time-based is to insert into
>> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
>> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
>> requests that don't get processed by their deadline. Essentially a
>> per-request version of the `timeout queue` setting. But I don't see any
>> real value in this.
>>
>> Or do you mean something else?
> What I mean is that the queue is indexed on the computed service time
> which is (current_time + offset), where offset can be either positive or
> negative, null by default. The queue is sorted by service time, just like
> any timer in fact. Then the queue collection simply picks the next event
> in the queue.
>
> Please also note that right now the queue already considers the time to
> maintain fairness between requests targeting a specific server and those
> who just want any server. This is what ensures no request starves in either
> the server's queue or the backend's queue. With a time-sorted queue, instead
> of picking the tasks according to their time of arrival, we'd simply pick
> them based on the service time, and we could maintain an excellent fairness
> between servers and backends.
>
> Regarding the DOS situations, there are certain aspects which slightly
> differ from the points around the quality of service. DOS fighting involves
> sacrifice, and basic quality of service hardly provides this by default. In
> our case, sacrifice happens on a queue size or time. And the decision should
> depend on the request's importance, which is often very different from the
> desired response time. The well known example is the "Pay" button on many
> sites : you click on it, and if it doesn't respond fast, you will not press
> Escape. You're keeping fingers crossed hoping it will not return an error,
> even if it takes 30 seconds. And moreover, you're happy once it responds!
> Here that's what makes the difference with the response time : in fact you'd
> like to say that certain requests are highly sensitive but not urgent, and
> that you'd like to be able to increase their timeout and ultimately get
> served.
>
> To fight DOS it's the same. Commonly, many sites consider that requests
> holding a valid cookie correspond to authenticated users and score much
> better in terms of trustworthiness. Thus by having adjustable timeouts, it
> would make it very easy to adjust the queue timeout for a request based
> on the presence of a cookie or not.
>
> Now when you think about it, a reasonable but simple strategy for some of
> the examples above could be summarized like this:
>
> timeout queue 10s
> # pay button not urgent but needs to work
> http-request set-timeout queue 60s if PAY_BUTTON
> http-request set-priority -10s if PAY_BUTTON
>
> # unauthenticated less urgent, can be a DOS
> http-request set-priority -1s if NO_COOKIE || FAVICON
>
> # some elements that need to be delivered quickly
> http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS
>
> # auto-completion needs to be fast but no problem if it doesn't work
> http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
> http-request set-priority 10s if AUTO_COMPLETION_REQUEST
>
> # inconsistent user-agents are suspicious, delay them just in case
> http-request set-priority -20s if SUSPICIOUS_USER_AGENT
>
> # too fast users need to slow down a little bit
> http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }
>
> With a proper API we can even imagine having access to these request
> properties from Lua to implement far more advanced policies.
>
> Regards,
> Willy
Here's an example scenario that a time-based queue cannot handle:
Let's say you have a server that can only process 10 requests per second.
Your normal traffic is 2 requests per second (500ms between requests). Then
you start getting an attack of 50 requests per second from
SUSPICIOUS_USER_AGENT (20ms between requests). The queue will end up with 520
requests (50rps*10s + 2rps*10s, where the 10s is the timeout). Then an
INDEX_HTML request comes in. With `set-priority 1s`, you advance it up
1s in the queue, which puts it ahead of 52 requests, but there are still
468 requests in front of it, which means it'll be 46.8s before it is
processed, and with `timeout queue 10s`, it times out. The only request
which will work is PAY_BUTTON, and that's because of the increased
timeout, as all the SUSPICIOUS_USER_AGENT requests in front of it will
time out first, and at prio -10s, it'll take 20s to complete.

Let's go through this same scenario with a scoring system. Let's assume
that in `set-priority Xs`, the "Xs" becomes "X", where positive means
higher priority.
The queue will have 500 requests pending (this is different from the 520
above; see the end for an explanation). Then an INDEX_HTML request comes in.
With `set-priority 1`, because all the SUSPICIOUS_USER_AGENT requests
have priority -20, the INDEX_HTML goes to the front of the queue and
gets processed as soon as a connection is freed. Let's also assume a
NO_COOKIE request comes in: with priority -1, it'll go right after the
INDEX_HTML request. And then because SUSPICIOUS_USER_AGENT is constantly
at the end of the queue, all the good requests are processed in
near-normal time. The only wait is for a connection slot to open up. And
if the normal traffic rate is 2 requests per second and the server can
handle 10 per second, there should never be more than 1 good request in
the queue (except if they come in at the same time). This is also why, at
the beginning, the queue only has 500 requests pending (the only thing
in the queue is SUSPICIOUS_USER_AGENT, so 50rps*10s=500).
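
The arithmetic of the time-based scenario can be sanity-checked with a few helper functions; these are invented for illustration and just encode the rates and timeouts from the scenario above.

```c
#include <assert.h>

/* Steady-state queue depth when every arrival survives `timeout queue`. */
int queue_depth(int rps, int timeout_s)
{
    return rps * timeout_s;
}

/* In a time-based queue, advancing by adv_s seconds only skips the
 * requests that arrived during those seconds. */
int requests_ahead(int depth, int arrival_rps, int adv_s)
{
    return depth - arrival_rps * adv_s;
}

/* Expected wait in tenths of a second, at service_rps dequeues/second. */
int wait_ds(int ahead, int service_rps)
{
    return ahead * 10 / service_rps;
}
```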

-Patrick
Willy Tarreau
Re: Priority based queuing
May 03, 2018 05:00AM
On Wed, May 02, 2018 at 04:29:33PM -0400, Patrick Hemmer wrote:
> I think you're misunderstanding my design, as scoring wouldn't work like
> this at all. If you give the gold class a score of 1000 (where higher
> number means higher priority), then the only thing that would get
> processed before gold class would be another class that has a score >
> 1000. If nothing does, and a gold request comes in, it gets processed
> first, no matter how big the queue is.
> Think of scoring as instead having multiple queues. Each queue is only
> processed if the queue before it is empty.
(...)

OK you convinced me. Not on everything, but on the fact that we're trying
to address different points. My goal is to make it possible to improve
quality of service between good requests, and your goal is to improve the
classification between good, suspicious, and bad requests. Each of us sees
how to expand our respective model a little bit to address part of the
other one (though less efficiently).

I agree that for your goal, multi-queue is better, but I still maintain
that for my goal, the time-based queue gives better control. The good
news is that the two are orthogonal and 100% compatible.

Basically the queueing system should be redesigned as a list of time-based
trees that are visited in order of priority (or traffic classes, we'll have
to find the proper wording to avoid confusion). If you think you can address
your needs with just a small set of different priorities, probably that we
can implement this with a small array (2-4 queues). If you think you need
more, then we'd rather think about building a composite position value in
the tree made of the priority at the top of the word and the service time
at the bottom of the word. This way, picking the first value will always
find the lowest priority value first. There's one subtlety related to
wrapping time there however, but it can be addressed in two lookups.

Please let me know if you'd be fine with designing and implementing
something like this.

Cheers,
Willy
Patrick Hemmer
Re: Priority based queuing
May 03, 2018 05:30AM
On 2018/5/2 16:29, Patrick Hemmer wrote:
>
>
> On 2018/5/2 13:22, Willy Tarreau wrote:
>> On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
>>> Can you elaborate on what you're thinking of for a time-based queue?
>>>
>>> What I'm imagining you mean is that you would write a rule to set the
>>> max queue time, and haproxy would insert it into the queue sorting on
>>> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see to this approach
>>> vs scoring, is that you ensure that an item doesn't sit on the queue
>>> forever (or whatever `timeout queue` is set to) if higher priority stuff
>>> keeps getting inserted before it.
>> In part, but mostly that under contention you can still maintain a
>> good quality of service. I remember having had a discussion a very
>> long time ago with a gaming site operator explaining that he wanted
>> to send profile management requests to a dedicated slow server so
>> that people filling in their profile do not disturb the ones playing.
>> For me it's a good example : "how long do you think these people are
>> willing to wait for the server to respond to each click to update
>> their profile?". Let's say 2 seconds is OK, you just add a 2s offset
>> to the position in the queue for such requests. If in the mean time
>> other higher priority requests come in, the delayed one may face the
>> 2 seconds delay. But it will not face more. And obviously if no other
>> request is pending before it in the queue it will be served earlier.
>>
>> A similar example is for certain shopping sites which consider that
>> once you have something in your shopping cart, you're much more likely
>> to finish the process and to pay because you've found the item you
>> were looking for, so you're more willing to tolerate a slightly higher
>> response time, and as such provide a better response time to newcomers.
>> But while this small delay can easily be expressed in milliseconds
>> (probably 50/100), it's almost impossible to define it using positions.
>> What if 50 new visitors suddenly come in ? You don't want the guy with
>> his item to suddenly experience a timeout. The time-based queue however
>> grants you a control over the service degradation you're accepting to
>> inflict to your users.
> There's a key difference in our designs. You're trying to classify
> traffic based on destination. I'm trying to classify traffic based on
> source.
> In my design, the maxconn is set to a level such that the server is
> healthy and every request can be processed fast. With this, and a
> score based queue, and under attack, the good requests will be drained
> fast, leaving only the bad requests. Thus no normal user should be
> getting a timeout, no matter what resource they're going to.
> Your design works when the backend system can't keep up with good
> legitimate traffic, and you need to prioritize one good request over
> another good request. My design is when the backend system can't keep
> up with bad traffic, and so we want to process the good traffic first.
>
>> Another interesting example is when you want to provide strong service
>> guarantees to some top-level visitors while running under intense load.
>> By using only a position, it's easy to say "I want the Gold class to
>> advance by 1000 positions". But if the site is highly loaded and you
>> have 10k pending requests, these 1000 positions remain irrelevant,
>> because the requests sits there, waiting for 90% of the pollution to
>> be flushed.
> I think you're misunderstanding my design, as scoring wouldn't work
> like this at all. If you give the gold class a score of 1000 (where
> higher number means higher priority), then the only thing that would
> get processed before gold class would be another class that has a
> score > 1000. If nothing does, and a gold request comes in, it gets
> processed first, no matter how big the queue is.
> Think of scoring as instead having multiple queues. Each queue is only
> processed if the queue before it is empty.
>> If you say "I want these requests to advance by 10 seconds"
>> then no matter how many other requests you have in the queue, it will
>> really advance by 10 seconds and effectively shrink the response time
>> by 10 seconds.
> It may shrink the response time by 10 seconds, but if the normal
> response time is 60 seconds, that's still 50 seconds of waiting. If
> your users are only willing to put up with 30 seconds of wait, this
> means you end up failing *everyone*.
>
>> Overall for user experience it's important to think in time and not
>> positions. The rationale behind this is simple : the user will never
>> accept as a justification for varying quality of service the number of
>> other visitors. And positions just depend on this, it's an average number
>> of people you're allowing to pass over. If I'm granted a pass ticket
>> allowing me to advance 10 places in the cinema queue and I find a
>> crowded street, I will go back home. If I'm told I won't wait more
>> than 5 minutes, I'll come regardless of the number of waiters.
> But a time based queue wouldn't work like this. You can't guarantee
> service within 5 minutes. What happens if a ton of people arrive at
> the same time who were told "you will never wait more than 5 minutes",
> and the cinema can't handle them all? Or if you're the only person in
> the line and nobody leaves the cinema for 6 minutes? Or there are
> already 100 people in line who were told "you will never wait more
> than 1 hour", and that hour is almost up?
>
>>> I don't think this is necessarily a good thing.
>> For having considered the two options several times in field over the
>> last decade when sites started to crawl, I became very well convinced ;-)
>>
>>> If you're under a DOS
>>> attack, the goal is to get the good requests processed before any
>>> possible malicious requests. With a time based queue, those malicious
>>> requests will still get processed and starve out the good requests.
>> Precisely not that much because the good ones will pass before, and
>> the malicious ones will then be subject to the queue timeout if there
>> are too many.
>>
>>> For
>>> example lets say you're under attack, a bad request comes in with
>>> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
>>> in with max_queue_time=10ms. You have a good request, and a bad request
>>> on the queue, but HAProxy is going to process the bad request first
>>> because its timer is expiring first.
>> Absolutely but you fell into the trap of not considering that the queue
>> is supposed to be full, so you're in fact comparing only two requests,
>> while in practice you have many of them and there the effect is much
>> more visible.
> I don't think so. Server connections are full, not the queue. The
> point I was trying to make is that you have a request you are pretty
> confident is bad, and you waste resources processing it while the
> request you think is good gets to wait. No matter how big the queue
> is, as long as a bad request sits in front of a good request in the
> queue (regardless of whether good request chronologically came after
> bad, or bad request chronologically came after good), this applies.
>>> Essentially if haproxy is receiving
>>> X good requests per second, and Y bad requests per second, it's still
>>> going to forward X good per second, and Y bad per second, to the backend
>>> server. The only difference is that they're time shifted.
>> Except once subject to the queue timeout. It's a very important aspect we
>> must not dismiss.
>>
>>> The other thing I could think you mean by time-based is to insert into
>>> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
>>> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
>>> requests that don't get processed by their deadline. Essentially a
>>> per-request version of the `timeout queue` setting. But I don't see any
>>> real value in this.
>>>
>>> Or do you mean something else?
>> What I mean is that the queue is indexed on the computed service time
>> which is (current_time + offset), where offset can be either positive or
>> negative, null by default. The queue is sorted by service time, just like
>> any timer in fact. Then the queue collection simply picks the next event
>> in the queue.
>>
>> Please also note that right now the queue already considers the time to
>> maintain fairness between requests targeting a specific server and those
>> who just want any server. This is what ensures no request starves in either
>> the server's queue or the backend's queue. With a time-sorted queue, instead
>> of picking the tasks according to their time of arrival, we'd simply pick
>> them based on the service time, and we could maintain an excellent fairness
>> between servers and backends.
>>
>> Regarding the DOS situations, there are certain aspects which slightly
>> differ from the points around the quality of service. DOS fighting involves
>> sacrifice, and basic quality of service hardly provides this by default. In
>> our case, sacrifice happens on a queue size or time. And the decision should
>> depend on the request's importance, which is often very different from the
>> desired response time. The well known example is the "Pay" button on many
>> sites : you click on it, and if it doesn't respond fast, you will not press
>> Escape. You're keeping fingers crossed hoping it will not return an error,
>> even if it takes 30 seconds. And moreover, you're happy once it responds!
>> Here that's what makes the difference with the response time : in fact you'd
>> like to say that certain requests are highly sensitive but not urgent, and
>> that you'd like to be able to increase their timeout and ultimately get
>> served.
>>
>> To fight DOS it's the same. Commonly, many sites consider that requests
>> holding a valid cookie correspond to authenticated users and score much
>> better in terms of trustability. Thus by having adjustable timeouts, it
>> would make it very easy to adjust the queue timeout for a request based
>> on the presence of a cookie or not.
>>
>> Now when you think about it, a reasonable but simple strategy for some of
>> the examples above could be summarized to this :
>>
>> timeout queue 10s
>> # pay button not urgent but needs to work
>> http-request set-timeout queue 60s if PAY_BUTTON
>> http-request set-priority -10s if PAY_BUTTON
>>
>> # unauthenticated less urgent, can be a DOS
>> http-request set-priority -1s if NO_COOKIE || FAVICON
>>
>> # some elements that need to be delivered quickly
>> http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS
>>
>> # auto-completion needs to be fast but no problem if it doesn't work
>> http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
>> http-request set-priority 10s if AUTO_COMPLETION_REQUEST
>>
>> # inconsistent user-agents are suspicious, delay them just in case
>> http-request set-priority -20s if SUSPICIOUS_USER_AGENT
>>
>> # too fast users need to slow down a little bit
>> http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }
>>
>> With a proper API we can even imagine having access to these request
>> properties from Lua to implement far more advanced policies.
>>
>> Regards,
>> Willy
> Here's an example scenario that a time based queue cannot handle:
> Lets say you have a server that can only process 10 requests per
> second. Your normal traffic is 2 requests per second (500ms between
> req). Then you start getting an attack of 50 requests per second from
> SUSPICIOUS_USER_AGENT (20ms between req). The queue will end up with
> 520 requests (50rps*10s + 2rps*10s # the 10s is the timeout). Then a
> INDEX_HTML request comes in. With a `set-priority 1s`, you advance it
> up 1s in the queue, which puts it ahead of 52 requests, but there's
> still 468 requests in front of it, which means it'll be 46.8s before
> it is processed, and with `timeout queue 10s`, it times out. The only
> request which will work is the PAY_BUTTON, and that's because of the
> increased timeout, as all the SUSPICIOUS_USER_AGENT requests in front
> of it will time out first, and at prio -10s, it'll take 20s to complete.
>
> Lets go through this same scenario with a scoring system. Lets assume
> that in `set-priority Xs`, the "Xs" becomes "X", where positive means
> higher priority.
> The queue will have 500 requests pending (this is different from the
> 520 above. see the end for explanation). Then a INDEX_HTML request
> comes in. With `set-priority 1`, because all the SUSPICIOUS_USER_AGENT
> requests have priority -20, the INDEX_HTML goes to the front of the
> queue and gets processed as soon as a connection is freed. Lets also
> assume a NO_COOKIE comes in, with priority -1, it'll go right after
> the INDEX_HTML request. And then because SUSPICIOUS_USER_AGENT is
> constantly at the end of the queue, all the good requests are
> processed in near-normal time. The only wait is for a connection slot
> to open up. And if the normal traffic rate is 2 requests per second
> and server can handle 10 per second, there should never be more than 1
> good request in the queue (except if they come in at the same time).
> This is also why at the beginning, the queue only has 500 requests
> pending (the only thing in the queue is SUSPICIOUS_USER_AGENT, so
> 50rps*10s=500).
>
> -Patrick
>
>
After re-reading this, and thinking through your design again, I think
my understanding of your time-based queuing proposal, and thus my
example for it, was wrong.
The earliest SUSPICIOUS_USER_AGENT request would come in at T=0s, and so
it would go in the queue with key=20 (key = T - priority). Then when
INDEX_HTML comes in at T=10s (the oldest SUSPICIOUS_USER_AGENT is timing
out), it goes into the queue with key=9, which sorts before key=20.

This is better, but still problematic: let's say you had a rule that
did `set-priority -15s`. Then at T=10s, it would get queued at key=25.
The admin has judged it to be more important than SUSPICIOUS_USER_AGENT
(prio -15s vs. prio -20s), but the server is still going to waste time
processing all those SUSPICIOUS_USER_AGENT requests.

While this is not what I would desire, I can see there might be a use
case for it. So how about something that would support both:

# insert into queue with time-based key
http-request set-priority date(20) if SUSPICIOUS_USER_AGENT

and

# insert into queue with a relative key
http-request set-priority int(20) if SUSPICIOUS_USER_AGENT

The queue is processed in ascending order, so this should handle both.
I would also propose a sample fetch for retrieving, as well as a Lua
method for retrieving & setting, so that you can perform relative
adjustments to it. E.g.:

http-request set-priority priority,add(5) if FOO


-Patrick
Patrick Hemmer
Re: Priority based queuing
May 05, 2018 01:00AM
On 2018/5/2 22:56, Willy Tarreau wrote:
> On Wed, May 02, 2018 at 04:29:33PM -0400, Patrick Hemmer wrote:
>> I think you're misunderstanding my design, as scoring wouldn't work like
>> this at all. If you give the gold class a score of 1000 (where higher
>> number means higher priority), then the only thing that would get
>> processed before gold class would be another class that has a score >
>> 1000. If nothing does, and a gold request comes in, it gets processed
>> first, no matter how big the queue is.
>> Think of scoring as instead having multiple queues. Each queue is only
>> processed if the queue before it is empty.
> (...)
>
> OK you convinced me. Not on everything, but on the fact that we're trying
> to address different points. My goal is to make it possible to improve
> quality of service between good requests, and your goal is to improve the
> classification between good, suspicious, and bad requests. Each of us see
> how to expand a little bit our respective model to address part of the
> other one (though less efficiently).
>
> I agree that for your goal, multi-queue is better, but I still maintain
> that for my goal, the time-based queue gives better control. The good
> news is that the two are orthogonal and 100% compatible.
>
> Basically the queueing system should be redesigned as a list of time-based
> trees that are visited in order of priority (or traffic classes, we'll have
> to find the proper wording to avoid confusion). If you think you can address
> your needs with just a small set of different priorities, probably that we
> can implement this with a small array (2-4 queues). If you think you need
> more, then we'd rather think about building a composite position value in
> the tree made of the priority at the top of the word and the service time
> at the bottom of the word. This way, picking the first value will always
> find the lowest priority value first. There's one subtlety related to
> wrapping time there however, but it can be addressed in two lookups.
>
> Please let me know if you'd be fine with designing and implementing
> something like this.
>
> Cheers,
> Willy

I'm not quite following the need for multiple queues. Why wouldn't you
just have one sorted queue, where if multiple pending requests have the
same priority, they're served FIFO?

I had sent another email which proposed an implementation that would
satisfy both designs
(https://www.mail-archive.com/[email protected]/msg29876.html).
Unfortunately I was composing it when you sent your most recent reply,
so I didn't see it before I sent.
I went ahead and implemented the functionality I had talked about in
that other email, and have attached a WIP patch. I've done some testing
on it, and it seems to work fine. The syntax is exactly as proposed:

# time-based priority
http-request set-priority date(20) if LOW_PRIORITY
http-request set-priority date(-20) if HIGH_PRIORITY

# score-based priority
http-request set-priority int(20) if LOW_PRIORITY
http-request set-priority int(-20) if HIGH_PRIORITY

Some notes on the implementation:

I used a `long` for tracking the priority. Obviously this has
limitations when used with date(), as the timestamps are fairly large. I
can change this to `long long`.

The queue was implemented using the existing pendconns code. The only
thing that was changed was to insert in sorted order. Obviously this is
just a quick and dirty implementation. There are lots of priority queue
implementations to choose from, and I don't know which one is best
suited to this kind of task. We could keep the existing sorted
linked-list, and just do some optimizations for insertion. Or we could
switch to some sort of tree. The linked list has the advantage that I
suspect the vast majority of additions will append to the end, and pops
are always off the front, a pattern well suited to a linked list. A tree
might become very unbalanced very fast, and require lots of maintenance.

Using the date() fetch obviously doesn't provide good precision. We
could add a `datems()`, `datens()`, or `ticks()` (ms since start) sample
fetch. But obviously we would need to use `long long` for that, though
ticks could use a `long`, which would wrap at around 16 months.

I still plan on adding a sample fetch for obtaining the current priority
value, and a lua function for setting it. Not sure if we want a lua
function for retrieving, or if using the fetch is good enough.

-Patrick

From 4783dccfc869cfb4219d195ba21f896272211c9e Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 4 May 2018 16:31:16 -0400
Subject: [PATCH] MINOR: add {http,tcp}-request set-priority for queue
prioritization

This adds the set-priority action to http-request and tcp-request content.
The priority value is used when connections are queued to determine which connections should be served first. The lowest integer value is served first (akin to the 'nice' parameter).
---
include/types/stream.h | 1 +
src/queue.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
src/stream.c | 1 +
3 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..8a2eb8beb 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,7 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ long priority; /* the priority of the stream in the pending queue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

diff --git a/src/queue.c b/src/queue.c
index 1c730c75c..1a2cea693 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -15,11 +15,14 @@
#include <common/time.h>
#include <common/hathreads.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -195,7 +198,7 @@ void process_srv_queue(struct server *s)
*/
struct pendconn *pendconn_add(struct stream *strm)
{
- struct pendconn *p;
+ struct pendconn *p, *pn;
struct proxy *px;
struct server *srv;

@@ -219,7 +222,10 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.srv_queue_size += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
- LIST_ADDQ(&srv->pendconns, &p->list);
+ list_for_each_entry(pn, &srv->pendconns, list)
+ if (p->strm->priority < pn->strm->priority)
+ break;
+ LIST_ADDQ(&pn->list, &p->list);
HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
}
else {
@@ -228,7 +234,10 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.prx_queue_size += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
- LIST_ADDQ(&px->pendconns, &p->list);
+ list_for_each_entry(pn, &px->pendconns, list)
+ if (p->strm->priority < pn->strm->priority)
+ break;
+ LIST_ADDQ(&pn->list, &p->list);
HA_SPIN_UNLOCK(PROXY_LOCK, &px->lock);
}
HA_ATOMIC_ADD(&px->totpend, 1);
@@ -392,6 +401,64 @@ void pendconn_free(struct pendconn *p)
pool_free(pool_head_pendconn, p);
}

+static enum act_return action_set_priority(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (smp)
+ s->priority = smp->data.u.sint;
+
+ return ACT_RET_CONT;
+}
+
+
+static enum act_parse_ret parse_set_priority(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority", parse_set_priority },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority", parse_set_priority },
+ { /* END */ }
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/stream.c b/src/stream.c
index 1d0b22ca3..9d4bc0bbb 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -220,6 +220,7 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3
Willy Tarreau
Re: Priority based queuing
May 05, 2018 07:40AM
On Fri, May 04, 2018 at 06:49:00PM -0400, Patrick Hemmer wrote:
> I'm not quite following the need for multiple queues. Why wouldn't you
> just have one sorted queue, where if multiple pending requests have the
> same priority, then they're FIFO.

That's what the time-ordered queue does. I proposed this so that you can
ensure that lower priority values are always processed first, until there
are none left at that level.

> I had sent another email which proposed an implementation that would
> satisfy both designs
> (https://www.mail-archive.com/[email protected]/msg29876.html).
> Unfortunately I was composing it when you sent your most recent reply,
> so I didn't see it before I sent.

Thanks, I missed this one indeed.

> I went ahead and implemented the functionality I had talked about in
> that other email, and have attached a WIP patch. I've done some testing
> on it, and it seems to work fine. The syntax is exactly as proposed:
>
> # time-based priority
> http-request set-priority date(20) if LOW_PRIORITY
> http-request set-priority date(-20) if HIGH_PRIORITY
>
> # score-based priority
> http-request set-priority int(20) if LOW_PRIORITY
> http-request set-priority int(-20) if HIGH_PRIORITY
>
> Some notes on the implementation:
>
> I used a `long` for tracking the priority. Obviously this has
> limitations when used with date(), as the timestamps are fairly large. I
> can change this to `long long`.

Ah now I understand how you wanted to implement it. I have a big concern
here with wrapping time however, which is what I explained in an earlier
mail where I said that if needed we could mix date and priority in the
same integer but that would still require two lookups. Our date wraps
every 49.7 days (32-bit millisecond resolution) so depending on the
phases of the moon some requests will be processed before the highest
priority ones or after, and you'll see some priority inversion 7 times
a year.

> The queue was implemented using the existing pendconns code. The only
> thing that was changed was to insert in sorted order. Obviously this is
> just a quick and dirty implementation. There are lots of priority queue
> implementations to choose from, and I don't know which one is best
> suited to this kind of task. We could keep the existing sorted
> linked-list, and just do some optimizations for insertion. Or we could
> switch to some sort of tree. The linked-list has the advantage in that I
> suspect that the vast majority of additions will append to the end, and
> pops are always off the front, so well suited to a linked list. A tree
> might become very unbalanced very fast, and require lots of maintenance.

No, trees are quite cheap, we use them absolutely everywhere now. Just
take a look at wake_expired_tasks() to get an idea. In fact every single
time we've been relying on lists doing linear lookups, we've got several
bug reports about haproxy eating all the CPU because some people were
using it in a way that wasn't initially considered as reasonable but
which made a lot of sense to them. In the case of the list above, the
corner case would be a server with "maxconn 5000" and half of the requests
having a high priority and the other half having a low one, resulting in
an average of 2500 visits in the optimal case where you can insert from
both sides (which was what we used to do a very long time ago in the
scheduler).

> Using the date() fetch obviously doesn't provide good precision.

Hmmm, good point: here it's seconds, so it will not wrap for the next 20
years, but the precision is too coarse to be useful.

> We
> could add a `datems()`, `datens()`, or `ticks()` (ms since start) sample
> fetch. But obviously we would need to use `long long` for that, Though
> ticks could use a `long`, but would wrap at around 16 months.

No, 49.7 days ;-)

The other problem I'm having here is that it artificially mixes two types
of totally unrelated data and relies on the lack of intersection between
their domains (when time doesn't wrap). Indeed, here you're considering
that values below a certain number cannot be dates so they definitely are
fixed priorities. But that also means that people who want to use ranges
will randomly collide with dates and will observe a changing behaviour
along the year.

In this case you'd be better off with the principle consisting in using
a larger integer with the priority at the top and the date at the bottom.
But that's where I said that it would require two lookups to deal with
wrapping date so I'm not sure it's any better than using multiple queues.
Well, at least it's less maintenance burden. The detailed principle is
the following :

uint64_t priority :

63 32 31 0
[ hard priority | date ]

next = lookup_first(queue);
if (!next)
return;

=> next is the lowest value of both priority and date

Then you perform a second lookup with the oldest past date for the same
priority level :

key = (next->key & 0xffffffff00000000ULL) | (now_ms - TIMER_LOOK_BACK);
next2 = lookup_ge(queue, key);
if (next2 && !((next2->key ^ next->key) >> 32))
next = next2;

Then you dequeue next which is always the next one.

That's where I thought that using two distinct levels was simpler, but
if we factor in the fact that it would limit the number of priority
classes and would require more maintenance code, it's probably not
worth the hassle to save the 4 extra lines we have above to respect
the wrapping date.

Also I'm thinking that we can even use 32-bit by putting the frontier
between the date and the fixed priority (let's call it class) somewhere
else :
- 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
- 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
- 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets

I'm pretty sure that the two first options are way enough to cover almost
everyone's needs. Let's say for a moment that we'd use the second option
(12+20 bits), the queue insertion code becomes this :

pendconn->node.key = (stream->prio_class << 20) +
((now_ms + stream->prio_offset) & 0xFFFFF);
if (srv)
eb32_insert(&srv->pendconns, &pendconn->node);
else
eb32_insert(&px->pendconns, &pendconn->node);


And the dequeuing code in pendconn_process_next_strm() which takes care
both of the backend's and server's queues while still respecting their
classes would become (modulo locking, which is not trivial there but
which has to be preserved regardless of the solution) :

srv_node = eb32_first(&srv->pendconns);
prx_node = eb32_first(&prx->pendconns);
if (!srv_node)
return prx_node;
if (!prx_node)
return srv_node;

if (srv_node->key >> 20 < prx_node->key >> 20)
return srv_node;

if (srv_node->key >> 20 > prx_node->key >> 20)
return prx_node;

/* classes are equal, pick oldest expiration first */
key = (srv_node->key & 0xfff00000) +
((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xFFFFF);

srv_node2 = eb32_lookup_ge(&srv->pendconns, key);
if (srv_node2 && !((srv_node2->key ^ srv_node->key) >> 20))
srv_node = srv_node2;

prx_node2 = eb32_lookup_ge(&prx->pendconns, key);
if (prx_node2 && !((prx_node2->key ^ prx_node->key) >> 20))
prx_node = prx_node2;

/* take the first one in time order */
if (((prx_node->key - srv_node->key) & 0xFFFFF) < 0x80000)
return srv_node;
else
return prx_node;

All of this looks fairly reasonable to me and could possibly remove a
bit of the current complexity in the current function which has to
dereference the streams to find the queuing dates.

Then the documentation becomes pretty simple :
- all requests from the lowest numbered priority classes are always
processed first
- within the same priority class, requests are processed in order of
arrival, to which a positive or negative time offset can be applied

We could then have this in http-request and tcp-request :

http-request set-priority-class int(x)
http-request set-priority-offset int(x)

BTW you were right to remind me about using a sample expression instead
of a static integer in the argument. I've been used to using static values
everywhere for years but for such cases, being able to use expressions
is much more useful as we could extract the value from a header or cookie
for example.

Willy
Patrick Hemmer
Re: Priority based queuing
May 05, 2018 07:40PM
On 2018/5/5 01:29, Willy Tarreau wrote:
> On Fri, May 04, 2018 at 06:49:00PM -0400, Patrick Hemmer wrote:
>> I'm not quite following the need for multiple queues. Why wouldn't you
>> just have one sorted queue, where if multiple pending requests have the
>> same priority, then they're FIFO.
> That's what the time-ordered queue does. I proposed this so that you can
> ensure that lower priorities are always processed first until there is no
> more of the same level.
Ah, I was considering the 2 solutions as an either-or choice. Not that
you'd use both in the same configuration, or at least in the same backend.

>> The queue was implemented using the existing pendconns code. The only
>> thing that was changed was to insert in sorted order. Obviously this is
>> just a quick and dirty implementation. There are lots of priority queue
>> implementations to choose from, and I don't know which one is best
>> suited to this kind of task. We could keep the existing sorted
>> linked-list, and just do some optimizations for insertion. Or we could
>> switch to some sort of tree. The linked-list has the advantage in that I
>> suspect that the vast majority of additions will append to the end, and
>> pops are always off the front, so well suited to a linked list. A tree
>> might become very unbalanced very fast, and require lots of maintenance.
> No, trees are quite cheap, we use them absolutely everywhere now. Just
> take a look at wake_expired_tasks() to get an idea. In fact every single
> time we've been relying on lists doing linear lookups, we've got several
> bug reports about haproxy eating all the CPU because some people were
> using it in a way that wasn't initially considered as reasonable but
> which made a lot of sense to them.
This is why I liked the idea of a single implementation. It gave the
flexibility to the user to do whatever they wanted (time or class
based). Granted I was only thinking of doing one or the other, but the
user could still do both, they would just have to manage dedicating a
range to the date/class. Maybe someone would want to consider date
before class.

> In this case you'd be better off with the principle consisting in using
> a larger integer with the priority at the top and the date at the bottom.
> But that's where I said that it would require two lookups to deal with
> wrapping date so I'm not sure it's any better than using multiple queues.
> Well, at least it's less maintenance burden. The detailed principle is
> the following :
>
> uint64_t priority :
>
>  63            32 31             0
> [ hard priority |      date      ]
>
> next = lookup_first(queue);
> if (!next)
> return;
>
> => next is the lowest value of both priority and date
>
> Then you perform a second lookup with the oldest past date for the same
> priority level :
>
> key = (next->key & 0xffffffff00000000ULL) | (now_ms - TIMER_LOOK_BACK);
> next2 = lookup_ge(queue, key);
> if (next2 && !((next2->key ^ next->key) >> 32))
> next = next2;
>
> Then you dequeue next which is always the next one.
>
> That's where I thought that using two distinct levels was simpler, but
> if we factor in the fact that it would limit the number of priority
> classes and would require more maintenance code, it's probably not
> worth the hassle to save the 4 extra lines we have above to respect
> the wrapping date.
>
> Also I'm thinking that we can even use 32-bit by putting the frontier
> between the date and the fixed priority (let's call it class) somewhere
> else :
> - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
> - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
> - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
What's the benefit to using 32-bit over 64-bit, and is that benefit
worth the cost of the added code complexity and processing time? If we
used 64-bit, we could do a 16/48 bit split and have an absolute
timestamp out to year 10889 at millisecond precision.

> I'm pretty sure that the two first options are way enough to cover almost
> everyone's needs.
For me a 12-bit class should work. My intention is to implement
something similar to email spam filtering, where you generate a score
based on the rules that match, and some rules are worth more than
others. 8 bits might be a bit difficult to work within.
I also suspect the 16-bit option (32 second offset) is too small for
date based.

> Let's say for a moment that we'd use the second option
> (12+20 bits), the queue insertion code becomes this :
>
> pendconn->node.key = (stream->prio_class << 20) +
> ((now_ms + stream->prio_offset) & 0xFFFFF);
> if (srv)
> eb32_insert(&srv->pendconns, &pendconn->node);
> else
> eb32_insert(&px->pendconns, &pendconn->node);
>
>
> And the dequeuing code in pendconn_process_next_strm() which takes care
> both of the backend's and server's queues while still respecting their
> classes would become (modulo locking, which is not trivial there but
> which has to be preserved regardless of the solution) :
>
> srv_node = eb32_first(&srv->pendconns);
> prx_node = eb32_first(&prx->pendconns);
> if (!srv_node)
> return prx_node;
> if (!prx_node)
> return srv_node;
>
> if (srv_node->key >> 20 < prx_node->key >> 20)
> return srv_node;
>
> if (srv_node->key >> 20 > prx_node->key >> 20)
> return prx_node;
>
> /* classes are equal, pick oldest expiration first */
> key = (srv_node->key & 0xfff00000) +
> ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xFFFFF);
>
> srv_node2 = eb32_lookup_ge(&srv->pendconns, key);
> if (srv_node2 && !((srv_node2->key ^ srv_node->key) >> 20))
> srv_node = srv_node2;
>
> prx_node2 = eb32_lookup_ge(&prx->pendconns, key);
> if (prx_node2 && !((prx_node2->key ^ prx_node->key) >> 20))
> prx_node = prx_node2;
>
> /* take the first one in time order */
> if (((prx_node->key - srv_node->key) & 0xFFFFF) < 0x80000)
> return srv_node;
> else
> return prx_node;
>
> All of this looks fairly reasonable to me and could possibly remove a
> bit of the current complexity in the current function which has to
> dereference the streams to find the queuing dates.
>
> Then the documentation becomes pretty simple :
> - all requests from the lowest numbered priority classes are always
> processed first
> - within the same priority class, requests are processed in order of
> arrival, to which a positive or negative time offset can be applied
>
> We could then have this in http-request and tcp-request :
>
> http-request set-priority-class int(x)
> http-request set-priority-offset int(x)
>
>
> BTW you were right to remind me about using a sample expression instead
> of a static integer in the argument. I've been used to using static values
> everywhere for years but for such cases, being able to use expressions
> is much more useful as we could extract the value from a header or cookie
> for example.
>
> Willy

The rest of this all seems fine. I've got no other thoughts to add.

So the question is, which route do we go?
1. 32-bit int (12/20 split)
2. 64-bit int (32/32 split)
3. 64-bit int (16/48 split) with absolute timestamps
4. 64-bit int with a single `set-priority` action. Let the user use a new
`ms()` fetch for time-based priorities. The user does the bit shifting if they
want to use both (using the `mul()` converter, or we can add a `shift()`).

-Patrick
Willy Tarreau
Re: Priority based queuing
May 05, 2018 08:00PM
On Sat, May 05, 2018 at 01:33:51PM -0400, Patrick Hemmer wrote:
> > Also I'm thinking that we can even use 32-bit by putting the frontier
> > between the date and the fixed priority (let's call it class) somewhere
> > else :
> > - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
> > - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
> > - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
> What's the benefit to using 32-bit over 64-bit, and is that benefit
> worth the cost of the added code complexity and processing time?

Slightly smaller memory footprint (doesn't count much), but more
importantly much smaller processing time especially on 32-bit systems.

> If we
> used 64-bit, we could do a 16/48 bit split and have an absolute
> timestamp out to year 10889 at millisecond precision.

Not really because that would force us to implement yet another clock
and then to have two distinct set of clocks for queues and for all other
internal events and timeouts (including the queue timeout!). That would
really be the beginning of a mess. We've been running with 32-bit
millisecond-precise ticks for over a decade without ever a complaint that
it was not enough nor that the wrapping at 49.7 days was an issue. If
one day we'd decide to change this, it would be to support finer-grained
timestamps (eg: nanoseconds or CPU cycles) which would not necessarily
protect us from wrapping either.

> > I'm pretty sure that the two first options are way enough to cover almost
> > everyone's needs.
> For me a 12-bit class should work. My intention is to implement
> something similar to email spam filtering, where you generate a score
> based on the rules that match, and some rules are worth more than
> others. 8 bits might be a bit difficult to work within.

Makes sense.

> I also suspect the 16-bit option (32 second offset) is too small for
> date based.

Yes I agree. I'm pretty sure that for some TCP connections certain users
will want to have a long delay (eg: wait a minute before your FTP transfer
starts).

> The rest of this all seems fine. I've got no other thoughts to add.

Great, thanks!

> So the question is, which route do we go?
> 1. 32-bit int (12/20 split)
> 2. 64-bit int (32/32 split)

I think you can start with the 12/20 split first, knowing that if
anybody ever requests an extension of either time or classes we can
switch to 64 without much effort. Until this happens we'll be more
efficient on small systems. Conversely we can't do it the other way
around because we could break some setups :-)

> 3. 64-bit int (16/48 split) with absolute timestamps

I don't like it much for the reasons above.

> 4. 64-bit int with a single `set-priority` action. Let the user use a new
> `ms()` fetch for time-based priorities. The user does the bit shifting if they
> want to use both (using the `mul()` converter, or we can add a `shift()`).

I think this level of complexity ought to be done once in the code and not
each and every time for all users. It would only be manageable for a number
of the people on this list, but newcomers would call us names for this :-)

Thanks,
Willy
Patrick Hemmer
[PATCH 0/2] Re: Priority based queuing
May 10, 2018 07:40AM
On 2018/5/5 13:55, Willy Tarreau wrote:
> On Sat, May 05, 2018 at 01:33:51PM -0400, Patrick Hemmer wrote:
>>> Also I'm thinking that we can even use 32-bit by putting the frontier
>>> between the date and the fixed priority (let's call it class) somewhere
>>> else :
>>> - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
>>> - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
>>> - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
>> What's the benefit to using 32-bit over 64-bit, and is that benefit
>> worth the cost of the added code complexity and processing time?
>
> Slightly smaller memory footprint (doesn't count much), but more
> importantly much smaller processing time especially on 32-bit systems.
I would have expected the extra processing for wrapping to negate the
benefit.

>
>> If we
>> used 64-bit, we could do a 16/48 bit split and have an absolute
>> timestamp out to year 10889 at millisecond precision.
>
> Not really because that would force us to implement yet another clock
> and then to have two distinct set of clocks for queues and for all other
> internal events and timeouts (including the queue timeout!).
We already have the `now` and `date` variables. The priority is only
used for the queue key, and not to replace anything else. The queue key
is already getting replaced, so it would be no additional work.

> That would
> really be the beginning of a mess. We've been running with 32-bit
> millisecond-precise ticks for over a decade without ever a complaint that
> it was not enough nor that the wrapping at 49.7 days was an issue. If
> one day we'd decide to change this, it would be to support finer-grained
> timestamps (eg: nanoseconds or CPU cycles) which would not necessarily
> protect us from wrapping either.
>
>>> I'm pretty sure that the two first options are way enough to cover almost
>>> everyone's needs.
>> For me a 12-bit class should work. My intention is to implement
>> something similar to email spam filtering, where you generate a score
>> based on the rules that match, and some rules are worth more than
>> others. 8 bits might be a bit difficult to work within.
>
> Makes sense.
>
>> I also suspect the 16-bit option (32 second offset) is too small for
>> date based.
>
> Yes I agree. I'm pretty sure that for some TCP connections certain users
> will want to have a long delay (eg: wait a minute before your FTP transfer
> starts).
>
>> The rest of this all seems fine. I've got no other thoughts to add.
>
> Great, thanks!
>
>> So the question is, which route do we go?
>> 1. 32-bit int (12/20 split)
>> 2. 64-bit int (32/32 split)
>
> I think you can start with the 12/20 split first, knowing that if
> anybody ever requests an extension of either time or classes we can
> switch to 64 without much effort. Until this happens we'll be more
> efficient on small systems. Conversely we can't do it the other way
> around because we could break some setups :-)
>
>> 3. 64-bit int (16/48 split) with absolute timestamps
>
> I don't like it much for the reasons above.
>
>> 4. 64-bit int with a single `set-priority` action. Let the user use a new
>> `ms()` fetch for time-based priorities. The user does the bit shifting if
>> they want to use both (using the `mul()` converter, or we can add a
>> `shift()`).
>
> I think this level of complexity ought to be done once in the code and not
> each and every time for all users. It would only be manageable for a number
> of the people on this list, but newcomers would call us names for this :-)
>
> Thanks,
> Willy

Ok, so this patch set includes a fully functional implementation of the
priority class & offset. Unfortunately handling the offset got really
complicated due to the wrapping & locking.
There might be a little room for improvement, but I don't think much.

I'm still working on the code, but I wanted to submit this for review as
the final version probably won't be much different.

The second patch is a test I wrote to try and catch all the different
scenarios. I'm including it in case you were inclined to play with the tree
search algorithm and test it. It's not intended to be merged.

Things to do:
* Try and optimize the macros & tree search.
* Output a warning if `timeout queue` > max offset.
* Add LUA methods & sample fetches.
* General cleanup.
* Docs

-Patrick


Patrick Hemmer (2):
MEDIUM: add set-priority-class and set-priority-offset
tests for queue set-priority

Makefile.test | 21 ++++
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
tests/test-queue.cfg | 8 ++
tests/test_queue.c | 117 +++++++++++++++++++++
11 files changed, 396 insertions(+), 44 deletions(-)
create mode 100644 Makefile.test
create mode 100644 tests/test-queue.cfg
create mode 100644 tests/test_queue.c

--
2.16.3
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content.
The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first.
---
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
8 files changed, 250 insertions(+), 44 deletions(-)
Patrick Hemmer
[PATCH 2/2] tests for queue set-priority
May 10, 2018 07:40AM
---
Makefile.test | 21 +++++++++
tests/test-queue.cfg | 8 ++++
tests/test_queue.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+)
On 2018/5/10 01:35, Patrick Hemmer wrote:
> This adds the set-priority-class and set-priority-offset actions to
> http-request and tcp-request content.
> The priority values are used when connections are queued to determine
> which connections should be served first. The lowest priority class is
> served first. When multiple requests from the same class are found, the
> earliest (according to queue_time + offset) is served first.
> ---
> include/types/proxy.h | 2 +-
> include/types/queue.h | 2 +-
> include/types/server.h | 2 +-
> include/types/stream.h | 2 +
> src/proxy.c | 2 +-
> src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
> src/server.c | 2 +-
> src/stream.c | 2 +
> 8 files changed, 250 insertions(+), 44 deletions(-)
>
>
Noticed I had some uncommitted changes. Not much, but previous version
might not work right.

-Patrick
From 240a54488c01c4b0a15a561b8e7533922a487492 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 4 May 2018 16:31:16 -0400
Subject: [PATCH] MEDIUM: add set-priority-class and set-priority-offset

This adds the set-priority-class and set-priority-offset actions to http-request and tcp-request content.
The priority values are used when connections are queued to determine which connections should be served first. The lowest priority class is served first. When multiple requests from the same class are found, the earliest (according to queue_time + offset) is served first.
---
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/hlua.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
9 files changed, 252 insertions(+), 46 deletions(-)

diff --git a/include/types/proxy.h b/include/types/proxy.h
index 16c13a1c1..2cc1dfd9e 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -322,7 +322,7 @@ struct proxy {
int serverfin; /* timeout to apply to server half-closed connections */
} timeout;
char *id, *desc; /* proxy id (name) and description */
- struct list pendconns; /* pending connections with no server assigned yet */
+ struct eb_root pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
diff --git a/include/types/queue.h b/include/types/queue.h
index 42dbbd047..03377da69 100644
--- a/include/types/queue.h
+++ b/include/types/queue.h
@@ -35,7 +35,7 @@ struct pendconn {
struct stream *strm;
struct proxy *px;
struct server *srv; /* the server we are waiting for, may be NULL */
- struct list list; /* next pendconn */
+ struct eb32_node node;
__decl_hathreads(HA_SPINLOCK_T lock);
};

diff --git a/include/types/server.h b/include/types/server.h
index 0cd20c096..5339c911e 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -213,7 +213,7 @@ struct server {
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */

- struct list pendconns; /* pending connections */
+ struct eb_root pendconns; /* pending connections */
struct list actconns; /* active connections */
struct list *priv_conns; /* private idle connections attached to stream interfaces */
struct list *idle_conns; /* sharable idle connections attached or not to a stream interface */
diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..489a0b648 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,8 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ int16_t priority_class; /* priority class of the stream for the pending queue */
+ int32_t priority_offset; /* priority offset of the stream for the pending queue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

diff --git a/src/hlua.c b/src/hlua.c
index 65fa62ff6..1df8edfd3 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -7859,7 +7859,7 @@ void hlua_init(void)
socket_tcp.proxy = &socket_proxy;
socket_tcp.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_tcp.actconns);
- LIST_INIT(&socket_tcp.pendconns);
+ socket_tcp.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
@@ -7905,7 +7905,7 @@ void hlua_init(void)
socket_ssl.proxy = &socket_proxy;
socket_ssl.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_ssl.actconns);
- LIST_INIT(&socket_ssl.pendconns);
+ socket_ssl.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
diff --git a/src/proxy.c b/src/proxy.c
index 31253f14d..63a076855 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -726,7 +726,7 @@ void init_new_proxy(struct proxy *p)
{
memset(p, 0, sizeof(struct proxy));
p->obj_type = OBJ_TYPE_PROXY;
- LIST_INIT(&p->pendconns);
+ p->pendconns = EB_ROOT;
LIST_INIT(&p->acl);
LIST_INIT(&p->http_req_rules);
LIST_INIT(&p->http_res_rules);
diff --git a/src/queue.c b/src/queue.c
index 1c730c75c..0174ddc75 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -14,12 +14,16 @@
#include <common/memory.h>
#include <common/time.h>
#include <common/hathreads.h>
+#include <eb32tree.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -76,8 +80,69 @@ static void pendconn_unlink(struct pendconn *p)
else
p->px->nbpend--;
HA_ATOMIC_SUB(&p->px->totpend, 1);
- LIST_DEL(&p->list);
- LIST_INIT(&p->list);
+ eb32_delete(&p->node);
+}
+
+#define offset_boundary() (now_ms - (TIMER_LOOK_BACK >> 12) & 0xfffff)
+#define key_class(key) (key & 0xfff00000)
+#define key_offset(key) (key & 0xfffff)
+#define key_class_boundary(key) (key_class(key) | offset_boundary())
+static u32 key_incr(u32 key) {
+ u32 key_next;
+
+ key_next = key + 1;
+ if (key_class(key_next) != key_class(key))
+ key_next = key_class(key_next);
+ else if (key_next == key_class_boundary(key))
+ key_next += 0x100000;
+
+ return key_next;
+}
+
+static struct pendconn *next_pendconn(struct eb_root *pendconns, u32 min) {
+ struct eb32_node *node, *node2 = NULL;
+ u32 max;
+
+ // min is inclusive
+ // max is exclusive
+ max = key_class_boundary(min);
+
+ node = eb32_lookup_ge(pendconns, min);
+
+ if (node) {
+ if (node->key < max || (max <= min && key_class(node->key) == key_class(min)))
+ return eb32_entry(node, struct pendconn, node);
+ if (key_class(node->key) != key_class(min))
+ node2 = node;
+ if (max > min)
+ goto class_next;
+ }
+
+ if (max <= min)
+ node = eb32_lookup_ge(pendconns, key_class(min));
+ if (!node)
+ return NULL;
+ if (node->key < max && node->key < min)
+ return eb32_entry(node, struct pendconn, node);
+
+class_next:
+ if (node2) {
+ min = key_class_boundary(node2->key);
+ if (node2->key >= min)
+ return eb32_entry(node2, struct pendconn, node);
+ } else
+ min = key_class_boundary(min) + 0x100000;
+ node = eb32_lookup_ge(pendconns, min);
+ if (node && key_class(node->key) == key_class(min))
+ return eb32_entry(node, struct pendconn, node);
+ if (node2)
+ return eb32_entry(node2, struct pendconn, node);
+ // theoretically this should never find anything we haven't already found
+ //node = eb32_lookup_ge(pendconns, key_class(min));
+ //if (node)
+ // return eb32_entry(node, struct pendconn, node);
+
+ return NULL;
}

/* Process the next pending connection from either a server or a proxy, and
@@ -100,8 +165,9 @@ static void pendconn_unlink(struct pendconn *p)
*/
static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
{
- struct pendconn *p = NULL;
+ struct pendconn *p = NULL, *pp = NULL;
struct server *rsrv;
+ u32 pkey, ppkey;
int remote;

rsrv = srv->track;
@@ -109,38 +175,53 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
rsrv = srv;

if (srv->nbpend) {
- list_for_each_entry(p, &srv->pendconns, list) {
+ for (p = next_pendconn(&srv->pendconns, offset_boundary());
+ p;
+ p = next_pendconn(&srv->pendconns, key_incr(p->node.key)))
if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &p->lock))
- goto ps_found;
- }
- p = NULL;
+ break;
}
-
- ps_found:
- if (srv_currently_usable(rsrv) && px->nbpend) {
- struct pendconn *pp;
-
- list_for_each_entry(pp, &px->pendconns, list) {
- /* If the server pendconn is older than the proxy one,
- * we process the server one. */
- if (p && !tv_islt(&pp->strm->logs.tv_request, &p->strm->logs.tv_request))
- goto pendconn_found;
-
- if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock)) {
- /* Let's switch from the server pendconn to the
- * proxy pendconn. Don't forget to unlock the
- * server pendconn, if any. */
- if (p)
- HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
- p = pp;
- goto pendconn_found;
- }
- }
+ if (px->nbpend) {
+ for (pp = next_pendconn(&px->pendconns, offset_boundary());
+ pp;
+ pp = next_pendconn(&px->pendconns, key_incr(pp->node.key)))
+ if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock))
+ break;
}

- if (!p)
+ if (!p && !pp)
return 0;

+ if (p && !pp)
+ goto pendconn_found;
+ if (pp && !p) {
+ p = pp;
+ goto pendconn_found;
+ }
+ if (key_class(p->node.key) < key_class(pp->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ goto pendconn_found;
+ }
+ if (key_class(pp->node.key) < key_class(p->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ goto pendconn_found;
+ }
+
+ pkey = key_offset(p->node.key);
+ ppkey = key_offset(pp->node.key);
+ if (pkey < offset_boundary())
+ pkey += (TIMER_LOOK_BACK >> 12);
+ if (ppkey < offset_boundary())
+ ppkey += (TIMER_LOOK_BACK >> 12);
+ if (pkey <= ppkey) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ goto pendconn_found;
+ }
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ goto pendconn_found;
+
pendconn_found:
pendconn_unlink(p);
p->strm_flags |= SF_ASSIGNED;
@@ -206,6 +287,7 @@ struct pendconn *pendconn_add(struct stream *strm)
srv = objt_server(strm->target);
px = strm->be;

+ p->node.key = ((u32)(strm->priority_class + 0x7ff) << 20) | ((u32)(now_ms + strm->priority_offset) & 0xfffff);
p->srv = NULL;
p->px = px;
p->strm = strm;
@@ -219,7 +301,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.srv_queue_size += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
- LIST_ADDQ(&srv->pendconns, &p->list);
+ eb32_insert(&srv->pendconns, &p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
}
else {
@@ -228,7 +310,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.prx_queue_size += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
- LIST_ADDQ(&px->pendconns, &p->list);
+ eb32_insert(&px->pendconns, &p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &px->lock);
}
HA_ATOMIC_ADD(&px->totpend, 1);
@@ -241,7 +323,8 @@ struct pendconn *pendconn_add(struct stream *strm)
*/
int pendconn_redistribute(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int xferred = 0;
int remote = 0;

@@ -251,7 +334,10 @@ int pendconn_redistribute(struct server *s)
return 0;

HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
- list_for_each_entry_safe(p, pback, &s->pendconns, list) {
+ for (node = eb32_first(&s->pendconns);
+ node;
+ node = eb32_lookup_ge(&s->pendconns, node->key + 1)) {
+ p = eb32_entry(node, struct pendconn, node);
if (p->strm_flags & SF_FORCE_PRST)
continue;

@@ -280,7 +366,8 @@ int pendconn_redistribute(struct server *s)
*/
int pendconn_grab_from_px(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int maxconn, xferred = 0;
int remote = 0;

@@ -289,7 +376,10 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- list_for_each_entry_safe(p, pback, &s->proxy->pendconns, list) {
+ for (node = eb32_first(&s->proxy->pendconns);
+ node;
+ node = eb32_lookup_ge(&s->proxy->pendconns, node->key + 1)) {
+ p = eb32_entry(node, struct pendconn, node);
if (s->maxconn && s->served + xferred >= maxconn)
break;

@@ -337,7 +427,7 @@ int pendconn_dequeue(struct stream *strm)

/* the pendconn is still linked to the server/proxy queue, so unlock it
* and go away. */
- if (!LIST_ISEMPTY(&p->list)) {
+ if (p->node.node.leaf_p) {
HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
return 1;
}
@@ -369,19 +459,19 @@ void pendconn_free(struct pendconn *p)
HA_SPIN_LOCK(PENDCONN_LOCK, &p->lock);

/* The pendconn was already unlinked, just release it. */
- if (LIST_ISEMPTY(&p->list))
+ if (!p->node.node.leaf_p)
goto release;

if (p->srv) {
HA_SPIN_LOCK(SERVER_LOCK, &p->srv->lock);
p->srv->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &p->srv->lock);
}
else {
HA_SPIN_LOCK(PROXY_LOCK, &p->px->lock);
p->px->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &p->px->lock);
}
HA_ATOMIC_SUB(&p->px->totpend, 1);
@@ -392,6 +482,118 @@ void pendconn_free(struct pendconn *p)
pool_free(pool_head_pendconn, p);
}

+static enum act_return action_set_priority_class(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ if (smp->data.u.sint < -0x7ff)
+ smp->data.u.sint = -0x7ff;
+ else if (smp->data.u.sint > 0x7ff)
+ smp->data.u.sint = 0x7ff;
+
+ s->priority_class = smp->data.u.sint;
+ return ACT_RET_CONT;
+}
+
+static enum act_return action_set_priority_offset(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ if (smp->data.u.sint < -0x7ffff)
+ smp->data.u.sint = -0x7ffff;
+ else if (smp->data.u.sint > 0x7ffff)
+ smp->data.u.sint = 0x7ffff;
+
+ s->priority_offset = smp->data.u.sint;
+
+ return ACT_RET_CONT;
+}
+
+static enum act_parse_ret parse_set_priority_class(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_class;
+ return ACT_RET_PRS_OK;
+}
+
+static enum act_parse_ret parse_set_priority_offset(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_offset;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/server.c b/src/server.c
index ebac357fb..5617d8d67 100644
--- a/src/server.c
+++ b/src/server.c
@@ -1575,7 +1575,7 @@ static struct server *new_server(struct proxy *proxy)
srv->obj_type = OBJ_TYPE_SERVER;
srv->proxy = proxy;
LIST_INIT(&srv->actconns);
- LIST_INIT(&srv->pendconns);
+ srv->pendconns = EB_ROOT;

if ((srv->priv_conns = calloc(global.nbthread, sizeof(*srv->priv_conns))) == NULL)
goto free_srv;
diff --git a/src/stream.c b/src/stream.c
index 1d0b22ca3..9b22becc9 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -220,6 +220,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority_class = 0;
+ s->priority_offset = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3
Patrick Hemmer
[PATCH 0/2] Re: Priority based queuing
May 11, 2018 07:00PM
Ok, so here is the full submission for priority based queuing.

Notes since previous update:

Wasn't really able to optimize the tree search any. Tried a few things,
but nothing made a measurable performance difference.

I added a warning message and documentation making clear the issues with
timestamp wrapping.
One thing that might not be completely obvious is that even if the user
does not configure `set-priority-offset` at all, they're still
susceptible to the wrapping issue, since the timestamp-based priority is
used as the queue key whether or not it is adjusted.
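As a hedged illustration of the wrapping issue described above (not the patch's actual code), the usual way to order two keys derived from a wrapping counter is a signed difference, which stays correct as long as the two keys are less than half the counter range apart:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: when the queue key embeds a wrapping tick
 * counter, comparing keys with a signed difference keeps ordering
 * correct across the wrap point, provided the two keys are less
 * than half the counter range apart. */
static int key_older(uint32_t a, uint32_t b)
{
    /* negative signed difference means "a" was queued before "b" */
    return (int32_t)(a - b) < 0;
}
```

A key taken just before the counter wraps still compares as older than one taken just after, which is exactly the property a naive unsigned `<` comparison loses.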

The implementation of the %sq (srv_queue) and %bq (backend_queue) log
fields was changed to keep their description accurate. The description
is "number of requests which were processed before this one". The
previous implementation just stored the size of the queue at the time
the connection was queued. Since we can now inject a connection into the
middle of the queue, this no longer works. Instead we keep a count of
dequeued connections, and take the difference between the counter value
when the connection was queued and when it was dequeued. This also means
the value will be slightly different even for users who don't use
priority, as the previous method would have included connections which
closed without being processed.
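The %sq/%bq accounting change described above can be sketched as follows (names are illustrative, not the patch's actual identifiers): snapshot a running dequeue counter at enqueue time, and report the delta at dequeue time.

```c
#include <assert.h>

/* Hypothetical sketch of counter-delta queue accounting: instead of
 * recording the queue length when a connection is enqueued (wrong once
 * entries can be inserted mid-queue), record the running dequeue
 * counter and report the delta when the connection is dequeued. */
struct queue_stats {
    unsigned dequeued;   /* total connections dequeued so far */
};

/* snapshot taken when the connection enters the queue */
static unsigned enqueue_mark(const struct queue_stats *q)
{
    return q->dequeued;
}

/* number of requests processed between enqueue and dequeue */
static unsigned served_before(const struct queue_stats *q, unsigned mark)
{
    return q->dequeued - mark;
}
```

Unlike a queue-length snapshot, the delta counts only connections that were actually served ahead of this one, regardless of where in the queue it was inserted.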

I added sample fetches for retrieving the class/offset of the current
transaction. I think it might be beneficial to add some other fetches
for tracking the health of the queue, such as the average class/offset,
or an exponential moving average of the class/offset for requests added
to the queue, requests processed, and requests which closed or timed
out. But this is just more state the code would have to store, so I'm
unsure whether they're worth it.


I wasn't convinced the 64-bit key was a bad idea, so I implemented it
with a 12/52 split and an absolute timestamp. On my system (which is
64-bit) the performance is about 20% faster. The code is much simpler,
and it also solves the limitations and issues with wrapping.
The patch for this is included in case it's of interest.
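A minimal sketch of the 12/52 split mentioned above (the bit layout and the class bias are assumptions for illustration, not the patch's exact code): the signed 12-bit class occupies the top bits so it dominates the ordering, and the absolute millisecond timestamp plus offset fills the lower 52 bits, pushing wraparound out by roughly 142,000 years.

```c
#include <assert.h>
#include <stdint.h>

#define CLASS_BITS 12
#define TS_BITS    52

/* Hypothetical 64-bit queue key: class in the top 12 bits, absolute
 * millisecond timestamp (plus the per-request offset) in the low 52.
 * The class is biased so negative classes still sort first when the
 * key is compared as an unsigned integer. */
static uint64_t make_key(int cls, uint64_t ts_ms, int64_t offset_ms)
{
    uint64_t ucls = (uint64_t)(cls + (1 << (CLASS_BITS - 1)));
    uint64_t ts   = (uint64_t)((int64_t)ts_ms + offset_ms)
                    & ((1ULL << TS_BITS) - 1);
    return (ucls << TS_BITS) | ts;
}
```

With this layout a plain unsigned comparison of keys yields the intended order: lower class first, and within a class, earlier effective timestamp first.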

-Patrick


Patrick Hemmer (2):
MEDIUM: add set-priority-class and set-priority-offset
use a 64-bit int with absolute timestamp for priority-offset

doc/configuration.txt | 38 +++++++
doc/lua-api/index.rst | 18 ++++
include/types/proxy.h | 3 +-
include/types/queue.h | 2 +-
include/types/server.h | 3 +-
include/types/stream.h | 7 +-
src/cfgparse.c | 15 +++
src/hlua.c | 69 +++++++++----
src/log.c | 4 +-
src/proto_http.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 261 +++++++++++++++++++++++++++++++++++++++++--------
src/server.c | 2 +-
src/stream.c | 10 +-
14 files changed, 366 insertions(+), 72 deletions(-)

--
2.16.3
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content.
The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first.
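To make the behavior concrete, here is a hedged configuration sketch of how these actions might be used for the DOS-mitigation scenario from the start of the thread. The directive syntax follows this patch's description; the ACL names, file paths, and the millisecond unit of the offset are illustrative assumptions.

```
frontend fe_web
    bind :80
    # requests carrying a valid application cookie (name assumed)
    acl valid_session cook(APPSESSION) -m found
    # requests from cloud-provider ranges (path assumed)
    acl from_cloud src -f /etc/haproxy/cloud-ranges.lst

    # logged-in users go to the front of the queue
    http-request set-priority-class int(-5) if valid_session
    # suspected bot sources wait behind everyone else
    http-request set-priority-class int(5) if from_cloud
    # within a class, treat session traffic as 10s older (offset assumed in ms)
    http-request set-priority-offset int(-10000) if valid_session

    default_backend be_app
```

Note that none of this has any effect until the backend's maxconn is reached and a queue actually forms; below saturation, requests are served as they arrive.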
---
doc/configuration.txt | 38 ++++++
doc/lua-api/index.rst | 18 +++
include/types/proxy.h | 3 +-
include/types/queue.h | 2 +-
include/types/server.h | 3 +-
include/types/stream.h | 7 +-
src/cfgparse.c | 15 +++
src/hlua.c | 74 ++++++++---
src/log.c | 4 +-
src/proto_http.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 10 +-
14 files changed, 453 insertions(+), 74 deletions(-)
---
include/types/queue.h | 2 +-
src/hlua.c | 5 --
src/queue.c | 144 +++++++++++---------------------------------------
3 files changed, 33 insertions(+), 118 deletions(-)
On 2018/5/11 12:52, Patrick Hemmer wrote:
> This adds the set-priority-class and set-priority-offset actions to
> http-request and tcp-request content.
> The priority values are used when connections are queued to determine
> which connections should be served first. The lowest priority class is
> served first. When multiple requests from the same class are found, the
> earliest (according to queue_time + offset) is served first.
> ---
> doc/configuration.txt | 38 ++++++
> doc/lua-api/index.rst | 18 +++
> include/types/proxy.h | 3 +-
> include/types/queue.h | 2 +-
> include/types/server.h | 3 +-
> include/types/stream.h | 7 +-
> src/cfgparse.c | 15 +++
> src/hlua.c | 74 ++++++++---
> src/log.c | 4 +-
> src/proto_http.c | 4 +-
> src/proxy.c | 2 +-
> src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
> src/server.c | 2 +-
> src/stream.c | 10 +-
> 14 files changed, 453 insertions(+), 74 deletions(-)
>
>
I found one issue I'll need to fix. During final code cleanup I
accidentally replaced good code with the `key_incr` function in 2 places
(pendconn_redistribute and pendconn_grab_from_px). I'll fix this and
submit a new patch pending feedback.

-Patrick