Priority based queuing

Patrick Hemmer
Priority based queuing
May 02, 2018 03:40AM
Would it be possible to add priority based queuing to haproxy? By this I
mean that when a server/backend is full (maxconn), incoming
requests would be added to the queue in a custom order. The idea here is
to make sure that, when the system is under stress, the important
requests get handled first.

In our exact use case, we're looking to use this to help mitigate DOS
attacks. The idea is that if a layer 7 attack is saturating the backend
servers, we can add logic to prioritize the requests. This logic might
be things like requests that have a valid application cookie go to the
front of the queue, or requests that come from a cloud provider (e.g.
EC2) go to the back of the queue.
DOS mitigation is hard because while you can write rules to identify
requests that are suspicious, you don't want to block them outright as
it is possible they might be legitimate. With prioritization, the
requests still get through, and are only affected when the backend is
saturated. If maxconn is not reached, the prioritization has no effect
at all (since queue is empty).

I made the change to haproxy and simulated the conditions in a lab, and
the strategy appears to work.
The change to haproxy was very minor, ~10 lines in queue.c, using
`task->nice` as the prioritization key. However my change is a very
rough PoC, and not worthy of submission.
So before continuing any further down this route, I wanted to see if
this is something that could make it into HAProxy, and what any thoughts
on it might be.

-Patrick
Willy Tarreau
Re: Priority based queuing
May 02, 2018 05:10PM
On Tue, May 01, 2018 at 09:34:14PM -0400, Patrick Hemmer wrote:
> Would it be possible to add priority based queuing to haproxy? By this I
> mean that when a server/backend is full (maxconn), incoming
> requests would be added to the queue in a custom order. The idea here is
> to make sure that, when the system is under stress, the important
> requests get handled first.

Hehe that's fun that you mention this, as this has been postponed since
around 1.2 or 1.3! Back then we didn't have the equivalent of HTTP rules
to add/subtract some priority. Now we have everything to do it, we "just"
need to replace the lists with priority trees in the queues and that's
all. It's not a big work if someone is interested in working on this.
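
To make that concrete, here is a rough sketch of what a priority tree in
the queue could look like, using the ebtree library that HAProxy already
bundles. The names and layout are illustrative only, not actual HAProxy
code:

#include <stdint.h>
#include <eb32tree.h>   /* ebtree library shipped with HAProxy */

/* a queued connection, keyed by priority instead of plain FIFO order */
struct pendconn {
    struct eb32_node node;  /* node.key = priority; lower is served first */
    /* ... the existing pendconn fields would remain here ... */
};

static struct eb_root queue_root = EB_ROOT;

static void queue_add(struct pendconn *p, uint32_t prio)
{
    p->node.key = prio;
    eb32_insert(&queue_root, &p->node); /* duplicate keys keep FIFO order */
}

static struct pendconn *queue_pick(void)
{
    struct eb32_node *n = eb32_first(&queue_root); /* smallest key first */
    if (!n)
        return NULL;
    eb32_delete(n);
    return eb32_entry(n, struct pendconn, node);
}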

> In our exact use case, we're looking to use this to help mitigate DOS
> attacks. The idea is that if a layer 7 attack is saturating the backend
> servers, we can add logic to prioritize the requests. This logic might
> be things like requests that have a valid application cookie go to the
> front of the queue, or requests that come from a cloud provider (e.g.
> EC2) go to the back of the queue.

That's exactly why I wanted them to be manipulated via http-request rules,
so that everyone can construct his own conditions. Also I found that for
most shopping sites, having time-based priority is more important than
position-based : you often want this type of request to be processed 100ms
faster than another type of request. With HTTP/2 it will be even more
interesting because it will allow to send the important objects used for
rendering before the other ones, which is very similar to the H2 priority
but more fine-grained if you can adjust it on the fly.

> DOS mitigation is hard because while you can write rules to identify
> requests that are suspicious, you don't want to block them outright as
> it is possible they might be legitimate. With prioritization, the
> requests still get through, and are only affected when the backend is
> saturated. If maxconn is not reached, the prioritization has no effect
> at all (since queue is empty).

I wholeheartedly agree with you :-)

> I made the change to haproxy and simulated the conditions in a lab, and
> the strategy appears to work.
> The change to haproxy was very minor, ~10 lines in queue.c, using
> `task->nice` as the prioritization key. However my change is a very
> rough PoC, and not worthy of submission.

For a rough PoC it's indeed perfectly fine. But for a final design we
really need a separate offset. I've really been convinced in the field about
using time rather than position; if you want to experiment with this I
can give you some insights, it's the same in fact.

> So before continuing any further down this route, I wanted to see if
> this is something that could make it into HAProxy, and what any thoughts
> on it might be.

Absolutely! I've dreamed of it for over a decade, so I'm glad someone
is willing to take care of it! Just checked, the item was added 12
years ago to the roadmap file in 1.2.13 on 2006/05/13 by commit 814cbc6
("[DOC] added (and updated) the ROADMAP file"). The lines were :

- wait queues replaced for priority-based trees
- ability to assign a prio based on L7 matching

The goal has not changed since, I'm patient :-)

Willy
Patrick Hemmer
Re: Priority based queuing
May 02, 2018 06:50PM
On 2018/5/2 11:04, Willy Tarreau wrote:
> On Tue, May 01, 2018 at 09:34:14PM -0400, Patrick Hemmer wrote:
>> Would it be possible to add priority based queuing to haproxy? By this I
>> mean that when a server/backend is full (maxconn), incoming
>> requests would be added to the queue in a custom order. The idea here is
>> to make sure that, when the system is under stress, the important
>> requests get handled first.
> Hehe that's fun that you mention this, as this has been postponed since
> around 1.2 or 1.3! Back then we didn't have the equivalent of HTTP rules
> to add/subtract some priority. Now we have everything to do it, we "just"
> need to replace the lists with priority trees in the queues and that's
> all. It's not a big work if someone is interested in working on this.
>
>> In our exact use case, we're looking to use this to help mitigate DOS
>> attacks. The idea is that if a layer 7 attack is saturating the backend
>> servers, we can add logic to prioritize the requests. This logic might
>> be things like requests that have a valid application cookie go to the
>> front of the queue, or requests that come from a cloud provider (e.g.
>> EC2) go to the back of the queue.
> That's exactly why I wanted them to be manipulated via http-request rules,
> so that everyone can construct his own conditions. Also I found that for
> most shopping sites, having time-based priority is more important than
> position-based : you often want this type of request to be processed 100ms
> faster than another type of request. With HTTP/2 it will be even more
> interesting because it will allow to send the important objects used for
> rendering before the other ones, which is very similar to the H2 priority
> but more fine-grained if you can adjust it on the fly.
>
>> DOS mitigation is hard because while you can write rules to identify
>> requests that are suspicious, you don't want to block them outright as
>> it is possible they might be legitimate. With prioritization, the
>> requests still get through, and are only affected when the backend is
>> saturated. If maxconn is not reached, the prioritization has no effect
>> at all (since queue is empty).
> I wholeheartedly agree with you :-)
>
>> I made the change to haproxy and simulated the conditions in a lab, and
>> the strategy appears to work.
>> The change to haproxy was very minor, ~10 lines in queue.c, using
>> `task->nice` as the prioritization key. However my change is a very
>> rough PoC, and not worthy of submission.
> For a rough PoC it's indeed perfectly fine. But for a final design we
> really need a separate offset. I've really been convinced in the field about
> using time rather than position; if you want to experiment with this I
> can give you some insights, it's the same in fact.
Can you elaborate on what you're thinking of for a time-based queue?

What I'm imagining you mean is that you would write a rule to set the
max queue time, and haproxy would insert it into the queue sorting on
TIME_NOW() + MAX_QUEUE_TIME. The main difference I see to this approach
vs scoring, is that you ensure that an item doesn't sit on the queue
forever (or whatever `timeout queue` is set to) if higher priority stuff
keeps getting inserted before it.
I don't think this is necessarily a good thing. If you're under a DOS
attack, the goal is to get the good requests processed before any
possible malicious requests. With a time based queue, those malicious
requests will still get processed and starve out the good requests. For
example let's say you're under attack, a bad request comes in with
max_queue_time=1000ms, and then after 999ms elapse, a good request comes
in with max_queue_time=10ms. You have a good request, and a bad request
on the queue, but HAProxy is going to process the bad request first
because its timer is expiring first. Essentially if haproxy is receiving
X good requests per second, and Y bad requests per second, it's still
going to forward X good per second, and Y bad per second, to the backend
server. The only difference is that they're time shifted.

The other thing you could mean by time-based is to insert into
the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
requests that don't get processed by their deadline. Essentially a
per-request version of the `timeout queue` setting. But I don't see any
real value in this.

Or do you mean something else?

>
>> So before continuing any further down this route, I wanted to see if
>> this is something that could make it into HAProxy, and what any thoughts
>> on it might be.
> Absolutely! I've dreamed of it for over a decade, so I'm glad someone
> is willing to take care of it! Just checked, the item was added 12
> years ago to the roadmap file in 1.2.13 on 2006/05/13 by commit 814cbc6
> ("[DOC] added (and updated) the ROADMAP file"). The lines were :
>
> - wait queues replaced for priority-based trees
> - ability to assign a prio based on L7 matching
>
> The goal has not changed since, I'm patient :-)
>
> Willy
Willy Tarreau
Re: Priority based queuing
May 02, 2018 07:30PM
On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
> Can you elaborate on what you're thinking of for a time-based queue?
>
> What I'm imagining you mean is that you would write a rule to set the
> max queue time, and haproxy would insert it into the queue sorting on
> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see to this approach
> vs scoring, is that you ensure that an item doesn't sit on the queue
> forever (or whatever `timeout queue` is set to) if higher priority stuff
> keeps getting inserted before it.

In part, but mostly that under contention you can still maintain a
good quality of service. I remember having had a discussion a very
long time ago with a gaming site operator explaining that he wanted
to send profile management requests to a dedicated slow server so
that people filling in their profile do not disturb the ones playing.
For me it's a good example : "how long do you think these people are
willing to wait for the server to respond to each click to update
their profile?". Let's say 2 seconds is OK, you just add a 2s offset
to the position in the queue for such requests. If in the mean time
other higher priority requests come in, the delayed one may face the
2 seconds delay. But it will not face more. And obviously if no other
request is pending before it in the queue it will be served earlier.

A similar example is for certain shopping sites which consider that
once you have something in your shopping cart, you're much more likely
to finish the process and to pay because you've found the item you
were looking for, so you're more willing to tolerate a slightly higher
response time, and as such provide a better response time to newcomers.
But while this small delay can easily be expressed in milliseconds
(probably 50/100), it's almost impossible to define it using positions.
What if 50 new visitors suddenly come in ? You don't want the guy with
his item to suddenly experience a timeout. The time-based queue however
grants you a control over the service degradation you're accepting to
inflict to your users.

Another interesting example is when you want to provide strong service
guarantees to some top-level visitors while running under intense load.
By using only a position, it's easy to say "I want the Gold class to
advance by 1000 positions". But if the site is highly loaded and you
have 10k pending requests, these 1000 positions remain irrelevant,
because the request sits there, waiting for 90% of the pollution to
be flushed. If you say "I want these requests to advance by 10 seconds"
then no matter how many other requests you have in the queue, it will
really advance by 10 seconds and effectively shrink the response time
by 10 seconds.

Overall for user experience it's important to think in time and not
positions. The rationale behind this is simple : the user will never
accept as a justification for varying quality of service the number of
other visitors. And positions just depend on this, it's an average number
of people you're allowing to pass over. If I'm granted a pass ticket
allowing me to advance 10 places in the cinema queue and I find a
crowded street, I will go back home. If I'm told I won't wait more
than 5 minutes, I'll come regardless of the number of waiters.

> I don't think this is necessarily a good thing.

For having considered the two options several times in the field over the
last decade when sites started to crawl, I became very well convinced ;-)

> If you're under a DOS
> attack, the goal is to get the good requests processed before any
> possible malicious requests. With a time based queue, those malicious
> requests will still get processed and starve out the good requests.

Precisely not that much because the good ones will pass before, and
the malicious ones will then be subject to the queue timeout if there
are too many.

> For
> example let's say you're under attack, a bad request comes in with
> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
> in with max_queue_time=10ms. You have a good request, and a bad request
> on the queue, but HAProxy is going to process the bad request first
> because its timer is expiring first.

Absolutely but you fell into the trap of not considering that the queue
is supposed to be full, so you're in fact comparing only two requests,
while in practice you have many of them and there the effect is much
more visible.

> Essentially if haproxy is receiving
> X good requests per second, and Y bad requests per second, it's still
> going to forward X good per second, and Y bad per second, to the backend
> server. The only difference is that they're time shifted.

Except once subject to the queue timeout. It's a very important aspect we
must not dismiss.

> The other thing you could mean by time-based is to insert into
> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
> requests that don't get processed by their deadline. Essentially a
> per-request version of the `timeout queue` setting. But I don't see any
> real value in this.
>
> Or do you mean something else?

What I mean is that the queue is indexed on the computed service time
which is (current_time + offset), where offset can be either positive or
negative, null by default. The queue is sorted by service time, just like
any timer in fact. Then the queue collection simply picks the next event
in the queue.

Please also note that right now the queue already considers the time to
maintain fairness between requests targeting a specific server and those
who just want any server. This is what ensures no request starves in either
the server's queue or the backend's queue. With a time-sorted queue, instead
of picking the tasks according to their time of arrival, we'd simply pick
them based on the service time, and we could maintain an excellent fairness
between servers and backends.
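
As a minimal illustration of this service-time index (hypothetical code,
with a 64-bit key to sidestep the wrapping question that comes up later
in the thread): the queue stays sorted on current_time + offset, and the
dispatcher always serves the head.

#include <stdint.h>

/* each queued request carries its computed service time */
struct qreq {
    int64_t service_time;   /* now_ms + offset_ms; offset is 0 by default */
    struct qreq *next;
};

/* keep the list sorted ascending so the head is always the next served;
 * walking past equal keys preserves FIFO order among equal priorities */
static void enqueue(struct qreq **head, struct qreq *r,
                    int64_t now_ms, int64_t offset_ms)
{
    r->service_time = now_ms + offset_ms;   /* offset may be negative */
    while (*head && (*head)->service_time <= r->service_time)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}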

Regarding the DOS situations, there are certain aspects which slightly
differ from the points around the quality of service. DOS fighting involves
sacrifice, and basic quality of service hardly provides this by default. In
our case, sacrifice happens on a queue size or time. And the decision should
depend on the request's importance, which is often very different from the
desired response time. The well known example is the "Pay" button on many
sites : you click on it, and if it doesn't respond fast, you will not press
Escape. You're keeping fingers crossed hoping it will not return an error,
even if it takes 30 seconds. And moreover, you're happy once it responds!
Here that's what makes the difference with the response time : in fact you'd
like to say that certain requests are highly sensitive but not urgent, and
that you'd like to be able to increase their timeout and ultimately get
served.

To fight DOS it's the same. Commonly, many sites consider that requests
holding a valid cookie correspond to authenticated users and score much
better in terms of trustability. Thus by having adjustable timeouts, it
would make it very easy to adjust the queue timeout for a request based
on the presence of a cookie or not.

Now when you think about it, a reasonable but simple strategy for some of
the examples above could be summarized to this :

timeout queue 10s
# pay button not urgent but needs to work
http-request set-timeout queue 60s if PAY_BUTTON
http-request set-priority -10s if PAY_BUTTON

# unauthenticated less urgent, can be a DOS
http-request set-priority -1s if NO_COOKIE || FAVICON

# some elements that need to be delivered quickly
http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS

# auto-completion needs to be fast but no problem if it doesn't work
http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
http-request set-priority 10s if AUTO_COMPLETION_REQUEST

# inconsistent user-agents are suspicious, delay them just in case
http-request set-priority -20s if SUSPICIOUS_USER_AGENT

# too fast users need to slow down a little bit
http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }

With a proper API we can even imagine having access to these request
properties from Lua to implement far more advanced policies.

Regards,
Willy
Patrick Hemmer
Re: Priority based queuing
May 02, 2018 10:40PM
On 2018/5/2 13:22, Willy Tarreau wrote:
> On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
>> Can you elaborate on what you're thinking of for a time-based queue?
>>
>> What I'm imagining you mean is that you would write a rule to set the
>> max queue time, and haproxy would insert it into the queue sorting on
>> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see to this approach
>> vs scoring, is that you ensure that an item doesn't sit on the queue
>> forever (or whatever `timeout queue` is set to) if higher priority stuff
>> keeps getting inserted before it.
> In part, but mostly that under contention you can still maintain a
> good quality of service. I remember having had a discussion a very
> long time ago with a gaming site operator explaining that he wanted
> to send profile management requests to a dedicated slow server so
> that people filling in their profile do not disturb the ones playing.
> For me it's a good example : "how long do you think these people are
> willing to wait for the server to respond to each click to update
> their profile?". Let's say 2 seconds is OK, you just add a 2s offset
> to the position in the queue for such requests. If in the mean time
> other higher priority requests come in, the delayed one may face the
> 2 seconds delay. But it will not face more. And obviously if no other
> request is pending before it in the queue it will be served earlier.
>
> A similar example is for certain shopping sites which consider that
> once you have something in your shopping cart, you're much more likely
> to finish the process and to pay because you've found the item you
> were looking for, so you're more willing to tolerate a slightly higher
> response time, and as such provide a better response time to newcomers.
> But while this small delay can easily be expressed in milliseconds
> (probably 50/100), it's almost impossible to define it using positions.
> What if 50 new visitors suddenly come in ? You don't want the guy with
> his item to suddenly experience a timeout. The time-based queue however
> grants you a control over the service degradation you're accepting to
> inflict to your users.
There's a key difference in our designs. You're trying to classify
traffic based on destination. I'm trying to classify traffic based on
source.
In my design, the maxconn is set to a level such that the server is
healthy and every request can be processed fast. With this, and a score
based queue, and under attack, the good requests will be drained fast,
leaving only the bad requests. Thus no normal user should be getting a
timeout, no matter what resource they're going to.
Your design works when the backend system can't keep up with good
legitimate traffic, and you need to prioritize one good request over
another good request. My design is for when the backend system can't keep
with bad traffic, and so we want to process the good traffic first.

>
> Another interesting example is when you want to provide strong service
> guarantees to some top-level visitors while running under intense load.
> By using only a position, it's easy to say "I want the Gold class to
> advance by 1000 positions". But if the site is highly loaded and you
> have 10k pending requests, these 1000 positions remain irrelevant,
> because the request sits there, waiting for 90% of the pollution to
> be flushed.
I think you're misunderstanding my design, as scoring wouldn't work like
this at all. If you give the gold class a score of 1000 (where higher
number means higher priority), then the only thing that would get
processed before gold class would be another class that has a score >
1000. If nothing does, and a gold request comes in, it gets processed
first, no matter how big the queue is.
Think of scoring as instead having multiple queues. Each queue is only
processed if the queue before it is empty.
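
A sketch of that mental model in code, purely illustrative:

#include <stddef.h>

#define NB_CLASSES 4

struct req { struct req *next; };
static struct req *classes[NB_CLASSES]; /* per-class FIFO heads; 0 = highest */

/* class N is only looked at once classes 0..N-1 are all empty */
static struct req *pick_next(void)
{
    for (int i = 0; i < NB_CLASSES; i++) {
        if (classes[i]) {
            struct req *r = classes[i];
            classes[i] = r->next;          /* FIFO within a class */
            return r;
        }
    }
    return NULL;                           /* everything is empty */
}
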
> If you say "I want these requests to advance by 10 seconds"
> then no matter how many other requests you have in the queue, it will
> really advance by 10 seconds and effectively shrink the response time
> by 10 seconds.
It may shrink the response time by 10 seconds, but if the normal
response time is 60 seconds, that's still 50 seconds of waiting. If your
users are only willing to put up with 30 seconds of wait, this means you
end up failing *everyone*.

>
> Overall for user experience it's important to think in time and not
> positions. The rationale behind this is simple : the user will never
> accept as a justification for varying quality of service the number of
> other visitors. And positions just depend on this, it's an average number
> of people you're allowing to pass over. If I'm granted a pass ticket
> allowing me to advance 10 places in the cinema queue and I find a
> crowded street, I will go back home. If I'm told I won't wait more
> than 5 minutes, I'll come regardless of the number of waiters.
But a time based queue wouldn't work like this. You can't guarantee
service within 5 minutes. What happens if a ton of people arrive at the
same time who were told "you will never wait more than 5 minutes", and
the cinema can't handle them all? Or if you're the only person in the
line and nobody leaves the cinema for 6 minutes? Or there are already
100 people in line who were told "you will never wait more than 1 hour",
and that hour is almost up?

>
>> I don't think this is necessarily a good thing.
> For having considered the two options several times in the field over the
> last decade when sites started to crawl, I became very well convinced ;-)
>
>> If you're under a DOS
>> attack, the goal is to get the good requests processed before any
>> possible malicious requests. With a time based queue, those malicious
>> requests will still get processed and starve out the good requests.
> Precisely not that much because the good ones will pass before, and
> the malicious ones will then be subject to the queue timeout if there
> are too many.
>
>> For
>> example let's say you're under attack, a bad request comes in with
>> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
>> in with max_queue_time=10ms. You have a good request, and a bad request
>> on the queue, but HAProxy is going to process the bad request first
>> because its timer is expiring first.
> Absolutely but you fell into the trap of not considering that the queue
> is supposed to be full, so you're in fact comparing only two requests,
> while in practice you have many of them and there the effect is much
> more visible.
I don't think so. Server connections are full, not the queue. The point
I was trying to make is that you have a request you are pretty confident
is bad, and you waste resources processing it while the request you
think is good gets to wait. No matter how big the queue is, as long as a
bad request sits in front of a good request in the queue (regardless of
whether good request chronologically came after bad, or bad request
chronologically came after good), this applies.
>
>> Essentially if haproxy is receiving
>> X good requests per second, and Y bad requests per second, it's still
>> going to forward X good per second, and Y bad per second, to the backend
>> server. The only difference is that they're time shifted.
> Except once subject to the queue timeout. It's a very important aspect we
> must not dismiss.
>
>> The other thing you could mean by time-based is to insert into
>> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
>> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
>> requests that don't get processed by their deadline. Essentially a
>> per-request version of the `timeout queue` setting. But I don't see any
>> real value in this.
>>
>> Or do you mean something else?
> What I mean is that the queue is indexed on the computed service time
> which is (current_time + offset), where offset can be either positive or
> negative, null by default. The queue is sorted by service time, just like
> any timer in fact. Then the queue collection simply picks the next event
> in the queue.
>
> Please also note that right now the queue already considers the time to
> maintain fairness between requests targeting a specific server and those
> who just want any server. This is what ensures no request starves in either
> the server's queue or the backend's queue. With a time-sorted queue, instead
> of picking the tasks according to their time of arrival, we'd simply pick
> them based on the service time, and we could maintain an excellent fairness
> between servers and backends.
>
> Regarding the DOS situations, there are certain aspects which slightly
> differ from the points around the quality of service. DOS fighting involves
> sacrifice, and basic quality of service hardly provides this by default. In
> our case, sacrifice happens on a queue size or time. And the decision should
> depend on the request's importance, which is often very different from the
> desired response time. The well known example is the "Pay" button on many
> sites : you click on it, and if it doesn't respond fast, you will not press
> Escape. You're keeping fingers crossed hoping it will not return an error,
> even if it takes 30 seconds. And moreover, you're happy once it responds!
> Here that's what makes the difference with the response time : in fact you'd
> like to say that certain requests are highly sensitive but not urgent, and
> that you'd like to be able to increase their timeout and ultimately get
> served.
>
> To fight DOS it's the same. Commonly, many sites consider that requests
> holding a valid cookie correspond to authenticated users and score much
> better in terms of trustability. Thus by having adjustable timeouts, it
> would make it very easy to adjust the queue timeout for a request based
> on the presence of a cookie or not.
>
> Now when you think about it, a reasonable but simple strategy for some of
> the examples above could be summarized to this :
>
> timeout queue 10s
> # pay button not urgent but needs to work
> http-request set-timeout queue 60s if PAY_BUTTON
> http-request set-priority -10s if PAY_BUTTON
>
> # unauthenticated less urgent, can be a DOS
> http-request set-priority -1s if NO_COOKIE || FAVICON
>
> # some elements that need to be delivered quickly
> http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS
>
> # auto-completion needs to be fast but no problem if it doesn't work
> http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
> http-request set-priority 10s if AUTO_COMPLETION_REQUEST
>
> # inconsistent user-agents are suspicious, delay them just in case
> http-request set-priority -20s if SUSPICIOUS_USER_AGENT
>
> # too fast users need to slow down a little bit
> http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }
>
> With a proper API we can even imagine having access to these request
> properties from Lua to implement far more advanced policies.
>
> Regards,
> Willy
Here's an example scenario that a time based queue cannot handle:
Let's say you have a server that can only process 10 requests per second.
Your normal traffic is 2 requests per second (500ms between req). Then
you start getting an attack of 50 requests per second from
SUSPICIOUS_USER_AGENT (20ms between req). The queue will end up with 520
requests (50rps*10s + 2rps*10s # the 10s is the timeout). Then an
INDEX_HTML request comes in. With a `set-priority 1s`, you advance it up
1s in the queue, which puts it ahead of 52 requests, but there's still
468 requests in front of it, which means it'll be 46.8s before it is
processed, and with `timeout queue 10s`, it times out. The only request
which will work is the PAY_BUTTON, and that's because of the increased
timeout, as all the SUSPICIOUS_USER_AGENT requests in front of it will
time out first, and at prio -10s, it'll take 20s to complete.

Let's go through this same scenario with a scoring system. Let's assume
that in `set-priority Xs`, the "Xs" becomes "X", where positive means
higher priority.
The queue will have 500 requests pending (this is different from the 520
above; see the end for explanation). Then an INDEX_HTML request comes in.
With `set-priority 1`, because all the SUSPICIOUS_USER_AGENT requests
have priority -20, the INDEX_HTML goes to the front of the queue and
gets processed as soon as a connection is freed. Let's also assume a
NO_COOKIE comes in, with priority -1, it'll go right after the
INDEX_HTML request. And then because SUSPICIOUS_USER_AGENT is constantly
at the end of the queue, all the good requests are processed in
near-normal time. The only wait is for a connection slot to open up. And
if the normal traffic rate is 2 requests per second and server can
handle 10 per second, there should never be more than 1 good request in
the queue (except if they come in at the same time). This is also why at
the beginning, the queue only has 500 requests pending (the only thing
in the queue is SUSPICIOUS_USER_AGENT, so 50rps*10s=500).
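
For anyone who wants to check the arithmetic in this scenario, here is a
small illustrative program using the rates and timeout assumed above:

#include <stdio.h>

int main(void)
{
    int good_rate = 2;    /* normal traffic, requests per second */
    int bad_rate  = 50;   /* SUSPICIOUS_USER_AGENT traffic */
    int timeout   = 10;   /* timeout queue, in seconds */

    /* time-based queue: good and bad requests both sit until timeout */
    printf("time-based queue depth: %d\n", (good_rate + bad_rate) * timeout);

    /* score-based queue: good requests drain immediately, so only the
     * attack traffic accumulates until the queue timeout evicts it */
    printf("score-based queue depth: %d\n", bad_rate * timeout);
    return 0;
}

This prints 520 and 500, matching the two scenarios.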

-Patrick
Willy Tarreau
Re: Priority based queuing
May 03, 2018 05:00AM
On Wed, May 02, 2018 at 04:29:33PM -0400, Patrick Hemmer wrote:
> I think you're misunderstanding my design, as scoring wouldn't work like
> this at all. If you give the gold class a score of 1000 (where higher
> number means higher priority), then the only thing that would get
> processed before gold class would be another class that has a score >
> 1000. If nothing does, and a gold request comes in, it gets processed
> first, no matter how big the queue is.
> Think of scoring as instead having multiple queues. Each queue is only
> processed if the queue before it is empty.
(...)

OK you convinced me. Not on everything, but on the fact that we're trying
to address different points. My goal is to make it possible to improve
quality of service between good requests, and your goal is to improve the
classification between good, suspicious, and bad requests. Each of us sees
how to expand our respective model a little bit to address part of the
other one (though less efficiently).

I agree that for your goal, multi-queue is better, but I still maintain
that for my goal, the time-based queue gives better control. The good
news is that the two are orthogonal and 100% compatible.

Basically the queueing system should be redesigned as a list of time-based
trees that are visited in order of priority (or traffic classes, we'll have
to find the proper wording to avoid confusion). If you think you can address
your needs with just a small set of different priorities, probably that we
can implement this with a small array (2-4 queues). If you think you need
more, then we'd rather think about building a composite position value in
the tree made of the priority at the top of the word and the service time
at the bottom of the word. This way, picking the first value will always
find the lowest priority value first. There's one subtlety related to
wrapping time there however, but it can be addressed in two lookups.
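
One possible encoding of such a composite key, as a sketch rather than a
committed design:

#include <stdint.h>

/* priority class in the upper 32 bits, service time in the lower 32 bits:
 * picking the smallest key yields the best class first, and the earliest
 * service time within that class. The time-wrapping subtlety mentioned
 * above is what would force the two-lookup variant in a real tree. */
static uint64_t queue_key(uint32_t prio_class, uint32_t service_time_ms)
{
    return ((uint64_t)prio_class << 32) | service_time_ms;
}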

Please let me know if you'd be fine with designing and implementing
something like this.

Cheers,
Willy
Patrick Hemmer
Re: Priority based queuing
May 03, 2018 05:30AM
On 2018/5/2 16:29, Patrick Hemmer wrote:
>
>
> On 2018/5/2 13:22, Willy Tarreau wrote:
>> On Wed, May 02, 2018 at 12:44:06PM -0400, Patrick Hemmer wrote:
>>> Can you elaborate on what you're thinking of for a time-based queue?
>>>
>>> What I'm imagining you mean is that you would write a rule to set the
>>> max queue time, and haproxy would insert it into the queue sorting on
>>> TIME_NOW() + MAX_QUEUE_TIME. The main difference I see to this approach
>>> vs scoring, is that you ensure that an item doesn't sit on the queue
>>> forever (or whatever `timeout queue` is set to) if higher priority stuff
>>> keeps getting inserted before it.
>> In part, but mostly that under contention you can still maintain a
>> good quality of service. I remember having had a discussion a very
>> long time ago with a gaming site operator explaining that he wanted
>> to send profile management requests to a dedicated slow server so
>> that people filling in their profile do not disturb the ones playing.
>> For me it's a good example : "how long do you think these people are
>> willing to wait for the server to respond to each click to update
>> their profile?". Let's say 2 seconds is OK, you just add a 2s offset
>> to the position in the queue for such requests. If in the mean time
>> other higher priority requests come in, the delayed one may face the
>> 2 seconds delay. But it will not face more. And obviously if no other
>> request is pending before it in the queue it will be served earlier.
>>
>> A similar example is for certain shopping sites which consider that
>> once you have something in your shopping cart, you're much more likely
>> to finish the process and to pay because you've found the item you
>> were looking for, so you're more willing to tolerate a slightly higher
>> response time, and as such provide a better response time to newcomers.
>> But while this small delay can easily be expressed in milliseconds
>> (probably 50/100), it's almost impossible to define it using positions.
>> What if 50 new visitors suddenly come in ? You don't want the guy with
>> his item to suddenly experience a timeout. The time-based queue however
>> grants you a control over the service degradation you're accepting to
>> inflict to your users.
> There's a key difference in our designs. You're trying to classify
> traffic based on destination. I'm trying to classify traffic based on
> source.
> In my design, the maxconn is set to a level such that the server is
> healthy and every request can be processed fast. With this, and a
> score based queue, and under attack, the good requests will be drained
> fast, leaving only the bad requests. Thus no normal user should be
> getting a timeout, no matter what resource they're going to.
> Your design works when the backend system can't keep up with good
> legitimate traffic, and you need to prioritize one good request over
> another good request. My design is for when the backend system can't keep
> up with bad traffic, and so we want to process the good traffic first.
>
>> Another interesting example is when you want to provide strong service
>> guarantees to some top-level visitors while running under intense load.
>> By using only a position, it's easy to say "I want the Gold class to
>> advance by 1000 positions". But if the site is highly loaded and you
>> have 10k pending requests, these 1000 positions remain irrelevant,
>> because the request sits there, waiting for 90% of the pollution to
>> be flushed.
> I think you're misunderstanding my design, as scoring wouldn't work
> like this at all. If you give the gold class a score of 1000 (where
> higher number means higher priority), then the only thing that would
> get processed before gold class would be another class that has a
> score > 1000. If nothing does, and a gold request comes in, it gets
> processed first, no matter how big the queue is.
> Think of scoring as instead having multiple queues. Each queue is only
> processed if the queue before it is empty.
>> If you say "I want these requests to advance by 10 seconds"
>> then no matter how many other requests you have in the queue, it will
>> really advance by 10 seconds and effectively shrink the response time
>> by 10 seconds.
> It may shrink the response time by 10 seconds, but if the normal
> response time is 60 seconds, that's still 50 seconds of waiting. If
> your users are only willing to put up with 30 seconds of wait, this
> means you end up failing *everyone*.
>
>> Overall for user experience it's important to think in time and not
>> positions. The rationale behind this is simple : the user will never
>> accept as a justification for varying quality of service the number of
>> other visitors. And positions just depend on this, it's an average number
>> of people you're allowing to pass over. If I'm granted a pass ticket
>> allowing me to advance 10 places in the cinema queue and I find a
>> crowded street, I will go back home. If I'm told I won't wait more
>> than 5 minutes, I'll come regardless of the number of waiters.
> But a time based queue wouldn't work like this. You can't guarantee
> service within 5 minutes. What happens if a ton of people arrive at
> the same time who were told "you will never wait more than 5 minutes",
> and the cinema can't handle them all? Or if you're the only person in
> the line and nobody leaves the cinema for 6 minutes? Or there are
> already 100 people in line who were told "you will never wait more
> than 1 hour", and that hour is almost up?
>
>>> I don't think this is necessarily a good thing.
>> For having considered the two options several times in the field over the
>> last decade when sites started to crawl, I became very well convinced ;-)
>>
>>> If you're under a DOS
>>> attack, the goal is to get the good requests processed before any
>>> possible malicious requests. With a time based queue, those malicious
>>> requests will still get processed and starve out the good requests.
>> Precisely not that much because the good ones will pass before, and
>> the malicious ones will then be subject to the queue timeout if there
>> are too many.
>>
>>> For
>>> example let's say you're under attack, a bad request comes in with
>>> max_queue_time=1000ms, and then after 999ms elapse, a good request comes
>>> in with max_queue_time=10ms. You have a good request, and a bad request
>>> on the queue, but HAProxy is going to process the bad request first
>>> because its timer is expiring first.
>> Absolutely but you fell into the trap of not considering that the queue
>> is supposed to be full, so you're in fact comparing only two requests,
>> while in practice you have many of them and there the effect is much
>> more visible.
> I don't think so. Server connections are full, not the queue. The
> point I was trying to make is that you have a request you are pretty
> confident is bad, and you waste resources processing it while the
> request you think is good gets to wait. No matter how big the queue
> is, as long as a bad request sits in front of a good request in the
> queue (regardless of whether good request chronologically came after
> bad, or bad request chronologically came after good), this applies.
>>> Essentially if haproxy is receiving
>>> X good requests per second, and Y bad requests per second, it's still
>>> going to forward X good per second, and Y bad per second, to the backend
>>> server. The only difference is that they're time shifted.
>> Except once subject to the queue timeout. It's a very important aspect we
>> must not dismiss.
>>
>>> The other thing you could mean by time-based is to insert into
>>> the queue sorting on MAX_QUEUE_TIME, just like a score-based queue, but
>>> you would still record TIME_NOW() + MAX_QUEUE_TIME, and would reject
>>> requests that don't get processed by their deadline. Essentially a
>>> per-request version of the `timeout queue` setting. But I don't see any
>>> real value in this.
>>>
>>> Or do you mean something else?
>> What I mean is that the queue is indexed on the computed service time
>> which is (current_time + offset), where offset can be either positive or
>> negative, null by default. The queue is sorted by service time, just like
>> any timer in fact. Then the queue collection simply picks the next event
>> in the queue.
>>
>> Please also note that right now the queue already considers the time to
>> maintain fairness between requests targeting a specific server and those
>> who just want any server. This is what ensures no request starves in either
>> the server's queue or the backend's queue. With a time-sorted queue, instead
>> of picking the tasks according to their time of arrival, we'd simply pick
>> them based on the service time, and we could maintain an excellent fairness
>> between servers and backends.
>>
>> Regarding the DOS situations, there are certain aspects which slightly
>> differ from the points around the quality of service. DOS fighting involves
>> sacrifice, and basic quality of service hardly provides this by default. In
>> our case, sacrifice happens on a queue size or time. And the decision should
>> depend on the request's importance, which is often very different from the
>> desired response time. The well known example is the "Pay" button on many
>> sites : you click on it, and if it doesn't respond fast, you will not press
>> Escape. You're keeping fingers crossed hoping it will not return an error,
>> even if it takes 30 seconds. And moreover, you're happy once it responds!
>> Here that's what makes the difference with the response time : in fact you'd
>> like to say that certain requests are highly sensitive but not urgent, and
>> that you'd like to be able to increase their timeout and ultimately get
>> served.
>>
>> To fight DOS it's the same. Commonly, many sites consider that requests
>> holding a valid cookie correspond to authenticated users and score much
>> better in terms of trustability. Thus by having adjustable timeouts, it
>> would make it very easy to adjust the queue timeout for a request based
>> on the presence of a cookie or not.
>>
>> Now when you think about it, a reasonable but simple strategy for some of
>> the examples above could be summarized to this :
>>
>> timeout queue 10s
>> # pay button not urgent but needs to work
>> http-request set-timeout queue 60s if PAY_BUTTON
>> http-request set-priority -10s if PAY_BUTTON
>>
>> # unauthenticated less urgent, can be a DOS
>> http-request set-priority -1s if NO_COOKIE || FAVICON
>>
>> # some elements that need to be delivered quickly
>> http-request set-priority 1s if INDEX_HTML || CSS || JS || ICONS
>>
>> # auto-completion needs to be fast but no problem if it doesn't work
>> http-request set-timeout queue 1s if AUTO_COMPLETION_REQUEST
>> http-request set-priority 10s if AUTO_COMPLETION_REQUEST
>>
>> # inconsistent user-agents are suspicious, delay them just in case
>> http-request set-priority -20s if SUSPICIOUS_USER_AGENT
>>
>> # too fast users need to slow down a little bit
>> http-request set-priority -20s if { sc0_rate gt 10 } || { sc0_err_cnt gt 10 }
>>
>> With a proper API we can even imagine having access to these request
>> properties from Lua to implement far more advanced policies.
>>
>> Regards,
>> Willy
> Here's an example scenario that a time based queue cannot handle:
> Let's say you have a server that can only process 10 requests per
> second. Your normal traffic is 2 requests per second (500ms between
> req). Then you start getting an attack of 50 requests per second from
> SUSPICIOUS_USER_AGENT (20ms between req). The queue will end up with
> 520 requests (50rps*10s + 2rps*10s # the 10s is the timeout). Then an
> INDEX_HTML request comes in. With a `set-priority 1s`, you advance it
> up 1s in the queue, which puts it ahead of 52 requests, but there's
> still 468 requests in front of it, which means it'll be 46.8s before
> it is processed, and with `timeout queue 10s`, it times out. The only
> request which will work is the PAY_BUTTON, and that's because of the
> increased timeout, as all the SUSPICIOUS_USER_AGENT requests in front
> of it will time out first, and at prio -10s, it'll take 20s to complete.
>
> Let's go through this same scenario with a scoring system. Let's assume
> that in `set-priority Xs`, the "Xs" becomes "X", where positive means
> higher priority.
> The queue will have 500 requests pending (this is different from the
> 520 above; see the end for explanation). Then an INDEX_HTML request
> comes in. With `set-priority 1`, because all the SUSPICIOUS_USER_AGENT
> requests have priority -20, the INDEX_HTML goes to the front of the
> queue and gets processed as soon as a connection is freed. Let's also
> assume a NO_COOKIE comes in, with priority -1, it'll go right after
> the INDEX_HTML request. And then because SUSPICIOUS_USER_AGENT is
> constantly at the end of the queue, all the good requests are
> processed in near-normal time. The only wait is for a connection slot
> to open up. And if the normal traffic rate is 2 requests per second
> and server can handle 10 per second, there should never be more than 1
> good request in the queue (except if they come in at the same time).
> This is also why at the beginning, the queue only has 500 requests
> pending (the only thing in the queue is SUSPICIOUS_USER_AGENT, so
> 50rps*10s=500).
>
> -Patrick
>
>
After re-reading this, and thinking through your design again, I think
my understanding of your time-based queuing proposal, and thus my
example for it, was wrong.
The earliest SUSPICIOUS_USER_AGENT request would come in at T=0s, and so
it would go in the queue with key=20 (key = T - priority). Then when
INDEX_HTML comes in at T=10s (the oldest SUSPICIOUS_USER_AGENT is timing
out), it goes into the queue with key=9, which sorts before key=20.

This is better, but still problematic. Let's say you had a rule that
did `set-priority -15s`. Then at T=10s, it would get queued at key=25s.
The admin has judged it to be more important than SUSPICIOUS_USER_AGENT
(prio -15s vs. prio -20s), but the server is still going to waste time
processing all those SUSPICIOUS_USER_AGENT requests.

While this is not what I would desire, I can see there might be a use
case for it. So how about something that would support both:

# insert into queue with time-based key
http-request set-priority date(20) if SUSPICIOUS_USER_AGENT

and

# insert into queue with a relative key
http-request set-priority int(20) if SUSPICIOUS_USER_AGENT

The queue is processed in ascending order, so this should handle both.
I would also propose a sample fetch for retrieving, as well as a Lua
method for retrieving & setting, so that you can perform relative
adjustments to it. E.g.:

http-request set-priority priority,add(5) if FOO


-Patrick
Patrick Hemmer
Re: Priority based queuing
May 05, 2018 01:00AM
On 2018/5/2 22:56, Willy Tarreau wrote:
> On Wed, May 02, 2018 at 04:29:33PM -0400, Patrick Hemmer wrote:
>> I think you're misunderstanding my design, as scoring wouldn't work like
>> this at all. If you give the gold class a score of 1000 (where higher
>> number means higher priority), then the only thing that would get
>> processed before gold class would be another class that has a score >
>> 1000. If nothing does, and a gold request comes in, it gets processed
>> first, no matter how big the queue is.
>> Think of scoring as instead having multiple queues. Each queue is only
>> processed if the queue before it is empty.
> (...)
>
> OK you convinced me. Not on everything, but on the fact that we're trying
> to address different points. My goal is to make it possible to improve
> quality of service between good requests, and your goal is to improve the
> classification between good, suspicious, and bad requests. Each of us sees
> how to expand our respective model a little bit to address part of the
> other one (though less efficiently).
>
> I agree that for your goal, multi-queue is better, but I still maintain
> that for my goal, the time-based queue gives better control. The good
> news is that the two are orthogonal and 100% compatible.
>
> Basically the queueing system should be redesigned as a list of time-based
> trees that are visited in order of priority (or traffic classes, we'll have
> to find the proper wording to avoid confusion). If you think you can address
> your needs with just a small set of different priorities, we can probably
> implement this with a small array (2-4 queues). If you think you need
> more, then we'd rather think about building a composite position value in
> the tree made of the priority at the top of the word and the service time
> at the bottom of the word. This way, picking the first value will always
> find the lowest priority value first. There's one subtlety related to
> wrapping time there however, but it can be addressed in two lookups.
>
> Please let me know if you'd be fine with designing and implementing
> something like this.
>
> Cheers,
> Willy

I'm not quite following the need for multiple queues. Why wouldn't you
just have one sorted queue, where if multiple pending requests have the
same priority, then they're FIFO?

I had sent another email which proposed an implementation that would
satisfy both designs
(https://www.mail-archive.com/[email protected]/msg29876.html).
Unfortunately I was composing it when you sent your most recent reply,
so I didn't see it before I sent.
I went ahead and implemented the functionality I had talked about in
that other email, and have attached a WIP patch. I've done some testing
on it, and it seems to work fine. The syntax is exactly as proposed:

# time-based priority
http-request set-priority date(20) if LOW_PRIORITY
http-request set-priority date(-20) if HIGH_PRIORITY

# score-based priority
http-request set-priority int(20) if LOW_PRIORITY
http-request set-priority int(-20) if HIGH_PRIORITY

Some notes on the implementation:

I used a `long` for tracking the priority. Obviously this has
limitations when used with date(), as the timestamps are fairly large. I
can change this to `long long`.

The queue was implemented using the existing pendconns code. The only
thing that was changed was to insert in sorted order. Obviously this is
just a quick and dirty implementation. There are lots of priority queue
implementations to choose from, and I don't know which one is best
suited to this kind of task. We could keep the existing sorted
linked-list, and just do some optimizations for insertion. Or we could
switch to some sort of tree. The linked-list has the advantage that I
suspect that the vast majority of additions will append to the end, and
pops are always off the front, so well suited to a linked list. A tree
might become very unbalanced very fast, and require lots of maintenance.

Using the date() fetch obviously doesn't provide good precision. We
could add a `datems()`, `datens()`, or `ticks()` (ms since start) sample
fetch. But obviously we would need to use `long long` for that, though
ticks could use a `long`, but would wrap at around 16 months.

I still plan on adding a sample fetch for obtaining the current priority
value, and a Lua function for setting it. Not sure if we want a Lua
function for retrieving, or if using the fetch is good enough.

-Patrick

From 4783dccfc869cfb4219d195ba21f896272211c9e Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 4 May 2018 16:31:16 -0400
Subject: [PATCH] MINOR: add {http,tcp}-request set-priority for queue
prioritization

This adds the set-priority action to http-request and tcp-request content.
The priority value is used when connections are queued to determine which connections should be served first. The lowest integer value is served first (akin to the 'nice' parameter).
---
include/types/stream.h | 1 +
src/queue.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
src/stream.c | 1 +
3 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..8a2eb8beb 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,7 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ long priority; /* the priority of the stream in the pending queue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

diff --git a/src/queue.c b/src/queue.c
index 1c730c75c..1a2cea693 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -15,11 +15,14 @@
#include <common/time.h>
#include <common/hathreads.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -195,7 +198,7 @@ void process_srv_queue(struct server *s)
*/
struct pendconn *pendconn_add(struct stream *strm)
{
- struct pendconn *p;
+ struct pendconn *p, *pn;
struct proxy *px;
struct server *srv;

@@ -219,7 +222,10 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.srv_queue_size += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
- LIST_ADDQ(&srv->pendconns, &p->list);
+ list_for_each_entry(pn, &srv->pendconns, list)
+ if (p->strm->priority < pn->strm->priority)
+ break;
+ LIST_ADDQ(&pn->list, &p->list);
HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
}
else {
@@ -228,7 +234,10 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.prx_queue_size += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
- LIST_ADDQ(&px->pendconns, &p->list);
+ list_for_each_entry(pn, &px->pendconns, list)
+ if (p->strm->priority < pn->strm->priority)
+ break;
+ LIST_ADDQ(&pn->list, &p->list);
HA_SPIN_UNLOCK(PROXY_LOCK, &px->lock);
}
HA_ATOMIC_ADD(&px->totpend, 1);
@@ -392,6 +401,64 @@ void pendconn_free(struct pendconn *p)
pool_free(pool_head_pendconn, p);
}

+static enum act_return action_set_priority(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (smp)
+ s->priority = smp->data.u.sint;
+
+ return ACT_RET_CONT;
+}
+
+
+static enum act_parse_ret parse_set_priority(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority", parse_set_priority },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority", parse_set_priority },
+ { /* END */ }
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/stream.c b/src/stream.c
index 1d0b22ca3..9d4bc0bbb 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -220,6 +220,7 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3
Willy Tarreau
Re: Priority based queuing
May 05, 2018 07:40AM
On Fri, May 04, 2018 at 06:49:00PM -0400, Patrick Hemmer wrote:
> I'm not quite following the need for multiple queues. Why wouldn't you
> just have one sorted queue, where if multiple pending requests have the
> same priority, then they're FIFO.

That's what the time-ordered queue does. I proposed this so that you can
ensure that lower priorities are always processed first until there is no
more of the same level.

> I had sent another email which proposed an implementation that would
> satisfy both designs
> (https://www.mail-archive.com/[email protected]/msg29876.html).
> Unfortunately I was composing it when you sent your most recent reply,
> so I didn't see it before I sent.

Thanks, I missed this one indeed.

> I went ahead and implemented the functionality I had talked about in
> that other email, and have attached a WIP patch. I've done some testing
> on it, and it seems to work fine. The syntax is exactly as proposed:
>
> # time-based priority
> http-request set-priority date(20) if LOW_PRIORITY
> http-request set-priority date(-20) if HIGH_PRIORITY
>
> # score-based priority
> http-request set-priority int(20) if LOW_PRIORITY
> http-request set-priority int(-20) if HIGH_PRIORITY
>
> Some notes on the implementation:
>
> I used a `long` for tracking the priority. Obviously this has
> limitations when used with date(), as the timestamps are fairly large. I
> can change this to `long long`.

Ah now I understand how you wanted to implement it. I have a big concern
here with wrapping time however, which is what I explained in an earlier
mail where I said that if needed we could mix date and priority in the
same integer but that would still require two lookups. Our date wraps
every 49.7 days (32-bit millisecond resolution) so depending on the
phases of the moon some requests will be processed before the highest
priority ones or after, and you'll see some priority inversion 7 times
a year.
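
To make that failure mode concrete: ordering two wrapped 32-bit ticks is
only well-defined within half the wrapping period. A minimal sketch of the
idea (not HAProxy's actual tick macros):

#include <stdint.h>

/* "a happened before b" only holds while a and b are less than 2^31 ms
 * (~24.8 days) apart; beyond that the comparison silently inverts,
 * which is exactly the priority inversion described above.
 */
static inline int tick_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}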

> The queue was implemented using the existing pendconns code. The only
> thing that was changed was to insert in sorted order. Obviously this is
> just a quick and dirty implementation. There are lots of priority queue
> implementations to choose from, and I don't know which one is best
> suited to this kind of task. We could keep the existing sorted
> linked-list, and just do some optimizations for insertion. Or we could
> switch to some sort of tree. The linked-list has the advantage in that I
> suspect that the vast majority of additions will append to the end, and
> pops are always off the front, so well suited to a linked list. A tree
> might become very unbalanced very fast, and require lots of maintenance.

No, trees are quite cheap, we use them absolutely everywhere now. Just
take a look at wake_expired_tasks() to get an idea. In fact every single
time we've been relying on lists doing linear lookups, we've got several
bug reports about haproxy eating all the CPU because some people were
using it in a way that wasn't initially considered as reasonable but
which made a lot of sense to them. In the case of the list above, the
corner case would be server with "maxconn 5000" and half of the requests
having a high priority and the other half having a low one, resulting in
an average of 2500 visits in the optimal case where you can insert from
both sides (which was what we used to do a very long time ago in the
scheduler).

> Using the date() fetch obviously doesn't provide good precision.

Hmmm, good point: here it's seconds, so it will not wrap for the next 20
years but will provide a useless precision.

> We
> could add a `datems()`, `datens()`, or `ticks()` (ms since start) sample
> fetch. But obviously we would need to use `long long` for that, though
> ticks could use a `long`, which would wrap at around 16 months.

No, 49.7 days ;-)

The other problem I'm having here is that it artificially mixes two types
of totally unrelated data and relies on the lack of intersection between
their domains (when time doesn't wrap). Indeed, here you're considering
that values below a certain number cannot be dates so they definitely are
fixed priorities. But that also means that people who want to use ranges
will randomly collide with dates and will observe a changing behaviour
along the year.

In this case you'd be better off with the principle consisting in using
a larger integer with the priority at the top and the date at the bottom.
But that's where I said that it would require two lookups to deal with
wrapping date so I'm not sure it's any better than using multiple queues.
Well, at least it's less maintenance burden. The detailed principle is
the following :

uint64_t priority :

63 32 31 0
[ hard priority | date ]

next = lookup_first(queue);
if (!next)
return;

=> next is the lowest value of both priority and date

Then you perform a second lookup with the oldest past date for the same
priority level :

key = (next->key & 0xffffffff00000000ULL) | (now_ms - TIMER_LOOK_BACK);
next2 = lookup_ge(queue, key);
if (next2 && !((next2->key ^ next->key) >> 32))
next=next2

Then you dequeue next which is always the next one.

That's where I thought that using two distinct levels was simpler, but
if we factor in the fact that it would limit the number of priority
classes and would require more maintenance code, it's probably not
worth the hassle to save the 4 extra lines we have above to respect
the wrapping date.
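
Spelled out, that two-lookup dequeue could look like the following sketch
(it assumes ebtree's eb64 API with a ge-lookup similar to eb32_lookup_ge,
and the usual now_ms / TIMER_LOOK_BACK globals; not final code):

#include <stdint.h>
#include <eb64tree.h>

/* Sketch: tree keyed as (hard priority << 32) | enqueue date. Returns
 * the node to dequeue, honouring the wrapping date within a priority.
 */
static struct eb64_node *queue_pick_next(struct eb_root *queue)
{
	struct eb64_node *next, *next2;
	uint64_t key;

	next = eb64_first(queue);	/* lowest (priority, date) pair */
	if (!next)
		return NULL;

	/* second lookup: oldest entry of the same priority level,
	 * starting from the oldest possible past date */
	key = (next->key & 0xffffffff00000000ULL) |
	      (uint32_t)(now_ms - TIMER_LOOK_BACK);
	next2 = eb64_lookup_ge(queue, key);
	if (next2 && !((next2->key ^ next->key) >> 32))
		next = next2;

	return next;
}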

Also I'm thinking that we can even use 32-bit by putting the frontier
between the date and the fixed priority (let's call it class) somewhere
else :
- 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
- 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
- 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets

I'm pretty sure that the two first options are way enough to cover almost
everyone's needs. Let's say for a moment that we'd use the second option
(12+20 bits), the queue insertion code becomes this :

pendconn->node.key = (stream->prio_class << 20) +
((now_ms + stream->prio_offset) & 0xFFFFF);
if (srv)
eb32_insert(&srv->pendconns, &pendconn->node);
else
eb32_insert(&px->pendconns, &pendconn->node);


And the dequeuing code in pendconn_process_next_strm() which takes care
both of the backend's and server's queues while still respecting their
classes would become (modulo locking, which is not trivial there but
which has to be preserved regardless of the solution) :

srv_node = eb32_first(&srv->pendconns);
prx_node = eb32_first(&prx->pendconns);
if (!srv_node)
return prx_node;
if (!prx_node)
return srv_node;

if (srv_node->key >> 20 < prx_node->key >> 20)
return srv_node;

if (srv_node->key >> 20 > prx_node->key >> 20)
return prx_node;

/* classes are equal, pick oldest expiration first */
key = (srv_node->key & 0xfff00000) +
((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xFFFFF);

srv_node2 = eb32_lookup_ge(&srv->pendconns, key);
if (srv_node2 && !((srv_node2->key ^ srv_node->key) >> 20))
srv_node = srv_node2;

prx_node2 = eb32_lookup_ge(&prx->pendconns, key);
if (prx_node2 && !((prx_node2->key ^ prx_node->key) >> 20))
prx_node = prx_node2;

/* take the first one in time order */
if (((prx_node->key - srv_node->key) & 0xFFFFF) < 0x80000)
return srv_node;
else
return prx_node;

All of this looks fairly reasonable to me and could possibly remove a
bit of the current complexity in the current function which has to
dereference the streams to find the queuing dates.

Then the documentation becomes pretty simple :
- all requests from the lowest numbered priority classes are always
processed first
- within a same priority class, requests are processed in order of
arrival, to which a positive or negative time offset can be applied

We could then have this in http-request and tcp-request :

http-request set-priority-class int(x)
http-request set-priority-offset int(x)
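
As a purely illustrative configuration sketch of these two actions (the
ACL names here are invented, not part of the proposal):

backend app
    timeout queue 30s

    # classes: valid application sessions always dequeue before
    # traffic coming from cloud providers
    http-request set-priority-class int(-5) if HAS_APP_COOKIE
    http-request set-priority-class int(5) if FROM_CLOUD_PROVIDER

    # within a class, treat bulk downloads as if they had arrived
    # two seconds later than they really did
    http-request set-priority-offset int(2000) if BULK_DOWNLOAD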

BTW you were right to remind me about using a sample expression instead
of a static integer in the argument. I've been used to using static values
everywhere for years but for such cases, being able to use expressions
is much more useful as we could extract the value from a header or cookie
for example.

Willy
Patrick Hemmer
Re: Priority based queuing
May 05, 2018 07:40PM
On 2018/5/5 01:29, Willy Tarreau wrote:
> On Fri, May 04, 2018 at 06:49:00PM -0400, Patrick Hemmer wrote:
>> I'm not quite following the need for multiple queues. Why wouldn't you
>> just have one sorted queue, where if multiple pending requests have the
>> same priority, then they're FIFO.
> That's what the time-ordered queue does. I proposed this so that you can
> ensure that lower priorities are always processed first until there is no
> more of the same level.
Ah, I was considering the 2 solutions as an either-or choice. Not that
you'd use both in the same configuration, or at least in the same backend.

>> The queue was implemented using the existing pendconns code. The only
>> thing that was changed was to insert in sorted order. Obviously this is
>> just a quick and dirty implementation. There are lots of priority queue
>> implementations to choose from, and I don't know which one is best
>> suited to this kind of task. We could keep the existing sorted
>> linked-list, and just do some optimizations for insertion. Or we could
>> switch to some sort of tree. The linked-list has the advantage in that I
>> suspect that the vast majority of additions will append to the end, and
>> pops are always off the front, so well suited to a linked list. A tree
>> might become very unbalanced very fast, and require lots of maintenance.
> No, trees are quite cheap, we use them absolutely everywhere now. Just
> take a look at wake_expired_tasks() to get an idea. In fact every single
> time we've been relying on lists doing linear lookups, we've got several
> bug reports about haproxy eating all the CPU because some people were
> using it in a way that wasn't initially considered as reasonable but
> which made a lot of sense to them.
This is why I liked the idea of a single implementation. It gave the
flexibility to the user to do whatever they wanted (time or class
based). Granted I was only thinking of doing one or the other, but the
user could still do both, they would just have to manage dedicating a
range to the date/class. Maybe someone would want to consider date
before class.

> In this case you'd be better off with the principle consisting in using
> a larger integer with the priority at the top and the date at the bottom.
> But that's where I said that it would require two lookups to deal with
> wrapping date so I'm not sure it's any better than using multiple queues.
> Well, at least it's less maintenance burden. The detailed principle is
> the following :
>
> uint64_t priority :
>
> 63 32 31 0
> [ hard priority | date ]
>
> next = lookup_first(queue);
> if (!next)
> return;
>
> => next is the lowest value of both priority and date
>
> Then you perform a second lookup with the oldest past date for the same
> priority level :
>
> key = (next->key & 0xffffffff00000000ULL) | (now_ms - TIMER_LOOK_BACK);
> next2 = lookup_ge(queue, key);
> if (next2 && !((next2->key ^ next->key) >> 32))
> next=next2
>
> Then you dequeue next which is always the next one.
>
> That's where I thought that using two distinct levels was simpler, but
> if we factor in the fact that it would limit the number of priority
> classes and would require more maintenance code, it's probably not
> worth the hassle to save the 4 extra lines we have above to respect
> the wrapping date.
>
> Also I'm thinking that we can even use 32-bit by putting the frontier
> between the date and the fixed priority (let's call it class) somewhere
> else :
> - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
> - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
> - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
What's the benefit to using 32-bit over 64-bit, and is that benefit
worth the cost of the added code complexity and processing time? If we
used 64-bit, we could do a 16/48 bit split and have an absolute
timestamp out to year 10889 at millisecond precision.

> I'm pretty sure that the two first options are way enough to cover almost
> everyone's needs.
For me a 12-bit class should work. My intention is to implement
something similar to email spam filtering, where you generate a score
based on the rules that match, and some rules are worth more than
others. 8 bits might be a bit difficult to work within.
I also suspect the 16-bit option (32 second offset) is too small for
date based.
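
A hypothetical sketch of such a scoring setup (the rule names are
invented, and it assumes the arithmetic converters accept variable
arguments):

http-request set-var(req.prio) int(0)
http-request set-var(req.prio) var(req.prio),add(3) if SUSPICIOUS_UA
http-request set-var(req.prio) var(req.prio),add(5) if FROM_CLOUD_PROVIDER
http-request set-var(req.prio) var(req.prio),sub(8) if HAS_APP_COOKIE
http-request set-priority-class var(req.prio)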

> Let's say for a moment that we'd use the second option
> (12+20 bits), the queue insertion code becomes this :
>
> pendconn->node.key = (stream->prio_class << 20) +
> ((now_ms + stream->prio_offset) & 0xFFFFF);
> if (srv)
> eb32_insert(&srv->pendconns, &pendconn->node);
> else
> eb32_insert(&px->pendconns, &pendconn->node);
>
>
> And the dequeuing code in pendconn_process_next_strm() which takes care
> both of the backend's and server's queues while still respecting their
> classes would become (modulo locking, which is not trivial there but
> which has to be preserved regardless of the solution) :
>
> srv_node = eb32_first(&srv->pendconns);
> prx_node = eb32_first(&prx->pendconns);
> if (!srv_node)
> return prx_node;
> if (!prx_node)
> return srv_node;
>
> if (srv_node->key >> 20 < prx_node->key >> 20)
> return srv_node;
>
> if (srv_node->key >> 20 > prx_node->key >> 20)
> return prx_node;
>
> /* classes are equal, pick oldest expiration first */
> key = (srv_node->key & 0xfff00000) +
> ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xFFFFF);
>
> srv_node2 = eb32_lookup_ge(&srv->pendconns, key);
> if (srv_node2 && !((srv_node2->key ^ srv_node->key) >> 20))
> srv_node = srv_node2;
>
> prx_node2 = eb32_lookup_ge(&prx->pendconns, key);
> if (prx_node2 && !((prx_node2->key ^ prx_node->key) >> 20))
> prx_node = prx_node2;
>
> /* take the first one in time order */
> if (((prx_node->key - srv_node->key) & 0xFFFFF) < 0x80000)
> return srv_node;
> else
> return prx_node;
>
> All of this looks fairly reasonable to me and could possibly remove a
> bit of the current complexity in the current function which has to
> dereference the streams to find the queuing dates.
>
> Then the documentation becomes pretty simple :
> - all requests from the lowest numbered priority classes are always
> processed first
> - within a same priority class, requests are processed in order of
> arrival, to which a positive or negative time offset can be applied
>
> We could then have this in http-request and tcp-request :
>
> http-request set-priority-class int(x)
> http-request set-priority-offset int(x)
>
>
> BTW you were right to remind me about using a sample expression instead
> of a static integer in the argument. I've been used to use static values
> everywhere for years but for such cases, being able to use expressions
> is much more useful as we could extract the value from a header or cookie
> for example.
>
> Willy

The rest of this all seems fine. I've got no other thoughts to add.

So the question is, which route do we go?
1. 32-bit int (12/20 split)
2. 64-bit int (32/32 split)
3. 64-bit int (16/48 split) with absolute timestamps
4. 64-bit int with a single `set-priority` action. Let the user use a
new `ms()` fetch for time based. The user does the bit shifting if they
want to use both (using the `mul()` converter, or we can add a `shift()`).

-Patrick
Willy Tarreau
Re: Priority based queuing
May 05, 2018 08:00PM
On Sat, May 05, 2018 at 01:33:51PM -0400, Patrick Hemmer wrote:
> > Also I'm thinking that we can even use 32-bit by putting the frontier
> > between the date and the fixed priority (let's call it class) somewhere
> > else :
> > - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
> > - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
> > - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
> What's the benefit to using 32-bit over 64-bit, and is that benefit
> worth the cost of the added code complexity and processing time?

Slightly smaller memory footprint (doesn't count much), but more
importantly much smaller processing time especially on 32-bit systems.

> If we
> used 64-bit, we could do a 16/48 bit split and have an absolute
> timestamp out to year 10889 at millisecond precision.

Not really because that would force us to implement yet another clock
and then to have two distinct set of clocks for queues and for all other
internal events and timeouts (including the queue timeout!). That would
really be the beginning of a mess. We've been running with 32-bit
millisecond-precise ticks for over a decade without ever a complain that
it was not enough nor that the wrapping at 49.7 days was an issue. If
one day we'd decide to change this, it would be to support finer-grained
timestamps (eg: nanoseconds or CPU cycles) which would not necessarily
protect us from wrapping either.

> > I'm pretty sure that the two first options are way enough to cover almost
> > everyone's needs.
> For me a 12-bit class should work. My intention is to implement
> something similar to email spam filtering, where you generate a score
> based on the rules that match, and some rules are worth more than
> others. 8 bits might be a bit difficult to work within.

Makes sense.

> I also suspect the 16-bit option (32 second offset) is too small for
> date based.

Yes I agree. I'm pretty sure that for some TCP connections certain users
will want to have a long delay (eg: wait a minute before your FTP transfer
starts).

> The rest of this all seems fine. I've got no other thoughts to add.

Great, thanks!

> So the question is, which route do we go?
> 1. 32-bit int (12/20 split)
> 2. 64-bit int (32/32 split)

I think you can start with the 12/20 split first, knowing that if
anybody ever requests an extension of either time or classes we can
switch to 64 without much effort. Until this happens we'll be more
efficient on small systems. Conversely we can't do it the other way
around because we could break some setups :-)

> 3. 64-bit int (16/48 split) with absolute timestamps

I don't like it much for the reasons above.

> 4. 64-bit int with a single `set-priority` action. Let the user use a
> new `ms()` fetch for time based. The user does the bit shifting if they
> want to use both (using the `mul()` converter, or we can add a `shift()`).

I think this level of complexity ought to be done once in the code and not
each and every time for all users. It would only be manageable for a number
of the people on this list, but newcomers would call us names for this :-)

Thanks,
Willy
Patrick Hemmer
[PATCH 0/2] Re: Priority based queuing
May 10, 2018 07:40AM
On 2018/5/5 13:55, Willy Tarreau wrote:
> On Sat, May 05, 2018 at 01:33:51PM -0400, Patrick Hemmer wrote:
>>> Also I'm thinking that we can even use 32-bit by putting the frontier
>>> between the date and the fixed priority (let's call it class) somewhere
>>> else :
>>> - 8 bit class + 24 bit offset => 256 classes and +/- 2.3 hours offsets
>>> - 12 bit class + 20 bit offset => 4096 classes and +/- 8 minutes offsets
>>> - 16 bit class + 16 bit offset => 64k classes and +/- 32 second offsets
>> What's the benefit to using 32-bit over 64-bit, and is that benefit
>> worth the cost of the added code complexity and processing time?
>
> Slightly smaller memory footprint (doesn't count much), but more
> importantly much smaller processing time especially on 32-bit systems.
I would have expected the extra processing for wrapping to negate the
benefit.

>
>> If we
>> used 64-bit, we could do a 16/48 bit split and have an absolute
>> timestamp out to year 10889 at millisecond precision.
>
> Not really because that would force us to implement yet another clock
> and then to have two distinct set of clocks for queues and for all other
> internal events and timeouts (including the queue timeout!).
We already have the `now` and `date` variables. The priority is only
used for the queue key, and not to replace anything else. The queue key
is already getting replaced, so it would be no additional work.

> That would
> really be the beginning of a mess. We've been running with 32-bit
> millisecond-precise ticks for over a decade without ever a complain that
> it was not enough nor that the wrapping at 49.7 days was an issue. If
> one day we'd decide to change this, it would be to support finer-grained
> timestamps (eg: nanoseconds or CPU cycles) which would not necessarily
> protect us from wrapping either.
>
>>> I'm pretty sure that the two first options are way enough to cover almost
>>> everyone's needs.
>> For me a 12-bit class should work. My intention is to implement
>> something similar to email spam filtering, where you generate a score
>> based on the rules that match, and some rules are worth more than
>> others. 8 bits might be a bit difficult to work within.
>
> Makes sense.
>
>> I also suspect the 16-bit option (32 second offset) is too small for
>> date based.
>
> Yes I agree. I'm pretty sure that for some TCP connections certain users
> will want to have a long delay (eg: wait a minute before your FTP transfer
> starts).
>
>> The rest of this all seems fine. I've got no other thoughts to add.
>
> Great, thanks!
>
>> So the question is, which route do we go?
>> 1. 32-bit int (12/20 split)
>> 2. 64-bit int (32/32 split)
>
> I think you can start with the 12/20 split first, knowing that if
> anybody ever requests an extension of either time or classes we can
> switch to 64 without much effort. Until this happens we'll be more
> efficient on small systems. Conversely we can't do it the other way
> around because we could break some setups :-)
>
>> 3. 64-bit int (16/48 split) with absolute timestamps
>
> I don't like it much for the reasons above.
>
>> 4. 64-bit int with a single `set-priority` action. Let the user use a
>> new `ms()` fetch for time based. The user does the bit shifting if they
>> want to use both (using the `mul()` converter, or we can add a `shift()`).
>
> I think this level of complexity ought to be done once in the code and not
> each and every time for all users. It would only be manageable for a number
> of the people on this list, but newcomers would call us names for this :-)
>
> Thanks,
> Willy

Ok, so this patch set includes a fully functional implementation of the
priority class & offset. Unfortunately handling the offset got really
complicated due to the wrapping & locking.
There might be a little room for improvement, but I don't think much.

I'm still working on the code, but I wanted to submit this for review as
the final version probably won't be much different.

The second patch is a test I wrote to try and catch all the different
scenarios. I'm including in case you were inclined to play with the tree
search algorithm and test it. It's not intended to be merged.

Things to do:
* Try and optimize the macros & tree search.
* Output a warning if `timeout queue` > max offset.
* Add LUA methods & sample fetches.
* General cleanup.
* Docs

-Patrick


Patrick Hemmer (2):
MEDIUM: add set-priority-class and set-priority-offset
tests for queue set-priority

Makefile.test | 21 ++++
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
tests/test-queue.cfg | 8 ++
tests/test_queue.c | 117 +++++++++++++++++++++
11 files changed, 396 insertions(+), 44 deletions(-)
create mode 100644 Makefile.test
create mode 100644 tests/test-queue.cfg
create mode 100644 tests/test_queue.c

--
2.16.3
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content.
The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first.
---
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
8 files changed, 250 insertions(+), 44 deletions(-)
Patrick Hemmer
[PATCH 2/2] tests for queue set-priority
May 10, 2018 07:40AM
---
Makefile.test | 21 +++++++++
tests/test-queue.cfg | 8 ++++
tests/test_queue.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+)
On 2018/5/10 01:35, Patrick Hemmer wrote:
> This adds the set-priority-class and set-priority-offset actions to
> http-request and tcp-request content.
> The priority values are used when connections are queued to determine
> which connections should be served first. The lowest priority class is
> served first. When multiple requests from the same class are found, the
> earliest (according to queue_time + offset) is served first.
> ---
> include/types/proxy.h | 2 +-
> include/types/queue.h | 2 +-
> include/types/server.h | 2 +-
> include/types/stream.h | 2 +
> src/proxy.c | 2 +-
> src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
> src/server.c | 2 +-
> src/stream.c | 2 +
> 8 files changed, 250 insertions(+), 44 deletions(-)
>
>
Noticed I had some uncommitted changes. Not much, but previous version
might not work right.

-Patrick
From 240a54488c01c4b0a15a561b8e7533922a487492 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 4 May 2018 16:31:16 -0400
Subject: [PATCH] MEDIUM: add set-priority-class and set-priority-offset

This adds the set-priority-class and set-priority-offset actions to http-request and tcp-request content.
The priority values are used when connections are queued to determine which connections should be served first. The lowest priority class is served first. When multiple requests from the same class are found, the earliest (according to queue_time + offset) is served first.
---
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
include/types/stream.h | 2 +
src/hlua.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 280 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 2 +
9 files changed, 252 insertions(+), 46 deletions(-)

diff --git a/include/types/proxy.h b/include/types/proxy.h
index 16c13a1c1..2cc1dfd9e 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -322,7 +322,7 @@ struct proxy {
int serverfin; /* timeout to apply to server half-closed connections */
} timeout;
char *id, *desc; /* proxy id (name) and description */
- struct list pendconns; /* pending connections with no server assigned yet */
+ struct eb_root pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
diff --git a/include/types/queue.h b/include/types/queue.h
index 42dbbd047..03377da69 100644
--- a/include/types/queue.h
+++ b/include/types/queue.h
@@ -35,7 +35,7 @@ struct pendconn {
struct stream *strm;
struct proxy *px;
struct server *srv; /* the server we are waiting for, may be NULL */
- struct list list; /* next pendconn */
+ struct eb32_node node;
__decl_hathreads(HA_SPINLOCK_T lock);
};

diff --git a/include/types/server.h b/include/types/server.h
index 0cd20c096..5339c911e 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -213,7 +213,7 @@ struct server {
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */

- struct list pendconns; /* pending connections */
+ struct eb_root pendconns; /* pending connections */
struct list actconns; /* active connections */
struct list *priv_conns; /* private idle connections attached to stream interfaces */
struct list *idle_conns; /* sharable idle connections attached or not to a stream interface */
diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..489a0b648 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,8 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ int16_t priority_class; /* priority class of the stream for the pending queue */
+ int32_t priority_offset; /* priority offset of the stream for the pending queue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

diff --git a/src/hlua.c b/src/hlua.c
index 65fa62ff6..1df8edfd3 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -7859,7 +7859,7 @@ void hlua_init(void)
socket_tcp.proxy = &socket_proxy;
socket_tcp.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_tcp.actconns);
- LIST_INIT(&socket_tcp.pendconns);
+ socket_tcp.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
@@ -7905,7 +7905,7 @@ void hlua_init(void)
socket_ssl.proxy = &socket_proxy;
socket_ssl.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_ssl.actconns);
- LIST_INIT(&socket_ssl.pendconns);
+ socket_ssl.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
diff --git a/src/proxy.c b/src/proxy.c
index 31253f14d..63a076855 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -726,7 +726,7 @@ void init_new_proxy(struct proxy *p)
{
memset(p, 0, sizeof(struct proxy));
p->obj_type = OBJ_TYPE_PROXY;
- LIST_INIT(&p->pendconns);
+ p->pendconns = EB_ROOT;
LIST_INIT(&p->acl);
LIST_INIT(&p->http_req_rules);
LIST_INIT(&p->http_res_rules);
diff --git a/src/queue.c b/src/queue.c
index 1c730c75c..0174ddc75 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -14,12 +14,16 @@
#include <common/memory.h>
#include <common/time.h>
#include <common/hathreads.h>
+#include <eb32tree.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -76,8 +80,69 @@ static void pendconn_unlink(struct pendconn *p)
else
p->px->nbpend--;
HA_ATOMIC_SUB(&p->px->totpend, 1);
- LIST_DEL(&p->list);
- LIST_INIT(&p->list);
+ eb32_delete(&p->node);
+}
+
+#define offset_boundary() ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xfffff)
+#define key_class(key) ((key) & 0xfff00000)
+#define key_offset(key) ((key) & 0xfffff)
+#define key_class_boundary(key) (key_class(key) | offset_boundary())
+static u32 key_incr(u32 key) {
+ u32 key_next;
+
+ key_next = key + 1;
+ if (key_class(key_next) != key_class(key))
+ key_next = key_class(key_next);
+ else if (key_next == key_class_boundary(key))
+ key_next += 0x100000;
+
+ return key_next;
+}
+
+static struct pendconn *next_pendconn(struct eb_root *pendconns, u32 min) {
+ struct eb32_node *node, *node2 = NULL;
+ u32 max;
+
+ // min is inclusive
+ // max is exclusive
+ max = key_class_boundary(min);
+
+ node = eb32_lookup_ge(pendconns, min);
+
+ if (node) {
+ if (node->key < max || (max <= min && key_class(node->key) == key_class(min)))
+ return eb32_entry(node, struct pendconn, node);
+ if (key_class(node->key) != key_class(min))
+ node2 = node;
+ if (max > min)
+ goto class_next;
+ }
+
+ if (max <= min)
+ node = eb32_lookup_ge(pendconns, key_class(min));
+ if (!node)
+ return NULL;
+ if (node->key < max && node->key < min)
+ return eb32_entry(node, struct pendconn, node);
+
+class_next:
+ if (node2) {
+ min = key_class_boundary(node2->key);
+ if (node2->key >= min)
+ return eb32_entry(node2, struct pendconn, node);
+ } else
+ min = key_class_boundary(min) + 0x100000;
+ node = eb32_lookup_ge(pendconns, min);
+ if (node && key_class(node->key) == key_class(min))
+ return eb32_entry(node, struct pendconn, node);
+ if (node2)
+ return eb32_entry(node2, struct pendconn, node);
+ // theoretically this should never find anything we haven't already found
+ //node = eb32_lookup_ge(pendconns, key_class(min));
+ //if (node)
+ // return eb32_entry(node, struct pendconn, node);
+
+ return NULL;
}

/* Process the next pending connection from either a server or a proxy, and
@@ -100,8 +165,9 @@ static void pendconn_unlink(struct pendconn *p)
*/
static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
{
- struct pendconn *p = NULL;
+ struct pendconn *p = NULL, *pp = NULL;
struct server *rsrv;
+ u32 pkey, ppkey;
int remote;

rsrv = srv->track;
@@ -109,38 +175,53 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
rsrv = srv;

if (srv->nbpend) {
- list_for_each_entry(p, &srv->pendconns, list) {
+ for (p = next_pendconn(&srv->pendconns, offset_boundary());
+ p;
+ p = next_pendconn(&srv->pendconns, key_incr(p->node.key)))
if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &p->lock))
- goto ps_found;
- }
- p = NULL;
+ break;
}
-
- ps_found:
- if (srv_currently_usable(rsrv) && px->nbpend) {
- struct pendconn *pp;
-
- list_for_each_entry(pp, &px->pendconns, list) {
- /* If the server pendconn is older than the proxy one,
- * we process the server one. */
- if (p && !tv_islt(&pp->strm->logs.tv_request, &p->strm->logs.tv_request))
- goto pendconn_found;
-
- if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock)) {
- /* Let's switch from the server pendconn to the
- * proxy pendconn. Don't forget to unlock the
- * server pendconn, if any. */
- if (p)
- HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
- p = pp;
- goto pendconn_found;
- }
- }
+ if (px->nbpend) {
+ for (pp = next_pendconn(&px->pendconns, offset_boundary());
+ pp;
+ pp = next_pendconn(&px->pendconns, key_incr(pp->node.key)))
+ if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock))
+ break;
}

- if (!p)
+ if (!p && !pp)
return 0;

+ if (p && !pp)
+ goto pendconn_found;
+ if (pp && !p) {
+ p = pp;
+ goto pendconn_found;
+ }
+ if (key_class(p->node.key) < key_class(pp->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ goto pendconn_found;
+ }
+ if (key_class(pp->node.key) < key_class(p->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ goto pendconn_found;
+ }
+
+ pkey = key_offset(p->node.key);
+ ppkey = key_offset(pp->node.key);
+ if (pkey < offset_boundary())
+ pkey += (TIMER_LOOK_BACK >> 12);
+ if (ppkey < offset_boundary())
+ ppkey += (TIMER_LOOK_BACK >> 12);
+ if (pkey <= ppkey) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ goto pendconn_found;
+ }
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ goto pendconn_found;
+
pendconn_found:
pendconn_unlink(p);
p->strm_flags |= SF_ASSIGNED;
@@ -206,6 +287,7 @@ struct pendconn *pendconn_add(struct stream *strm)
srv = objt_server(strm->target);
px = strm->be;

+ p->node.key = ((u32)(strm->priority_class + 0x7ff) << 20) | ((u32)(now_ms + strm->priority_offset) & 0xfffff);
p->srv = NULL;
p->px = px;
p->strm = strm;
@@ -219,7 +301,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.srv_queue_size += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
- LIST_ADDQ(&srv->pendconns, &p->list);
+ eb32_insert(&srv->pendconns, &p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
}
else {
@@ -228,7 +310,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->logs.prx_queue_size += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
- LIST_ADDQ(&px->pendconns, &p->list);
+ eb32_insert(&px->pendconns, &p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &px->lock);
}
HA_ATOMIC_ADD(&px->totpend, 1);
@@ -241,7 +323,8 @@ struct pendconn *pendconn_add(struct stream *strm)
*/
int pendconn_redistribute(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int xferred = 0;
int remote = 0;

@@ -251,7 +334,10 @@ int pendconn_redistribute(struct server *s)
return 0;

HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
- list_for_each_entry_safe(p, pback, &s->pendconns, list) {
+ for (node = eb32_first(&s->pendconns);
+ node;
+ node = eb32_lookup_ge(&s->pendconns, node->key + 1)) {
p = eb32_entry(node, struct pendconn, node);
if (p->strm_flags & SF_FORCE_PRST)
continue;

@@ -280,7 +366,8 @@ int pendconn_redistribute(struct server *s)
*/
int pendconn_grab_from_px(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int maxconn, xferred = 0;
int remote = 0;

@@ -289,7 +376,10 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- list_for_each_entry_safe(p, pback, &s->proxy->pendconns, list) {
+ for (node = eb32_first(&s->proxy->pendconns);
+ node;
+ node = eb32_lookup_ge(&s->proxy->pendconns, node->key + 1)) {
p = eb32_entry(node, struct pendconn, node);
if (s->maxconn && s->served + xferred >= maxconn)
break;

@@ -337,7 +427,7 @@ int pendconn_dequeue(struct stream *strm)

/* the pendconn is still linked to the server/proxy queue, so unlock it
* and go away. */
- if (!LIST_ISEMPTY(&p->list)) {
+ if (p->node.node.leaf_p) {
HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
return 1;
}
@@ -369,19 +459,19 @@ void pendconn_free(struct pendconn *p)
HA_SPIN_LOCK(PENDCONN_LOCK, &p->lock);

/* The pendconn was already unlinked, just release it. */
- if (LIST_ISEMPTY(&p->list))
+ if (!p->node.node.leaf_p)
goto release;

if (p->srv) {
HA_SPIN_LOCK(SERVER_LOCK, &p->srv->lock);
p->srv->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &p->srv->lock);
}
else {
HA_SPIN_LOCK(PROXY_LOCK, &p->px->lock);
p->px->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &p->px->lock);
}
HA_ATOMIC_SUB(&p->px->totpend, 1);
@@ -392,6 +482,118 @@ void pendconn_free(struct pendconn *p)
pool_free(pool_head_pendconn, p);
}

+static enum act_return action_set_priority_class(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ if (smp->data.u.sint < -0x7ff)
+ smp->data.u.sint = -0x7ff;
+ else if (smp->data.u.sint > 0x7ff)
+ smp->data.u.sint = 0x7ff;
+
+ s->priority_class = smp->data.u.sint;
+ return ACT_RET_CONT;
+}
+
+static enum act_return action_set_priority_offset(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ if (smp->data.u.sint < -0x7ffff)
+ smp->data.u.sint = -0x7ffff;
+ else if (smp->data.u.sint > 0x7ffff)
+ smp->data.u.sint = 0x7ffff;
+
+ s->priority_offset = smp->data.u.sint;
+
+ return ACT_RET_CONT;
+}
+
+static enum act_parse_ret parse_set_priority_class(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_class;
+ return ACT_RET_PRS_OK;
+}
+
+static enum act_parse_ret parse_set_priority_offset(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_offset;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/server.c b/src/server.c
index ebac357fb..5617d8d67 100644
--- a/src/server.c
+++ b/src/server.c
@@ -1575,7 +1575,7 @@ static struct server *new_server(struct proxy *proxy)
srv->obj_type = OBJ_TYPE_SERVER;
srv->proxy = proxy;
LIST_INIT(&srv->actconns);
- LIST_INIT(&srv->pendconns);
+ srv->pendconns = EB_ROOT;

if ((srv->priv_conns = calloc(global.nbthread, sizeof(*srv->priv_conns))) == NULL)
goto free_srv;
diff --git a/src/stream.c b/src/stream.c
index 1d0b22ca3..9b22becc9 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -220,6 +220,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority_class = 0;
+ s->priority_offset = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3
Patrick Hemmer
[PATCH 0/2] Re: Priority based queuing
May 11, 2018 07:00PM
Ok, so here is the full submission for priority based queuing.

Notes since previous update:

Wasn't really able to optimize the tree search any. Tried a few things,
but nothing made a measurable performance difference.

I added a warning message and documentation making clear the issues with
timestamp wrapping.
Though one thing that might not be completely obvious is that even if
the user does not configure `set-priority-offset` at all, they're still
susceptible to the wrapping issue as the priority is the queue key
whether priority is adjusted or not.

The implementation of the %sq (srv_queue) and %bq (backend_queue) was
changed to keep the description accurate. The description is "number of
requests which were processed before this one". The previous
implementation just stored the size of the queue at the time the
connection was queued. Since we can inject a connection into the middle
of the queue, this no longer works. Now we keep a count of dequeued
connections, and take the difference between that count at the time the
connection was queued and at the time it was dequeued. This also means
the value will be slightly
different even for users who don't use priority, as the previous method
would have included connections which closed without being processed.
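
Conceptually, the new measurement reduces to two counter snapshots; a
minimal sketch with invented names, not the actual patch code:

#include <stdint.h>

struct queue_counters {
	unsigned int dequeued;	/* total entries de-queued so far; wraps */
};

/* taken at enqueue time and stored with the stream */
static unsigned int enqueue_snapshot(const struct queue_counters *q)
{
	return q->dequeued;
}

/* "requests processed before this one": unsigned subtraction makes
 * this robust to the counter wrapping */
static unsigned int queue_position(const struct queue_counters *q,
                                   unsigned int snapshot)
{
	return q->dequeued - snapshot;
}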

I added sample fetches for retrieving the class/offset of the current
transaction. I think it might be beneficial to add some other fetches
for tracking the health of the queue, such as average class/offset, or
an exponential moving average of the class/offset for requests added to
the queue, requests processed, and requests which closed/timed out. But
this is just more stuff the code would have to store, so unsure if
they're worth it.


I wasn't convinced the 64-bit key was a bad idea, so I implemented the
idea with a 12/52 split and an absolute timestamp. On my system (which
is 64-bit) the performance is about 20% faster. The code is much
simpler. And it also solves the limitations and issues with wrapping.
The patch for this is included in case it's of interest.
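
For reference, the 12/52 packing reduces to a sketch like this (it assumes
a 64-bit absolute millisecond clock, called now_ms64 here, which stock
HAProxy does not expose):

#include <stdint.h>

/* 12-bit class (biased to unsigned) in the top bits, 52-bit absolute
 * millisecond timestamp below: 2^52 ms is on the order of 142,000
 * years, so wrapping becomes a non-issue.
 */
static inline uint64_t priority_key(int16_t prio_class, int32_t prio_offset,
                                    uint64_t now_ms64)
{
	return ((uint64_t)(prio_class + 0x7ff) << 52) |
	       ((now_ms64 + prio_offset) & 0xFFFFFFFFFFFFFULL);
}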

-Patrick


Patrick Hemmer (2):
MEDIUM: add set-priority-class and set-priority-offset
use a 64-bit int with absolute timestamp for priority-offset

doc/configuration.txt | 38 +++++++
doc/lua-api/index.rst | 18 ++++
include/types/proxy.h | 3 +-
include/types/queue.h | 2 +-
include/types/server.h | 3 +-
include/types/stream.h | 7 +-
src/cfgparse.c | 15 +++
src/hlua.c | 69 +++++++++----
src/log.c | 4 +-
src/proto_http.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 261 +++++++++++++++++++++++++++++++++++++++++++--------
src/server.c | 2 +-
src/stream.c | 10 +-
14 files changed, 366 insertions(+), 72 deletions(-)

--
2.16.3
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content.
The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first.
---
doc/configuration.txt | 38 ++++++
doc/lua-api/index.rst | 18 +++
include/types/proxy.h | 3 +-
include/types/queue.h | 2 +-
include/types/server.h | 3 +-
include/types/stream.h | 7 +-
src/cfgparse.c | 15 +++
src/hlua.c | 74 ++++++++---
src/log.c | 4 +-
src/proto_http.c | 4 +-
src/proxy.c | 2 +-
src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
src/server.c | 2 +-
src/stream.c | 10 +-
14 files changed, 453 insertions(+), 74 deletions(-)
---
include/types/queue.h | 2 +-
src/hlua.c | 5 --
src/queue.c | 144 +++++++++++---------------------------------------
3 files changed, 33 insertions(+), 118 deletions(-)
On 2018/5/11 12:52, Patrick Hemmer wrote:
> This adds the set-priority-class and set-priority-offset actions to
> http-request and tcp-request content.
> The priority values are used when connections are queued to determine
> which connections should be served first. The lowest priority class is
> served first. When multiple requests from the same class are found, the
> earliest (according to queue_time + offset) is served first.
> ---
> doc/configuration.txt | 38 ++++++
> doc/lua-api/index.rst | 18 +++
> include/types/proxy.h | 3 +-
> include/types/queue.h | 2 +-
> include/types/server.h | 3 +-
> include/types/stream.h | 7 +-
> src/cfgparse.c | 15 +++
> src/hlua.c | 74 ++++++++---
> src/log.c | 4 +-
> src/proto_http.c | 4 +-
> src/proxy.c | 2 +-
> src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
> src/server.c | 2 +-
> src/stream.c | 10 +-
> 14 files changed, 453 insertions(+), 74 deletions(-)
>
>
I found one issue I'll need to fix. During final code cleanup I
accidentally replaced good code with the `key_incr` function in 2 places
(pendconn_redistribute and pendconn_grab_from_px). Will fix and submit
new patch pending feedback.

-Patrick
Hi Patrick,

I'm finally back on this.

On Mon, May 14, 2018 at 01:05:03AM -0400, Patrick Hemmer wrote:
> On 2018/5/11 12:52, Patrick Hemmer wrote:
> > This adds the set-priority-class and set-priority-offset actions to
> > http-request and tcp-request content.
> > The priority values are used when connections are queued to determine
> > which connections should be served first. The lowest priority class is
> > served first. When multiple requests from the same class are found, the
> > earliest (according to queue_time + offset) is served first.
> > ---
> > doc/configuration.txt | 38 ++++++
> > doc/lua-api/index.rst | 18 +++
> > include/types/proxy.h | 3 +-
> > include/types/queue.h | 2 +-
> > include/types/server.h | 3 +-
> > include/types/stream.h | 7 +-
> > src/cfgparse.c | 15 +++
> > src/hlua.c | 74 ++++++++---
> > src/log.c | 4 +-
> > src/proto_http.c | 4 +-
> > src/proxy.c | 2 +-
> > src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
> > src/server.c | 2 +-
> > src/stream.c | 10 +-
> > 14 files changed, 453 insertions(+), 74 deletions(-)
> >
> >
> I found one issue I'll need to fix. During final code cleanup I
> accidentally replaced good code with the `key_incr` function in 2 places
> (pendconn_redistribute and pendconn_grab_from_px). Will fix and submit
> new patch pending feedback.

So I spent some time testing the code yesterday. Overall it works pretty
well, and allows to maintain very low response times for certain requests
while others are higher (when using the class) or to finely adjust the
response time distribution between requests.

I found that using classes under high loads result in the highest priority
requests to totally exclude the other ones, which is expected, while offsets
provide a smoother transition but may not fit all purposes either.

I noticed a strange effect which is that when injecting under low load with
a higher priority (either offset or class) than another high level traffic,
the response time on the higher priority traffic follows a sawtooth shape:
it progressively rises from 0 to 50-80ms and suddenly drops to zero again.

I looked at the code to see if something could cause that. I found that the
key increment could be a reason (you must restart from the next element,
not an upper value since there will be many duplicate keys) but it doesn't
change anything. It could be something in the lower layers as well, I don't
really know for now. At least what I've seen is that the server's response
time when attacked directly, is stable. It's very hard to say if this is
due to the change or not given that it was not possible to set priorities
in the past!

By the way, regarding this "restart from next" thing, I looked at the locking
and the element you find is perfectly safe to be used to get the next one. So
the tree progressing can definitely restart using eb32_next(node), modulo the
wrapping going back to eb32_first() as we do in the scheduler. I'll see if I
can propose you something around this.
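
In sketch form, such a restart-from-next walk could look like this
(hypothetical code mirroring the scheduler's pattern, not a proposal of
final code):

#include <stdint.h>
#include <eb32tree.h>

/* visit every pendconn in key order starting at start_key, wrapping
 * once through the beginning of the tree and stopping when back at
 * the starting point */
static void walk_pendconns(struct eb_root *root, uint32_t start_key)
{
	struct eb32_node *node;
	int wrapped = 0;

	node = eb32_lookup_ge(root, start_key);
	if (!node) {
		node = eb32_first(root);
		wrapped = 1;
	}
	while (node) {
		if (wrapped && node->key >= start_key)
			break;	/* completed one full round */
		/* process eb32_entry(node, struct pendconn, node) here */
		node = eb32_next(node);
		if (!node && !wrapped) {
			node = eb32_first(root);
			wrapped = 1;
		}
	}
}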

I found some minor suggestions that I can take care of, like this :

@@ -125,6 +125,9 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ int16_t priority_class; /* priority class of the stream for the pending queue */
+ int32_t priority_offset; /* priority offset of the stream for the pending queue */
+ unsigned int cntdepend; /* value of proxy/server cntdepend at time of enqueue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

This is creating a 48-bit hole between the two fields, I'll check if there's
a nearby 16-bit hole to place the priority_class entry. Otherwise I'll add
a comment to mention there's a hole there.

--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -8131,6 +8131,21 @@ out_uri_auth_compat:
}
}

+ if ((curproxy->mode == PR_MODE_TCP || curproxy->mode == PR_MODE_HTTP) &&
+ (curproxy->cap & PR_CAP_BE) && (curproxy->srv) &&
+ (!curproxy->timeout.queue || curproxy->timeout.queue > (((TIMER_LOOK_BACK >> 12) & 0xfffff) / 2))) {
+ // Note that this warning isn't comprehensive.
+ // if the user specifies set-priority-offset > 'timeout queue`, wrapping
+ // may occur and get de-queued out of order. But logging this every
+ // single time might be too noisy.
+ ha_warning("config : excessive timeout queue for backend '%s'.\n"
+ " | The 'timeout queue' setting is either missing, or exceeds the maximum\n"
+ " | priority offset. If a request sits in the queue for longer than the maximum\n"
+ " | priority offset, it may get de-queued out of order.\n",
+ curproxy->id);
+ err_code |= ERR_WARN;
+ }
+
if ((curproxy->options2 & PR_O2_CHK_ANY) == PR_O2_SSL3_CHK) {
curproxy->check_len = sizeof(sslv3_client_hello_pkt) - 1;
curproxy->check_req = malloc(curproxy->check_len);

This should be part of another commit as it will possibly cause warnings
with currently valid configurations. Also, given that there is possibly
nothing the user can do about it, I'm really not sure that a warning is
welcome there. The purpose of warnings is always to encourage users to
fix them. This would be pretty fine for a diagnostic level though (which
sadly we still don't have).

For all the bounds set on the class and offset values, it would be better
to have some functions than to perform these checks everywhere :

static inline int queue_adjust_class(int class)
{
if (class < -0x7ff)
class = -0x7ff;
else if (class > 0x7ff)
class = 0x7ff;
return class;
}

static inline int queue_adjust_offset(int offset)
{
if (offset < -0x7ffff)
offset = -0x7ffff;
else if (offset > 0x7ffff)
offset = 0x7ffff;
return offset;
}


--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -322,9 +322,10 @@ struct proxy {
int serverfin; /* timeout to apply to server half-closed connections */
} timeout;
char *id, *desc; /* proxy id (name) and description */
- struct list pendconns; /* pending connections with no server assigned yet */
+ struct eb_root pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
+ unsigned int cntdepend; /* number of pending connections which have been de-queued */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
struct freq_ctr fe_req_per_sec; /* HTTP requests per second on the frontend */
struct freq_ctr fe_conn_per_sec; /* received connections per second on the frontend */

We need to find a more suitable name for this "cntdepend" as the name
really doesn't convey what the field does. I don't have anything
to propose for now I'm afraid.


--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -105,8 +105,8 @@ struct strm_logs {
long t_connect; /* delay before the connect() to the server succeeds, -1 if never occurs */
long t_data; /* delay before the first data byte from the server ... */
unsigned long t_close; /* total stream duration */
- unsigned long srv_queue_size; /* number of streams waiting for a connect slot on this server at accept() time (in di
- unsigned long prx_queue_size; /* overall number of streams waiting for a connect slot on this instance at accept() t
+ unsigned long srv_queue_pos; /* number of streams de-queued while waiting for a connection slot on this server */
+ unsigned long prx_queue_pos; /* number of streams de-queued while waiting for a connection slot on this instance */
long long bytes_in; /* number of bytes transferred from the client to the server */
long long bytes_out; /* number of bytes transferred from the server to the client */
};

I found that a non-negligible part of the patch is in fact due only to
this rename and that needlessly impacts the review or code study (eg if
we later bisect to this patch when digging an issue). I'm not against
this rename which apparently makes sense, but we should do it in a first
patch. In fact it's the same for the cntdepend one above, apparently you
store a new metric in the proxy which helps calculate the queue size
differently, and this change is independent of the offset+class change,
so let's do it separately as well. If we later notice a change in the
reported queue sizes or any such thing, it will be easier to know if
it's a different reporting due to the way the measure is now performed
or if it's a result of the change of algorithm.

@@ -2467,6 +2469,10 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
return t; /* nothing more to do */
}

+ // remove from pending queue here so we update counters
+ if (s->pend_pos)
+ pendconn_free(s->pend_pos);
+
if (s->flags & SF_BE_ASSIGNED)
HA_ATOMIC_SUB(&s->be->beconn, 1);

This part is not supposed to be needed since it's already performed
in stream_free() a few lines later. Or if it's required to get the
queue values correctly before logging, then something similar will
be required at every place where we log after trying to connect to
a server (there are a few places depending on the log format and
the logasap option).

Now I think I understand how the cntdepend works: isn't it simply a
wrapping queue index that you use to measure the difference between
when you queued and when you dequeued ? In this case, wouldn't something
like "queue_idx" be more explicit ? At least it becomes more obvious when
doing the measure that we measure a length by subtracting two indexes.
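
To illustrate (a minimal sketch with invented names, not the actual patch
code): two wrapping unsigned sequence numbers give a position by simple
subtraction, and the subtraction stays correct across the 2^32 wrap thanks
to unsigned arithmetic:

	/* incremented once per dequeue on the server/proxy */
	unsigned int queue_idx;

	struct waiter {
		unsigned int enq_idx;   /* snapshot of queue_idx at enqueue time */
	};

	/* how many entries were dequeued while 'w' was waiting */
	static inline unsigned int dequeued_while_waiting(const struct waiter *w)
	{
		return queue_idx - w->enq_idx;  /* wraps safely, unsigned math */
	}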

I'm going back to trying to add some detailed timestamps at some points
in order to figure out whether we're spending a growing time in the queue or
anywhere else regarding this strange timer variation thing.

Cheers,
Willy
On Wed, May 30, 2018 at 10:00:24AM +0200, Willy Tarreau wrote:
> I noticed a strange effect which is that when injecting under low load with
> a higher priority (either offset or class) than another high level traffic,
> the response time on the higher priority traffic follows a sawtooth shape,
> it progressively raises from 0 to 50-80ms and suddenly drops to zero again.

OK I found what causes this, and it totally makes sense. It's due to the
fact that I'm using two independent injectors, one requesting a 10ms page
and the other one requesting a 100ms one and able to fill the queue. Each
time a slow request is dequeued, it's one less slot available for a quick
request, so the average service time increases, resulting in a higher average
wait time in the queue for the first slot to be empty. As fast requests are
slowed down, there are more opportunities to add slow ones, hence to slow
down the service, until the point where 100% of the slow requests are being
served in parallel, resulting in none of them in the queue, which is filled
with the fast ones. As soon as all these slow requests are completed, all
the fast ones are served immediately, resulting in a much faster service
time for all of them, and progressively the slow ones come again.

So this is completely normal and expected in this test. It's just not
intuitive.

The way to combat this is to add another setting which we currently don't
have: the maximum load a server may have for a given request to still be
sent to it, otherwise the request is forced into the queue. For example, if
we say that the slow requests cannot use more than 90% of a server's
connections, there will always be 10% available for the other ones, thus
completely eliminating the queue for them. It's a bit trickier to implement
because it requires that when we dequeue pendconns, if we find one which
doesn't validate the server's load, we try the next one, and this can be
expensive, especially since most of the time there will be very few requests
allowed to use the server to the max. A speedup would be necessary, involving
a two-dimensional tree lookup, or maybe a higher bit field containing the
server's available slots (two's complement of the entry above, looked up
from maxconn-currconn).
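
Just to make the idea concrete, a rough sketch of such a dequeue loop
(invented structure and field names; nothing like this exists in the patch
set yet):

	#include <eb32tree.h>

	struct pendconn_sketch {
		struct eb32_node node;
		int max_load;           /* e.g. 90 = "only if server < 90% busy" */
	};

	static struct pendconn_sketch *
	dequeue_with_load_bound(struct eb_root *queue, int cur_sess, int maxconn)
	{
		int load = maxconn ? cur_sess * 100 / maxconn : 0;
		struct eb32_node *n;

		for (n = eb32_first(queue); n; n = eb32_next(n)) {
			struct pendconn_sketch *p =
				eb32_entry(n, struct pendconn_sketch, node);

			if (load < p->max_load) {
				eb32_delete(&p->node);
				return p; /* best-priority entry allowed at this load */
			}
			/* else keep walking; this is the expensive part mentioned above */
		}
		return NULL;
	}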

That's possibly something to think about in the future but it needs further
investigation.

Cheers,
Willy
On 2018/5/30 04:00, Willy Tarreau wrote:
> Hi Patrick,
>
> I'm finally back on this.
>
> On Mon, May 14, 2018 at 01:05:03AM -0400, Patrick Hemmer wrote:
>> On 2018/5/11 12:52, Patrick Hemmer wrote:
>>> This adds the set-priority-class and set-priority-offset actions to
>>> http-request and tcp-request content.
>>> The priority values are used when connections are queued to determine
>>> which connections should be served first. The lowest priority class is
>>> served first. When multiple requests from the same class are found, the
>>> earliest (according to queue_time + offset) is served first.
>>> ---
>>> doc/configuration.txt | 38 ++++++
>>> doc/lua-api/index.rst | 18 +++
>>> include/types/proxy.h | 3 +-
>>> include/types/queue.h | 2 +-
>>> include/types/server.h | 3 +-
>>> include/types/stream.h | 7 +-
>>> src/cfgparse.c | 15 +++
>>> src/hlua.c | 74 ++++++++---
>>> src/log.c | 4 +-
>>> src/proto_http.c | 4 +-
>>> src/proxy.c | 2 +-
>>> src/queue.c | 345 ++++++++++++++++++++++++++++++++++++++++++-------
>>> src/server.c | 2 +-
>>> src/stream.c | 10 +-
>>> 14 files changed, 453 insertions(+), 74 deletions(-)
>>>
>>>
>> I found one issue I'll need to fix. During final code cleanup I
>> accidentally replaced good code with the `key_incr` function in 2 places
>> (pendconn_redistribute and pendconn_grab_from_px). Will fix and submit
>> new patch pending feedback.
> So I spent some time testing the code yesterday. Overall it works pretty
> well, and makes it possible to maintain very low response times for certain
> requests while others are higher (when using the class), or to finely adjust
> the response time distribution between requests.
>
> I found that using classes under high load results in the highest priority
> requests totally excluding the other ones, which is expected, while offsets
> provide a smoother transition but may not fit all purposes either.
>
> I noticed a strange effect which is that when injecting under low load with
> a higher priority (either offset or class) than another high level traffic,
> the response time on the higher priority traffic follows a sawtooth shape,
> it progressively raises from 0 to 50-80ms and suddenly drops to zero again.
>
> I looked at the code to see if something could cause that. I found that the
> key increment could be a reason (you must restart from the next element,
> not an upper value since there will be many duplicate keys)
Gah, I completely forgot about duplicate keys. Will fix.
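
For reference, the pitfall looks like this (minimal sketch using the ebtree
API; 'queue' and 'key' are placeholders):

	struct eb32_node *node = eb32_lookup_ge(&queue, key);

	/* wrong: restarting from an upper key skips the remaining
	 * entries that share the same key */
	node = eb32_lookup_ge(&queue, node->key + 1);

	/* right: visits duplicates in insertion order before larger keys */
	node = eb32_next(node);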

> but it doesn't
> change anything. It could be something in the lower layers as well, I don't
> really know for now. At least what I've seen is that the server's response
> time when attacked directly, is stable. It's very hard to say if this is
> due to the change or not given that it was not possible to set priorities
> in the past!
>
> By the way, regarding this "restart from next" thing, I looked at the locking
> and the element you find is perfectly safe to be used to get the next one. So
> the tree walk can definitely restart using eb32_next(node), modulo the
> wrapping going back to eb32_first() as we do in the scheduler. I'll see if I
> can propose you something around this.
>
> I found some minor suggestions that I can take care of, like this:
>
> @@ -125,6 +125,9 @@ struct stream {
>
> struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
> struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
> + int16_t priority_class; /* priority class of the stream for the pending queue */
> + int32_t priority_offset; /* priority offset of the stream for the pending queue */
> + unsigned int cntdepend; /* value of proxy/server cntdepend at time of enqueue */
>
> struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */
>
> This is creating a 48-bit hole between the two fields, I'll check if there's
> a nearby 16-bit hole to place the priority_class entry. Otherwise I'll add
> a comment to mention there's a hole there.
>
> --- a/src/cfgparse.c
> +++ b/src/cfgparse.c
> @@ -8131,6 +8131,21 @@ out_uri_auth_compat:
> }
> }
>
> + if ((curproxy->mode == PR_MODE_TCP || curproxy->mode == PR_MODE_HTTP) &&
> + (curproxy->cap & PR_CAP_BE) && (curproxy->srv) &&
> + (!curproxy->timeout.queue || curproxy->timeout.queue > (((TIMER_LOOK_BACK >> 12) & 0xfffff) / 2))) {
> + // Note that this warning isn't comprehensive.
> + // If the user specifies set-priority-offset > `timeout queue`, wrapping
> + // may occur and requests may get de-queued out of order. But logging
> + // this every single time might be too noisy.
> + ha_warning("config : excessive timeout queue for backend '%s'.\n"
> + " | The 'timeout queue' setting is either missing, or exceeds the maximum\n"
> + " | priority offset. If a request sits in the queue for longer than the maximum\n"
> + " | priority offset, it may get de-queued out of order.\n",
> + curproxy->id);
> + err_code |= ERR_WARN;
> + }
> +
> if ((curproxy->options2 & PR_O2_CHK_ANY) == PR_O2_SSL3_CHK) {
> curproxy->check_len = sizeof(sslv3_client_hello_pkt) - 1;
> curproxy->check_req = malloc(curproxy->check_len);
>
> This should be part of another commit as it will possibly cause warnings
> with currently valid configurations. Also, given that there is possibly
> nothing the user can do about it, I'm really not sure that a warning is
> welcome there. The purpose of warnings is always to encourage users to
> fix them. This would be pretty fine for a diagnostic level though (which
> sadly we still don't have).
The warning is addressable. It means the user should add a `timeout
queue` and set it to less than 524,287ms. This is similar to the "missing
timeouts for frontend/backend" warning we print. Though I think it would
make sense to add "queue" to that existing warning instead, as it
already mentions the timeouts for client, connect & server.
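
For example (a hypothetical config sketch; the file name, class values and
server are invented), the intent would be something like:

	backend app
	    # keep queue residency bounded well below the 524,287ms offset range
	    timeout queue 30s
	    # suspicious sources go to the back of the queue, requests with a
	    # valid application cookie go to the front
	    http-request set-priority-class int(10) if { src -f /etc/haproxy/cloud-ranges.lst }
	    http-request set-priority-class int(-10) if { req.cook(sessionid) -m found }
	    server s1 192.168.0.10:80 maxconn 100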

>
> For all the bounds set on the class and offset values, it would be better
> to have helper functions rather than performing these checks everywhere:
>
> static inline int queue_adjust_class(int class)
> {
> 	if (class < -0x7ff)
> 		class = -0x7ff;
> 	else if (class > 0x7ff)
> 		class = 0x7ff;
> 	return class;
> }
>
> static inline int queue_adjust_offset(int offset)
> {
> 	if (offset < -0x7ffff)
> 		offset = -0x7ffff;
> 	else if (offset > 0x7ffff)
> 		offset = 0x7ffff;
> 	return offset;
> }
>
>
> --- a/include/types/proxy.h
> +++ b/include/types/proxy.h
> @@ -322,9 +322,10 @@ struct proxy {
> int serverfin; /* timeout to apply to server half-closed connections */
> } timeout;
> char *id, *desc; /* proxy id (name) and description */
> - struct list pendconns; /* pending connections with no server assigned yet */
> + struct eb_root pendconns; /* pending connections with no server assigned yet */
> int nbpend; /* number of pending connections with no server assigned yet */
> int totpend; /* total number of pending connections on this instance (for stats) */
> + unsigned int cntdepend; /* number of pending connections which have been de-queued */
> unsigned int feconn, beconn; /* # of active frontend and backends streams */
> struct freq_ctr fe_req_per_sec; /* HTTP requests per second on the frontend */
> struct freq_ctr fe_conn_per_sec; /* received connections per second on the frontend */
>
> We need to find a more suitable name for this "cntdepend" as the name
> really doesn't tell me anything about what it does. I don't have anything
> to propose for now, I'm afraid.
Yeah, not fond of the name, but it was something, and is easy to change.
The name originated from "cnt"=counter, and "pend" to be consistent with
the naming of "nbpend" and "totpend" right above it. So: counter of
de-queued pending connections.

>
>
> --- a/include/types/stream.h
> +++ b/include/types/stream.h
> @@ -105,8 +105,8 @@ struct strm_logs {
> long t_connect; /* delay before the connect() to the server succeeds, -1 if never occurs */
> long t_data; /* delay before the first data byte from the server ... */
> unsigned long t_close; /* total stream duration */
> - unsigned long srv_queue_size; /* number of streams waiting for a connect slot on this server at accept() time (in direct assignment) */
> - unsigned long prx_queue_size; /* overall number of streams waiting for a connect slot on this instance at accept() time */
> + unsigned long srv_queue_pos; /* number of streams de-queued while waiting for a connection slot on this server */
> + unsigned long prx_queue_pos; /* number of streams de-queued while waiting for a connection slot on this instance */
> long long bytes_in; /* number of bytes transferred from the client to the server */
> long long bytes_out; /* number of bytes transferred from the server to the client */
> };
>
> I found that a non-negligible part of the patch is in fact due only to
> this rename and that needlessly impacts the review or code study (eg if
> we later bisect to this patch when digging an issue). I'm not against
> this rename which apparently makes sense, but we should do it in a first
> patch. In fact it's the same for the cntdepend one above: apparently you
> store a new metric in the proxy which helps calculate the queue size
> differently, and this change is independent of the offset+class change,
> so let's do it separately as well. If we later notice a change in the
> reported queue sizes or any such thing, it will be easier to know if
> it's a different reporting due to the way the measure is now performed
> or if it's a result of the change of algorithm.
Agree.

>
> @@ -2467,6 +2469,10 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
> return t; /* nothing more to do */
> }
>
> + // remove from pending queue here so we update counters
> + if (s->pend_pos)
> + pendconn_free(s->pend_pos);
> +
> if (s->flags & SF_BE_ASSIGNED)
> HA_ATOMIC_SUB(&s->be->beconn, 1);
>
> This part is not supposed to be needed since it's already performed
> in stream_free() a few lines later. Or if it's required to get the
> queue values correctly before logging, then something similar will
> be required at every place where we log after trying to connect to
> a server (there are a few places depending on the log format and
> the logasap option).
Yes, it was put there to update the counters before logging. Rethinking
this, the updating of the counters needs to be moved into
pendconn_process_next_strm anyway. Since this function updates
px->cntdepend and is called in a loop, calculating logs.prx_queue_pos
after this loop completes will yield incorrect values.

>
> Now I think I understand how the cntdepend works: isn't it simply a
> wrapping queue index that you use to measure the difference between
> when you queued and when you dequeued ? In this case, wouldn't something
> like "queue_idx" be more explicit ? At least it becomes more obvious when
> doing the measure that we measure a length by subtracting two indexes.
It is a measure of a difference, yes, but it's the difference in how
many items were processed off the queue. I personally wouldn't call it
an index, as index implies position. `px->cntdepend` isn't a position of
anything, but is a counter of processed connections. `strm->cntdepend`
isn't a position either, but a snapshot of the px->cntdepend value when
it was queued. When you combine the 2 you get a position.

>
> I'm going back to trying to add some detailed timestamps at some points
> in order to figure out whether we're spending a growing time in the queue or
> anywhere else regarding this strange timer variation thing.
>
> Cheers,
> Willy

Did you have any thoughts on the 64-bit absolute timestamp patch? It was
my understanding that the earlier objections were due to anticipated
performance and complexity issues. But the performance is better, the
code is simpler, and it has fewer issues (queue timeout, wrapping at extreme
offset values, etc.).

I'll try to get to submitting a revised patch set in the next couple days.

-Patrick
Hi Patrick,

On Thu, May 31, 2018 at 12:16:27AM -0400, Patrick Hemmer wrote:
> > I looked at the code to see if something could cause that. I found that the
> > key increment could be a reason (you must restart from the next element,
> > not an upper value since there will be many duplicate keys)
> Gah, I completely forgot about duplicate keys. Will fix.

I'll send you (later today) some split patches that will help with
this. They simplify review and experimentation on the walk algorithm.

> The warning is addressable. It means the user should add a `timeout
> queue` and set it to less than 524,287ms. This is similar to the "missing
> timeouts for frontend/backend" warning we print. Though I think it would
> make sense to add "queue" to that existing warning instead, as it
> already mentions the timeouts for client, connect & server.

In fact the problem I'm having is that I've already seen some very high
timeout values in the field (e.g. on a print server), which means that some
people do need to have a large value here. For sure they don't need to
reach 24 days but 8min will definitely be too short for certain extreme
cases. And I understand the impact of such large values with the queue,
I just don't know how we can address this at the moment, it's what I'm
still thinking about.

> > We need to find a more suitable name for this "cntdepend" as the name
> > really doesn't tell me anything about what it does. I don't have anything
> > to propose for now, I'm afraid.
> Yeah, not fond of the name, but it was something, and is easy to change.
> The name originated from "cnt"=counter, and "pend" to be consistent with
> the naming of "nbpend" and "totpend" right above it. So: counter of
> de-queued pending connections.

OK. In fact it's because it's a wrapping counter that I considered it
could be an index for the srv and px parts, while the stream part stores
a distance between the recorded index and the current index.

> > @@ -2467,6 +2469,10 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
> > return t; /* nothing more to do */
> > }
> >
> > + // remove from pending queue here so we update counters
> > + if (s->pend_pos)
> > + pendconn_free(s->pend_pos);
> > +
> > if (s->flags & SF_BE_ASSIGNED)
> > HA_ATOMIC_SUB(&s->be->beconn, 1);
> >
> > This part is not supposed to be needed since it's already performed
> > in stream_free() a few lines later. Or if it's required to get the
> > queue values correctly before logging, then something similar will
> > be required at every place where we log after trying to connect to
> > a server (there are a few places depending on the log format and
> > the logasap option).
> Yes, it was put there to update the counters before logging. Rethinking
> this, the updating of the counters needs to be moved into
> pendconn_process_next_strm anyway. Since this function updates
> px->cntdepend and is called in a loop, calculating logs.prx_queue_pos
> after this loop completes will yield incorrect values.

I thought about the same but was not yet completely certain. I suspect in
fact the stream's value should indeed be updated upon dequeue, and that
only the error case has to be taken into account before logging (if there
was no dequeue). In this case I think we can have this one performed here
provided we ensure other log places will not be triggered on error.

> > Now I think I understand how the cntdepend works: isn't it simply a
> > wrapping queue index that you use to measure the difference between
> > when you queued and when you dequeued ? In this case, wouldn't something
> > like "queue_idx" be more explicit ? At least it becomes more obvious when
> > doing the measure that we measure a length by subtracting two indexes.
> It is a measure of a difference, yes, but it's the difference in how
> many items were processed off the queue. I personally wouldn't call it
> an index, as index implies position.

I personally see it as a position. Well exactly a sequence number in fact :-)

> `px->cntdepend` isn't a position of
> anything, but is a counter of processed connections.

Except that it's a wrapping counter, and I'd hate to be tempted to use it as
a counter with the same meaning as what we use in stats.

> `strm->cntdepend`
> isn't a position either, but a snapshot of the px->cntdepend value when
> it was queued. When you combine the 2 you get a position.
(...)
> Did you have any thoughts on the 64-bit absolute timestamp patch? It was
> my understanding that the earlier objections were due to anticipated
> performance and complexity issues. But the performance is better, the
> code is simpler, and has less issues (queue timeout, wrapping at extreme
> offset values, etc).

Yes, I had some thoughts, but mixed ones. It's indeed simpler, but from the
beginning there is something I don't like in it that I'm totally unable to
explain. I really hate this because I know it feels totally unfair, but I'm
convinced that if we take this route we'll regret it later. I'm a bit
embarrassed, and I have been thinking for some time about what bothers me,
but I can't quite pin it down. The root cause of the problem is changing the
way the timers work. Everywhere we use a perfectly fine wrapping timer that
can run continuously and ensures that bugs (if any) are quickly met and
resolved. Here with a non-wrapping time, we more or less consider that given
that we'll all be dead when it wraps, we don't care about what happens. And
this "don't care" approach really doesn't appeal to me, because we may
overlook serious problems that will trigger much earlier than expected, or
could depend on the platform or anything. You know, a bit like people using
"9/9/99" as an infinite time 25 years ago, until their applications stopped
on this date, or long-term contracts were dropped or anything.

I don't like the extra cost of 64-bit processing for 32-bit platforms, but I
also know that we care much less about 32-bit platforms when it comes to
performance, so if technically we need it I'm OK with it. I don't like the
*1000 and /1000 done there either, adding a (small) cost to the processing,
but that can be addressed as well if we really need this. I just really want
to ensure everything works fine by construction, and not just because we're
in a certain window of time which makes the numbers fit well together.

> I'll try to get to submitting a revised patch set in the next couple days.

I'll send you privately the split-up of your patch set that I did yesterday,
it will help you and will be better for me once you send it back. I can even
merge some parts of it already, those which are probably definitive.

Thanks,
Willy
On 2018/5/31 00:57, Willy Tarreau wrote:
> Hi Patrick,
>
> On Thu, May 31, 2018 at 12:16:27AM -0400, Patrick Hemmer wrote:
>>> I looked at the code to see if something could cause that. I found that the
>>> key increment could be a reason (you must restart from the next element,
>>> not an upper value since there will be many duplicate keys)
>> Gah, I completely forgot about duplicate keys. Will fix.
> I'll send you (later today) some split patches that will help with
> this. They simplify review and experimentation on the walk algorithm.
I kept the patch set you created, and have made some minor adjustments
to address the mentioned issues.

I think the only uncertainty I had during implementation was an atomic
operator for retrieving the queue_idx value inside
stream_process_counters. There's no load operator, but I'm not sure if
that is by design, or what the preferred solution is here. In theory if
the value is aligned, the load is atomic anyway and we don't need an
explicit atomic operator. But I'm not sure if that's an assumption we
want to make.
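
For context, the two options as I see them (the second assumes calling the
gcc/clang builtin directly would be acceptable, which may not match the
HA_ATOMIC_* wrapper conventions):

	/* what the patch does today: an atomic add of 0 acts as a load */
	queue_idx = HA_ATOMIC_ADD(&s->pend_pos->srv->queue_idx, 0);

	/* possible alternative: an explicit relaxed load, which is enough
	 * for a statistics snapshot of an aligned 32-bit value */
	queue_idx = __atomic_load_n(&s->pend_pos->srv->queue_idx, __ATOMIC_RELAXED);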

>> The warning is addressable. It means the user should add a `timeout
>> queue` and set it to less than 524,287ms. This is similar to the "missing
>> timeouts for frontend/backend" warning we print. Though I think it would
>> make sense to add "queue" to that existing warning instead, as it
>> already mentions the timeouts for client, connect & server.
> In fact the problem I'm having is that I've already seen some very high
> timeout values in the field (e.g. on a print server), which means that some
> people do need to have a large value here. For sure they don't need to
> reach 24 days but 8min will definitely be too short for certain extreme
> cases. And I understand the impact of such large values with the queue,
> I just don't know how we can address this at the moment, it's what I'm
> still thinking about.
>
>>> We need to find a more suitable name for this "cntdepend" as the name
>>> really doesn't tell me anything about what it does. I don't have anything
>>> to propose for now, I'm afraid.
>> Yeah, not fond of the name, but it was something, and is easy to change.
>> The name originated from "cnt"=counter, and "pend" to be consistent with
>> the naming of "nbpend" and "totpend" right above it. So: counter of
>> de-queued pending connections.
> OK. In fact it's because it's a wrapping counter that I considered it
> could be an index for the srv and px parts, while the stream part stores
> a distance between the recorded index and the current index.
>
>>> @@ -2467,6 +2469,10 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
>>> return t; /* nothing more to do */
>>> }
>>>
>>> + // remove from pending queue here so we update counters
>>> + if (s->pend_pos)
>>> + pendconn_free(s->pend_pos);
>>> +
>>> if (s->flags & SF_BE_ASSIGNED)
>>> HA_ATOMIC_SUB(&s->be->beconn, 1);
>>>
>>> This part is not supposed to be needed since it's already performed
>>> in stream_free() a few lines later. Or if it's required to get the
>>> queue values correctly before logging, then something similar will
>>> be required at every place where we log after trying to connect to
>>> a server (there are a few places depending on the log format and
>>> the logasap option).
>> Yes, it was put there to update the counters before logging. Rethinking
>> this, the updating of the counters needs to be moved into
>> pendconn_process_next_strm anyway. Since this function updates
>> px->cntdepend and is called in a loop, calculating logs.prx_queue_pos
>> after this loop completes will yield incorrect values.
> I thought about the same but was not yet completely certain. I suspect in
> fact the stream's value should indeed be updated upon dequeue, and that
> only the error case has to be taken into account before logging (if there
> was no dequeue). In this case I think we can have this one performed here
> provided we ensure other log places will not be triggered on error.
>
>>> Now I think I understand how the cntdepend works: isn't it simply a
>>> wrapping queue index that you use to measure the difference between
>>> when you queued and when you dequeued ? In this case, wouldn't something
>>> like "queue_idx" be more explicit ? At least it becomes more obvious when
>>> doing the measure that we measure a length by subtracting two indexes.
>> It is a measure of a difference, yes, but it's the difference in how
>> many items were processed off the queue. I personally wouldn't call it
>> an index, as index implies position.
> I personally see it as a position. Well exactly a sequence number in fact :-)
If you are waiting in the line at the cinema, and when you get in the
queue you're given a number of how many people the cinema has served
since it opened, that's not your position in the queue. Just as when
you're served, and are given a new number of how many people have been
served, it's still not your queue position. And then if you take the
difference between the two, that tells you how many people were served
while you were waiting. But it doesn't mean that's how many people were
in front of you when you joined the queue, as people could have been
inserted before you after you joined.

But it's just a name, and the description in the header file is
accurate, so I'll keep the name you changed it to.

>> Did you have any thoughts on the 64-bit absolute timestamp patch? It was
>> my understanding that the earlier objections were due to anticipated
>> performance and complexity issues. But the performance is better, the
>> code is simpler, and it has fewer issues (queue timeout, wrapping at extreme
>> offset values, etc.).
> Yes, I had some thoughts, but mixed ones. It's indeed simpler, but from the
> beginning there is something I don't like in it that I'm totally unable to
> explain. I really hate this because I know it feels totally unfair, but I'm
> convinced that if we take this route we'll regret it later. I'm a bit
> embarrassed, and I have been thinking for some time about what bothers me,
> but I can't quite pin it down. The root cause of the problem is changing the
> way the timers work. Everywhere we use a perfectly fine wrapping timer that
> can run continuously and ensures that bugs (if any) are quickly met and resolved.
Except that in the priority queue case, the wrapping times don't work
perfectly fine. Anything which sits in the queue long enough to wrap is
a problem.
> Here with a
> non-wrapping time, we more or less consider that given that we'll all be
> dead when it wraps, we don't care about what happens. And this "don't care"
> approach really doesn't appeal to me, because we may overlook serious problems
> that will trigger much earlier than expected, or could depend on the platform
> or anything. You know, a bit like people using "9/9/99" as an infinite time 25
> years ago, until their applications stopped on this date, or long-term
> contracts were dropped or anything.
There's a difference though, in that we don't allow the user to enter an
absolute timestamp. The configuration they enter is relative. Yes, it's
possible they enter a priority offset of X years, where after some date
X+now() wraps around, but we can mitigate that by using the time since
haproxy start instead of the time since the unix epoch. Then if we limit the
max/min offset to stay 1000 years away from the max/min 52-bit values, haproxy
would have to run continuously for 1000 years before any issues arise.
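
A minimal sketch of that mitigation (invented names, just to show the idea):

	#include <time.h>

	static struct timespec boot_ts; /* recorded once at startup */

	/* milliseconds since haproxy start; with a 52-bit key this only
	 * wraps after the process has run for roughly 143,000 years */
	static long long now_ms_since_start(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (long long)(ts.tv_sec - boot_ts.tv_sec) * 1000
		     + (ts.tv_nsec - boot_ts.tv_nsec) / 1000000;
	}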

-Patrick

From ec45b0b4a5321f0c1aa5ebbb19ecb5d1dd46776f Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 1/6] MINOR: stream: rename {srv,prx}_queue_size to *_queue_pos

The current name is misleading as the value indicates a position in the
queue while the name pretends it's a queue size. It's only the queue size
at the exact moment the element is enqueued, but this will not be true
anymore once we start inserting anywhere in the queue.
---
include/types/stream.h | 4 ++--
src/log.c | 4 ++--
src/proto_http.c | 4 ++--
src/queue.c | 4 ++--
src/stream.c | 4 ++--
5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..a3137bf31 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -105,8 +105,8 @@ struct strm_logs {
long t_connect; /* delay before the connect() to the server succeeds, -1 if never occurs */
long t_data; /* delay before the first data byte from the server ... */
unsigned long t_close; /* total stream duration */
- unsigned long srv_queue_size; /* number of streams waiting for a connect slot on this server at accept() time (in direct assignment) */
- unsigned long prx_queue_size; /* overall number of streams waiting for a connect slot on this instance at accept() time */
+ unsigned long srv_queue_pos; /* number of streams de-queued while waiting for a connection slot on this server */
+ unsigned long prx_queue_pos; /* number of streams de-queued while waiting for a connection slot on this instance */
long long bytes_in; /* number of bytes transferred from the client to the server */
long long bytes_out; /* number of bytes transferred from the server to the client */
};
diff --git a/src/log.c b/src/log.c
index b2d4367f4..6ddfbd6c3 100644
--- a/src/log.c
+++ b/src/log.c
@@ -2131,7 +2131,7 @@ int build_logline(struct stream *s, char *dst, size_t maxsize, struct list *list
break;

case LOG_FMT_SRVQUEUE: // %sq
- ret = ltoa_o(s->logs.srv_queue_size, tmplog, dst + maxsize - tmplog);
+ ret = ltoa_o(s->logs.srv_queue_pos, tmplog, dst + maxsize - tmplog);
if (ret == NULL)
goto out;
tmplog = ret;
@@ -2139,7 +2139,7 @@ int build_logline(struct stream *s, char *dst, size_t maxsize, struct list *list
break;

case LOG_FMT_BCKQUEUE: // %bq
- ret = ltoa_o(s->logs.prx_queue_size, tmplog, dst + maxsize - tmplog);
+ ret = ltoa_o(s->logs.prx_queue_pos, tmplog, dst + maxsize - tmplog);
if (ret == NULL)
goto out;
tmplog = ret;
diff --git a/src/proto_http.c b/src/proto_http.c
index 4fd5aeb15..26bebaef7 100644
--- a/src/proto_http.c
+++ b/src/proto_http.c
@@ -4373,8 +4373,8 @@ void http_end_txn_clean_session(struct stream *s)
s->logs.t_connect = -1;
s->logs.t_data = -1;
s->logs.t_close = 0;
- s->logs.prx_queue_size = 0; /* we get the number of pending conns before us */
- s->logs.srv_queue_size = 0; /* we will get this number soon */
+ s->logs.prx_queue_pos = 0; /* we get the number of pending conns before us */
+ s->logs.srv_queue_pos = 0; /* we will get this number soon */

s->logs.bytes_in = s->req.total = s->req.buf->i;
s->logs.bytes_out = s->res.total = s->res.buf->i;
diff --git a/src/queue.c b/src/queue.c
index 1c730c75c..558268c48 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -216,7 +216,7 @@ struct pendconn *pendconn_add(struct stream *strm)
p->srv = srv;
HA_SPIN_LOCK(SERVER_LOCK, &srv->lock);
srv->nbpend++;
- strm->logs.srv_queue_size += srv->nbpend;
+ strm->logs.srv_queue_pos += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
LIST_ADDQ(&srv->pendconns, &p->list);
@@ -225,7 +225,7 @@ struct pendconn *pendconn_add(struct stream *strm)
else {
HA_SPIN_LOCK(PROXY_LOCK, &px->lock);
px->nbpend++;
- strm->logs.prx_queue_size += px->nbpend;
+ strm->logs.prx_queue_pos += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
LIST_ADDQ(&px->pendconns, &p->list);
diff --git a/src/stream.c b/src/stream.c
index 9fdf6621b..f47a7cffa 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -125,8 +125,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->logs.t_data = -1;
s->logs.t_close = 0;
s->logs.bytes_in = s->logs.bytes_out = 0;
- s->logs.prx_queue_size = 0; /* we get the number of pending conns before us */
- s->logs.srv_queue_size = 0; /* we will get this number soon */
+ s->logs.prx_queue_pos = 0; /* we get the number of pending conns before us */
+ s->logs.srv_queue_pos = 0; /* we will get this number soon */

/* default logging function */
s->do_log = strm_log;
--
2.16.3

From afda6fb76464878d21f70e5a19ade9fca619b913 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 2/6] WIP/MINOR: queue: store the queue index in the stream
when enqueuing

We store the queue index in the stream and check it upon dequeuing to
figure out how many entries were processed in between. This way we'll be
able to count the elements that may later be added before ours.

WARNING! a call to pendconn_free() and/or something similar is missing
in front of a number of calls to do_log() in order to report the correct
dequeue value, especially in case of abort or timeout.
---
include/types/proxy.h | 1 +
include/types/server.h | 1 +
include/types/stream.h | 1 +
src/queue.c | 21 +++++++++++++++------
src/stream.c | 13 +++++++++++++
5 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/types/proxy.h b/include/types/proxy.h
index 16c13a1c1..a233cdb48 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -325,6 +325,7 @@ struct proxy {
struct list pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
+ unsigned int queue_idx; /* number of pending connections which have been de-queued */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
struct freq_ctr fe_req_per_sec; /* HTTP requests per second on the frontend */
struct freq_ctr fe_conn_per_sec; /* received connections per second on the frontend */
diff --git a/include/types/server.h b/include/types/server.h
index 0cd20c096..85c55a269 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -209,6 +209,7 @@ struct server {
int cur_sess; /* number of currently active sessions (including syn_sent) */
unsigned maxconn, minconn; /* max # of active sessions (0 = unlimited), min# for dynamic limit. */
int nbpend; /* number of pending connections */
+ unsigned int queue_idx; /* count of pending connections which have been de-queued */
int maxqueue; /* maximum number of pending connections allowed */
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */
diff --git a/include/types/stream.h b/include/types/stream.h
index a3137bf31..8b600e306 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,7 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ unsigned int queue_idx; /* value of proxy/server queue_idx at time of enqueue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */

diff --git a/src/queue.c b/src/queue.c
index 558268c48..43bbcf74f 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -71,10 +71,13 @@ unsigned int srv_dynamic_maxconn(const struct server *s)
*/
static void pendconn_unlink(struct pendconn *p)
{
- if (p->srv)
+ if (p->srv) {
+ p->strm->logs.srv_queue_pos += p->srv->queue_idx - p->strm->queue_idx;
p->srv->nbpend--;
- else
+ } else {
+ p->strm->logs.prx_queue_pos += p->px->queue_idx - p->strm->queue_idx;
p->px->nbpend--;
+ }
HA_ATOMIC_SUB(&p->px->totpend, 1);
LIST_DEL(&p->list);
LIST_INIT(&p->list);
@@ -101,6 +104,7 @@ static void pendconn_unlink(struct pendconn *p)
static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
{
struct pendconn *p = NULL;
+ struct pendconn *pp = NULL;
struct server *rsrv;
int remote;

@@ -118,8 +122,6 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)

ps_found:
if (srv_currently_usable(rsrv) && px->nbpend) {
- struct pendconn *pp;
-
list_for_each_entry(pp, &px->pendconns, list) {
/* If the server pendconn is older than the proxy one,
* we process the server one. */
@@ -146,6 +148,11 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
p->strm_flags |= SF_ASSIGNED;
p->srv = srv;

+ if (p != pp)
+ srv->queue_idx++;
+ else
+ px->queue_idx++;
+
HA_ATOMIC_ADD(&srv->served, 1);
HA_ATOMIC_ADD(&srv->proxy->served, 1);
if (px->lbprm.server_take_conn)
@@ -216,7 +223,7 @@ struct pendconn *pendconn_add(struct stream *strm)
p->srv = srv;
HA_SPIN_LOCK(SERVER_LOCK, &srv->lock);
srv->nbpend++;
- strm->logs.srv_queue_pos += srv->nbpend;
+ strm->queue_idx = srv->queue_idx;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
LIST_ADDQ(&srv->pendconns, &p->list);
@@ -225,7 +232,7 @@ struct pendconn *pendconn_add(struct stream *strm)
else {
HA_SPIN_LOCK(PROXY_LOCK, &px->lock);
px->nbpend++;
- strm->logs.prx_queue_pos += px->nbpend;
+ strm->queue_idx = px->queue_idx;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
LIST_ADDQ(&px->pendconns, &p->list);
@@ -374,12 +381,14 @@ void pendconn_free(struct pendconn *p)

if (p->srv) {
HA_SPIN_LOCK(SERVER_LOCK, &p->srv->lock);
+ p->strm->logs.srv_queue_pos += p->srv->queue_idx - p->strm->queue_idx;
p->srv->nbpend--;
LIST_DEL(&p->list);
HA_SPIN_UNLOCK(SERVER_LOCK, &p->srv->lock);
}
else {
HA_SPIN_LOCK(PROXY_LOCK, &p->px->lock);
+ p->strm->logs.prx_queue_pos += p->px->queue_idx - p->strm->queue_idx;
p->px->nbpend--;
LIST_DEL(&p->list);
HA_SPIN_UNLOCK(PROXY_LOCK, &p->px->lock);
diff --git a/src/stream.c b/src/stream.c
index f47a7cffa..d1a5b425d 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -488,6 +488,7 @@ void stream_process_counters(struct stream *s)
void *ptr1,*ptr2;
struct stksess *ts;
int i;
+ unsigned int queue_idx;

bytes = s->req.total - s->logs.bytes_in;
s->logs.bytes_in = s->req.total;
@@ -568,6 +569,18 @@ void stream_process_counters(struct stream *s)
stktable_touch_local(stkctr->table, stkctr_entry(stkctr), 0);
}
}
+
+ if (s->pend_pos && !LIST_ISEMPTY(&s->pend_pos->list)) {
+ if (s->pend_pos->srv) {
+ queue_idx = HA_ATOMIC_ADD(&s->pend_pos->srv->queue_idx, 0);
+ s->logs.srv_queue_pos += queue_idx - s->queue_idx;
+ s->queue_idx = queue_idx;
+ } else {
+ queue_idx = HA_ATOMIC_ADD(&s->pend_pos->px->queue_idx, 0);
+ s->logs.prx_queue_pos += queue_idx - s->queue_idx;
+ s->queue_idx = queue_idx;
+ }
+ }
}

/* This function is called with (si->state == SI_ST_CON) meaning that a
--
2.16.3

From 00310093e711a32a1b7598398eb1f84d92fd37d5 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 3/6] MINOR: queue: replace the linked list with a tree

We'll need trees to manage the queues by priorities. This change replaces
the list with a tree based on a single key. It's effectively a list but
allows us to get rid of the list management right now.
---
include/types/proxy.h | 2 +-
include/types/queue.h | 2 +-
include/types/server.h | 2 +-
src/hlua.c | 4 ++--
src/proxy.c | 2 +-
src/queue.c | 44 ++++++++++++++++++++++++++++++--------------
src/server.c | 2 +-
src/stream.c | 2 +-
8 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/types/proxy.h b/include/types/proxy.h
index a233cdb48..dcd7982c3 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -322,7 +322,7 @@ struct proxy {
int serverfin; /* timeout to apply to server half-closed connections */
} timeout;
char *id, *desc; /* proxy id (name) and description */
- struct list pendconns; /* pending connections with no server assigned yet */
+ struct eb_root pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
unsigned int queue_idx; /* number of pending connections which have been de-queued */
diff --git a/include/types/queue.h b/include/types/queue.h
index 42dbbd047..03377da69 100644
--- a/include/types/queue.h
+++ b/include/types/queue.h
@@ -35,7 +35,7 @@ struct pendconn {
struct stream *strm;
struct proxy *px;
struct server *srv; /* the server we are waiting for, may be NULL */
- struct list list; /* next pendconn */
+ struct eb32_node node;
__decl_hathreads(HA_SPINLOCK_T lock);
};

diff --git a/include/types/server.h b/include/types/server.h
index 85c55a269..756e136c4 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -214,7 +214,7 @@ struct server {
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */

- struct list pendconns; /* pending connections */
+ struct eb_root pendconns; /* pending connections */
struct list actconns; /* active connections */
struct list *priv_conns; /* private idle connections attached to stream interfaces */
struct list *idle_conns; /* sharable idle connections attached or not to a stream interface */
diff --git a/src/hlua.c b/src/hlua.c
index 2f7fe9960..9a3969886 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -7940,7 +7940,7 @@ void hlua_init(void)
socket_tcp.proxy = &socket_proxy;
socket_tcp.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_tcp.actconns);
- LIST_INIT(&socket_tcp.pendconns);
+ socket_tcp.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
@@ -7986,7 +7986,7 @@ void hlua_init(void)
socket_ssl.proxy = &socket_proxy;
socket_ssl.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_ssl.actconns);
- LIST_INIT(&socket_ssl.pendconns);
+ socket_ssl.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
diff --git a/src/proxy.c b/src/proxy.c
index c262966a2..26bdd464a 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -726,7 +726,7 @@ void init_new_proxy(struct proxy *p)
{
memset(p, 0, sizeof(struct proxy));
p->obj_type = OBJ_TYPE_PROXY;
- LIST_INIT(&p->pendconns);
+ p->pendconns = EB_ROOT;
LIST_INIT(&p->acl);
LIST_INIT(&p->http_req_rules);
LIST_INIT(&p->http_res_rules);
diff --git a/src/queue.c b/src/queue.c
index 43bbcf74f..3f2e0bdcd 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -14,6 +14,7 @@
#include <common/memory.h>
#include <common/time.h>
#include <common/hathreads.h>
+#include <eb32tree.h>

#include <proto/queue.h>
#include <proto/server.h>
@@ -79,8 +80,7 @@ static void pendconn_unlink(struct pendconn *p)
p->px->nbpend--;
}
HA_ATOMIC_SUB(&p->px->totpend, 1);
- LIST_DEL(&p->list);
- LIST_INIT(&p->list);
+ eb32_delete(&p->node);
}

/* Process the next pending connection from either a server or a proxy, and
@@ -106,6 +106,7 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
struct pendconn *p = NULL;
struct pendconn *pp = NULL;
struct server *rsrv;
+ struct eb32_node *node;
int remote;

rsrv = srv->track;
@@ -113,7 +114,10 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
rsrv = srv;

if (srv->nbpend) {
- list_for_each_entry(p, &srv->pendconns, list) {
+ for (node = eb32_first(&srv->pendconns);
+ node;
+ node = eb32_next(node)) {
+ p = eb32_entry(node, struct pendconn, node);
if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &p->lock))
goto ps_found;
}
@@ -122,7 +126,10 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)

ps_found:
if (srv_currently_usable(rsrv) && px->nbpend) {
- list_for_each_entry(pp, &px->pendconns, list) {
+ for (node = eb32_first(&px->pendconns);
+ node;
+ node = eb32_next(node)) {
+ pp = eb32_entry(node, struct pendconn, node);
/* If the server pendconn is older than the proxy one,
* we process the server one. */
if (p && !tv_islt(&pp->strm->logs.tv_request, &p->strm->logs.tv_request))
@@ -213,6 +220,7 @@ struct pendconn *pendconn_add(struct stream *strm)
srv = objt_server(strm->target);
px = strm->be;

+ p->node.key = 0;
p->srv = NULL;
p->px = px;
p->strm = strm;
@@ -226,7 +234,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->queue_idx = srv->queue_idx;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
- LIST_ADDQ(&srv->pendconns, &p->list);
+ eb32_insert(&srv->pendconns, &p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
}
else {
@@ -235,7 +243,7 @@ struct pendconn *pendconn_add(struct stream *strm)
strm->queue_idx = px->queue_idx;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
- LIST_ADDQ(&px->pendconns, &p->list);
+ eb32_insert(&px->pendconns, &p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &px->lock);
}
HA_ATOMIC_ADD(&px->totpend, 1);
@@ -248,7 +256,8 @@ struct pendconn *pendconn_add(struct stream *strm)
*/
int pendconn_redistribute(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int xferred = 0;
int remote = 0;

@@ -258,7 +267,10 @@ int pendconn_redistribute(struct server *s)
return 0;

HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
- list_for_each_entry_safe(p, pback, &s->pendconns, list) {
+ for (node = eb32_first(&s->pendconns);
+ node;
+ node = eb32_next(node)) {
p = eb32_entry(node, struct pendconn, node);
if (p->strm_flags & SF_FORCE_PRST)
continue;

@@ -287,7 +299,8 @@ int pendconn_redistribute(struct server *s)
*/
int pendconn_grab_from_px(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int maxconn, xferred = 0;
int remote = 0;

@@ -296,7 +309,10 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- list_for_each_entry_safe(p, pback, &s->proxy->pendconns, list) {
+ for (node = eb32_first(&s->proxy->pendconns);
+ node;
+ node = eb32_next(node)) {
p = eb32_entry(node, struct pendconn, node);
if (s->maxconn && s->served + xferred >= maxconn)
break;

@@ -344,7 +360,7 @@ int pendconn_dequeue(struct stream *strm)

/* the pendconn is still linked to the server/proxy queue, so unlock it
* and go away. */
- if (!LIST_ISEMPTY(&p->list)) {
+ if (p->node.node.leaf_p) {
HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
return 1;
}
@@ -376,21 +392,21 @@ void pendconn_free(struct pendconn *p)
HA_SPIN_LOCK(PENDCONN_LOCK, &p->lock);

/* The pendconn was already unlinked, just release it. */
- if (LIST_ISEMPTY(&p->list))
+ if (!p->node.node.leaf_p)
goto release;

if (p->srv) {
HA_SPIN_LOCK(SERVER_LOCK, &p->srv->lock);
p->strm->logs.srv_queue_pos += p->srv->queue_idx - p->strm->queue_idx;
p->srv->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(SERVER_LOCK, &p->srv->lock);
}
else {
HA_SPIN_LOCK(PROXY_LOCK, &p->px->lock);
p->strm->logs.prx_queue_pos += p->px->queue_idx - p->strm->queue_idx;
p->px->nbpend--;
- LIST_DEL(&p->list);
+ eb32_delete(&p->node);
HA_SPIN_UNLOCK(PROXY_LOCK, &p->px->lock);
}
HA_ATOMIC_SUB(&p->px->totpend, 1);
diff --git a/src/server.c b/src/server.c
index 277d1405e..36a7ea1d2 100644
--- a/src/server.c
+++ b/src/server.c
@@ -1575,7 +1575,7 @@ static struct server *new_server(struct proxy *proxy)
srv->obj_type = OBJ_TYPE_SERVER;
srv->proxy = proxy;
LIST_INIT(&srv->actconns);
- LIST_INIT(&srv->pendconns);
+ srv->pendconns = EB_ROOT;

if ((srv->priv_conns = calloc(global.nbthread, sizeof(*srv->priv_conns))) == NULL)
goto free_srv;
diff --git a/src/stream.c b/src/stream.c
index d1a5b425d..f774406dd 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -570,7 +570,7 @@ void stream_process_counters(struct stream *s)
}
}

- if (s->pend_pos && !LIST_ISEMPTY(&s->pend_pos->list)) {
+ if (s->pend_pos && s->pend_pos->node.node.leaf_p) {
if (s->pend_pos->srv) {
queue_idx = HA_ATOMIC_ADD(&s->pend_pos->srv->queue_idx, 0);
s->logs.srv_queue_pos += queue_idx - s->queue_idx;
--
2.16.3

From 0a47fc71c1a2fc029b41bbdf1b7c9329c294ef26 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 4/6] MEDIUM: add set-priority-class and set-priority-offset

This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content. At this point they are not used
yet, which is the purpose of the next commit, but all the logic to
set and clear the values is there.
---
doc/configuration.txt | 38 ++++++++++++++
doc/lua-api/index.rst | 18 +++++++
include/proto/queue.h | 19 +++++++
include/types/stream.h | 3 ++
src/hlua.c | 53 +++++++++++++------
src/queue.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++
src/stream.c | 2 +
7 files changed, 255 insertions(+), 15 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index e901d7eea..0a0d9a038 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -3917,6 +3917,7 @@ http-request { allow | auth [realm <realm>] | redirect <rule> | reject |
replace-value <name> <match-regex> <replace-fmt> |
set-method <fmt> | set-path <fmt> | set-query <fmt> |
set-uri <fmt> | set-tos <tos> | set-mark <mark> |
+ set-priority-class <expr> | set-priority-offset <expr> |
add-acl(<file name>) <key fmt> |
del-acl(<file name>) <key fmt> |
del-map(<file name>) <key fmt> |
@@ -4113,6 +4114,24 @@ http-request { allow | auth [realm <realm>] | redirect <rule> | reject |
downloads). This works on Linux kernels 2.6.32 and above and requires
admin privileges.

+ - "set-priority-class" is used to set the queue priority class of the
+ current request. The value must be a sample expression which converts to
+ an integer in the range -2047..2047. Results outside this range will be
+ truncated. The priority class determines the order in which queued
+ requests are processed. Lower values have higher priority.
+
+ - "set-priority-offset" is used to set the queue priority timestamp offset
+ of the current request. The value must be a sample expression which
+ converts to an integer in the range -524287..524287. Results outside this
+ range will be truncated. When a request is queued, it is ordered first by
+ the priority class, then by the current timestamp adjusted by the given
+ offset in milliseconds. Lower values have higher priority.
+ Note that the resulting timestamp is only tracked with enough precision
+ for 524,287ms (8m44s287ms). If the request is queued long enough that the
+ adjusted timestamp exceeds this value, it will be misidentified as
+ highest priority. Thus it is important to set "timeout queue" to a value
+ which, when combined with the offset, does not exceed this limit.
+
- "add-acl" is used to add a new entry into an ACL. The ACL must be loaded
from a file (even a dummy empty file). The file name of the ACL to be
updated is passed between parentheses. It takes one argument: <key fmt>,
@@ -9452,6 +9471,7 @@ tcp-request content <action> [{if | unless} <condition>]
- accept : the request is accepted
- reject : the request is rejected and the connection is closed
- capture : the specified sample expression is captured
+ - set-priority-class <expr> | set-priority-offset <expr>
- { track-sc0 | track-sc1 | track-sc2 } <key> [table <table>]
- sc-inc-gpc0(<sc-id>)
- sc-inc-gpc1(<sc-id>)
@@ -9513,6 +9533,24 @@ tcp-request content <action> [{if | unless} <condition>]
The "unset-var" is used to unset a variable. See above for details about
<var-name>.

+ The "set-priority-class" is used to set the queue priority class of the
+ current request. The value must be a sample expression which converts to an
+ integer in the range -2047..2047. Results outside this range will be
+ truncated. The priority class determines the order in which queued requests
+ are processed. Lower values have higher priority.
+
+ The "set-priority-offset" is used to set the queue priority timestamp offset
+ of the current request. The value must be a sample expression which converts
+ to an integer in the range -524287..524287. Results outside this range will be
+ truncated. When a request is queued, it is ordered first by the priority
+ class, then by the current timestamp adjusted by the given offset in
+ milliseconds. Lower values have higher priority.
+ Note that the resulting timestamp is only tracked with enough precision for
+ 524,287ms (8m44s287ms). If the request is queued long enough that the
+ adjusted timestamp exceeds this value, it will be misidentified as highest
+ priority. Thus it is important to set "timeout queue" to a value which, when
+ combined with the offset, does not exceed this limit.
+
The "send-spoe-group" is used to trigger sending of a group of SPOE
messages. To do so, the SPOE engine used to send messages must be defined, as
well as the SPOE group to send. Of course, the SPOE engine must refer to an
diff --git a/doc/lua-api/index.rst b/doc/lua-api/index.rst
index ee9dab55c..0c79766eb 100644
--- a/doc/lua-api/index.rst
+++ b/doc/lua-api/index.rst
@@ -1769,6 +1769,24 @@ TXN class
:param class_txn txn: The class txn object containing the data.
:param integer mark: The mark value.

+.. js:function:: TXN.set_priority_class(txn, prio)
+
+ This function adjusts the priority class of the transaction. The value should
+ be within the range -2047..2047. Values outside this range will be
+ truncated.
+
+ See the HAProxy configuration.txt file keyword "http-request" action
+ "set-priority-class" for details.
+
+.. js:function:: TXN.set_priority_offset(txn, prio)
+
+ This function adjusts the priority offset of the transaction. The value
+ should be within the range -524287..524287. Values outside this range will be
+ truncated.
+
+ See the HAProxy configuration.txt file keyword "http-request" action
+ "set-priority-offset" for details.
+
.. _socket_class:

Socket class
diff --git a/include/proto/queue.h b/include/proto/queue.h
index 2d4773a09..27373b960 100644
--- a/include/proto/queue.h
+++ b/include/proto/queue.h
@@ -59,6 +59,25 @@ static inline int may_dequeue_tasks(const struct server *s, const struct proxy *
(!s->maxconn || s->cur_sess < srv_dynamic_maxconn(s)));
}

+static inline int queue_limit_class(int class)
+{
+ if (class < -0x7ff)
+ return -0x7ff;
+ if (class > 0x7ff)
+ return 0x7ff;
+ return class;
+}
+
+static inline int queue_limit_offset(int offset)
+{
+ if (offset < -0x7ffff)
+ return -0x7ffff;
+ if (offset > 0x7ffff)
+ return 0x7ffff;
+ return offset;
+}
+
+
#endif /* _PROTO_QUEUE_H */

/*
diff --git a/include/types/stream.h b/include/types/stream.h
index 8b600e306..ff2b1aeb5 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -125,6 +125,7 @@ struct stream {

struct server *srv_conn; /* stream already has a slot on a server and is not in queue */
struct pendconn *pend_pos; /* if not NULL, points to the pending position in the pending queue */
+ int32_t priority_offset; /* priority offset of the stream for the pending queue */
unsigned int queue_idx; /* value of proxy/server queue_idx at time of enqueue */

struct http_txn *txn; /* current HTTP transaction being processed. Should become a list. */
@@ -132,6 +133,8 @@ struct stream {
struct task *task; /* the task associated with this stream */
unsigned short pending_events; /* the pending events not yet processed by the stream.
* This is a bit field of TASK_WOKEN_* */
+ int16_t priority_class; /* priority class of the stream for the pending queue */
+ /* still 32-bit hole here */

struct list list; /* position in global streams list */
struct list by_srv; /* position in server stream list */
diff --git a/src/hlua.c b/src/hlua.c
index 9a3969886..d08968379 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -44,6 +44,7 @@
#include <proto/hlua_fcn.h>
#include <proto/map.h>
#include <proto/obj_type.h>
+#include <proto/queue.h>
#include <proto/pattern.h>
#include <proto/payload.h>
#include <proto/proto_http.h>
@@ -5321,6 +5322,26 @@ __LJMP static int hlua_txn_set_mark(lua_State *L)
return 0;
}

+__LJMP static int hlua_txn_set_priority_class(lua_State *L)
+{
+ struct hlua_txn *htxn;
+
+ MAY_LJMP(check_args(L, 2, "set_priority_class"));
+ htxn = MAY_LJMP(hlua_checktxn(L, 1));
+ htxn->s->priority_class = queue_limit_class(MAY_LJMP(luaL_checkinteger(L, 2)));
+ return 0;
+}
+
+__LJMP static int hlua_txn_set_priority_offset(lua_State *L)
+{
+ struct hlua_txn *htxn;
+
+ MAY_LJMP(check_args(L, 2, "set_priority_offset"));
+ htxn = MAY_LJMP(hlua_checktxn(L, 1));
+ htxn->s->priority_offset = queue_limit_offset(MAY_LJMP(luaL_checkinteger(L, 2)));
+ return 0;
+}
+
/* This function is an Lua binding that send pending data
* to the client, and close the stream interface.
*/
@@ -7864,21 +7885,23 @@ void hlua_init(void)
lua_newtable(gL.T);

/* Register Lua functions. */
- hlua_class_function(gL.T, "set_priv", hlua_set_priv);
- hlua_class_function(gL.T, "get_priv", hlua_get_priv);
- hlua_class_function(gL.T, "set_var", hlua_set_var);
- hlua_class_function(gL.T, "unset_var", hlua_unset_var);
- hlua_class_function(gL.T, "get_var", hlua_get_var);
- hlua_class_function(gL.T, "done", hlua_txn_done);
- hlua_class_function(gL.T, "set_loglevel",hlua_txn_set_loglevel);
- hlua_class_function(gL.T, "set_tos", hlua_txn_set_tos);
- hlua_class_function(gL.T, "set_mark", hlua_txn_set_mark);
- hlua_class_function(gL.T, "deflog", hlua_txn_deflog);
- hlua_class_function(gL.T, "log", hlua_txn_log);
- hlua_class_function(gL.T, "Debug", hlua_txn_log_debug);
- hlua_class_function(gL.T, "Info", hlua_txn_log_info);
- hlua_class_function(gL.T, "Warning", hlua_txn_log_warning);
- hlua_class_function(gL.T, "Alert", hlua_txn_log_alert);
+ hlua_class_function(gL.T, "set_priv", hlua_set_priv);
+ hlua_class_function(gL.T, "get_priv", hlua_get_priv);
+ hlua_class_function(gL.T, "set_var", hlua_set_var);
+ hlua_class_function(gL.T, "unset_var", hlua_unset_var);
+ hlua_class_function(gL.T, "get_var", hlua_get_var);
+ hlua_class_function(gL.T, "done", hlua_txn_done);
+ hlua_class_function(gL.T, "set_loglevel", hlua_txn_set_loglevel);
+ hlua_class_function(gL.T, "set_tos", hlua_txn_set_tos);
+ hlua_class_function(gL.T, "set_mark", hlua_txn_set_mark);
+ hlua_class_function(gL.T, "set_priority_class", hlua_txn_set_priority_class);
+ hlua_class_function(gL.T, "set_priority_offset", hlua_txn_set_priority_offset);
+ hlua_class_function(gL.T, "deflog", hlua_txn_deflog);
+ hlua_class_function(gL.T, "log", hlua_txn_log);
+ hlua_class_function(gL.T, "Debug", hlua_txn_log_debug);
+ hlua_class_function(gL.T, "Info", hlua_txn_log_info);
+ hlua_class_function(gL.T, "Warning", hlua_txn_log_warning);
+ hlua_class_function(gL.T, "Alert", hlua_txn_log_alert);

lua_rawset(gL.T, -3);

diff --git a/src/queue.c b/src/queue.c
index 3f2e0bdcd..a92cea4b6 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -16,11 +16,14 @@
#include <common/hathreads.h>
#include <eb32tree.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -417,6 +420,140 @@ void pendconn_free(struct pendconn *p)
pool_free(pool_head_pendconn, p);
}

+static enum act_return action_set_priority_class(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ s->priority_class = queue_limit_class(smp->data.u.sint);
+ return ACT_RET_CONT;
+}
+
+static enum act_return action_set_priority_offset(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ s->priority_offset = queue_limit_offset(smp->data.u.sint);
+
+ return ACT_RET_CONT;
+}
+
+static enum act_parse_ret parse_set_priority_class(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_class;
+ return ACT_RET_PRS_OK;
+}
+
+static enum act_parse_ret parse_set_priority_offset(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_offset;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static int
+smp_fetch_priority_class(const struct arg *args, struct sample *smp, const char *kw, void *private)
+{
+ if (!smp->strm)
+ return 0;
+
+ smp->data.type = SMP_T_SINT;
+ smp->data.u.sint = smp->strm->priority_class;
+
+ return 1;
+}
+
+static int
+smp_fetch_priority_offset(const struct arg *args, struct sample *smp, const char *kw, void *private)
+{
+ if (!smp->strm)
+ return 0;
+
+ smp->data.type = SMP_T_SINT;
+ smp->data.u.sint = smp->strm->priority_offset;
+
+ return 1;
+}
+
+
+static struct sample_fetch_kw_list smp_kws = {ILH, {
+ { "prio_class", smp_fetch_priority_class, 0, NULL, SMP_T_SINT, SMP_USE_INTRN, },
+ { "prio_offset", smp_fetch_priority_offset, 0, NULL, SMP_T_SINT, SMP_USE_INTRN, },
+ { /* END */},
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+ sample_register_fetches(&smp_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/stream.c b/src/stream.c
index f774406dd..6771302bb 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -220,6 +220,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority_class = 0;
+ s->priority_offset = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3

From 8d9a9aaab2f617cf31de9528dcc08995d72612c6 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 5/6] MEDIUM: queue: adjust position based on priority-class
and priority-offset

The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first.
---
src/queue.c | 157 +++++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 119 insertions(+), 38 deletions(-)

diff --git a/src/queue.c b/src/queue.c
index a92cea4b6..23569abac 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -26,6 +26,11 @@
#include <proto/tcp_rules.h>


+#define NOW_OFFSET_BOUNDARY() ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xfffff)
+#define KEY_CLASS(key) (key & 0xfff00000)
+#define KEY_OFFSET(key) (key & 0xfffff)
+#define KEY_CLASS_OFFSET_BOUNDARY(key) (KEY_CLASS(key) | NOW_OFFSET_BOUNDARY())
+
struct pool_head *pool_head_pendconn;

/* perform minimal intializations, report 0 in case of error, 1 if OK. */
@@ -86,6 +91,65 @@ static void pendconn_unlink(struct pendconn *p)
eb32_delete(&p->node);
}

+/* Retrieve the next pending connection.
+ *
+ * See pendconn_add for an explanation of the key & queue behavior.
+ *
+ * This function handles all the cases where due to the timestamp wrapping
+ * the first node in the tree is not the highest priority.
+ */
+static struct pendconn *pendconn_next(struct eb_root *pendconns, struct pendconn *curconn)
+{
+ struct eb32_node *node, *node2 = NULL;
+ u32 min, max;
+
+ /* min and max are the lower and upper bounds of the priority offset within
+ * the current class to search for the next node. It is possible that
+ * (max <= min) if max wraps around.
+ * min is inclusive. max is exclusive.
+ */
+ if (curconn == NULL) {
+ min = NOW_OFFSET_BOUNDARY();
+ max = min;
+ node = eb32_lookup_ge(pendconns, min);
+ } else {
+ min = curconn->node.key;
+ max = KEY_CLASS_OFFSET_BOUNDARY(min);
+ node = eb32_next(&curconn->node);
+ }
+
+ if (node) {
+ if (node->key < max || (max <= min && KEY_CLASS(node->key) == KEY_CLASS(min)))
+ return eb32_entry(node, struct pendconn, node);
+ if (KEY_CLASS(node->key) != KEY_CLASS(min))
+ node2 = node;
+ if (max > min)
+ goto class_next;
+ }
+
+ if (max <= min)
+ node = eb32_lookup_ge(pendconns, KEY_CLASS(min));
+ if (!node)
+ return NULL;
+ if (node->key < max && node->key < min)
+ return eb32_entry(node, struct pendconn, node);
+
+class_next:
+ if (node2) {
+ min = KEY_CLASS_OFFSET_BOUNDARY(node2->key);
+ if (node2->key >= min)
+ return eb32_entry(node2, struct pendconn, node);
+ } else
+ min = KEY_CLASS_OFFSET_BOUNDARY(min) + 0x100000;
+ node = eb32_lookup_ge(pendconns, min);
+ if (node && KEY_CLASS(node->key) == KEY_CLASS(min))
+ return eb32_entry(node, struct pendconn, node);
+ if (node2)
+ return eb32_entry(node2, struct pendconn, node);
+
+ return NULL;
+}
+
/* Process the next pending connection from either a server or a proxy, and
* returns a strictly positive value on success (see below). If no pending
* connection is found, 0 is returned. Note that neither <srv> nor <px> may be
@@ -109,7 +173,7 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
struct pendconn *p = NULL;
struct pendconn *pp = NULL;
struct server *rsrv;
- struct eb32_node *node;
+ u32 pkey, ppkey;
int remote;

rsrv = srv->track;
@@ -117,42 +181,52 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
rsrv = srv;

if (srv->nbpend) {
- for (node = eb32_first(&srv->pendconns);
- node;
- node = eb32_next(node)) {
- p = eb32_entry(node, struct pendconn, node);
+ for (p = pendconn_next(&srv->pendconns, NULL);
+ p;
+ p = pendconn_next(&srv->pendconns, p))
if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &p->lock))
- goto ps_found;
- }
- p = NULL;
+ break;
}
-
- ps_found:
- if (srv_currently_usable(rsrv) && px->nbpend) {
- for (node = eb32_first(&px->pendconns);
- node;
- node = eb32_next(node)) {
- pp = eb32_entry(node, struct pendconn, node);
- /* If the server pendconn is older than the proxy one,
- * we process the server one. */
- if (p && !tv_islt(&pp->strm->logs.tv_request, &p->strm->logs.tv_request))
- goto pendconn_found;
-
- if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock)) {
- /* Let's switch from the server pendconn to the
- * proxy pendconn. Don't forget to unlock the
- * server pendconn, if any. */
- if (p)
- HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
- p = pp;
- goto pendconn_found;
- }
- }
+ if (px->nbpend) {
+ for (pp = pendconn_next(&px->pendconns, NULL);
+ pp;
+ pp = pendconn_next(&px->pendconns, pp))
+ if (!HA_SPIN_TRYLOCK(PENDCONN_LOCK, &pp->lock))
+ break;
}

- if (!p)
+ if (!p && !pp)
return 0;

+ if (p && !pp)
+ goto pendconn_found;
+ if (pp && !p) {
+ p = pp;
+ goto pendconn_found;
+ }
+ if (KEY_CLASS(p->node.key) < KEY_CLASS(pp->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ goto pendconn_found;
+ }
+ if (KEY_CLASS(pp->node.key) < KEY_CLASS(p->node.key)) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ goto pendconn_found;
+ }
+
+ pkey = KEY_OFFSET(p->node.key);
+ ppkey = KEY_OFFSET(pp->node.key);
+ if (pkey < NOW_OFFSET_BOUNDARY())
+ pkey += 0x100000;
+ if (ppkey < NOW_OFFSET_BOUNDARY())
+ ppkey += 0x100000;
+ if (pkey <= ppkey) {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &pp->lock);
+ } else {
+ HA_SPIN_UNLOCK(PENDCONN_LOCK, &p->lock);
+ p = pp;
+ }
+
pendconn_found:
pendconn_unlink(p);
p->strm_flags |= SF_ASSIGNED;
@@ -201,12 +275,20 @@ void process_srv_queue(struct server *s)
THREAD_WANT_SYNC();
}

-/* Adds the stream <strm> to the pending connection list of server <strm>->srv
+/* Adds the stream <strm> to the pending connection queue of server <strm>->srv
* or to the one of <strm>->proxy if srv is NULL. All counters and back pointers
* are updated accordingly. Returns NULL if no memory is available, otherwise the
* pendconn itself. If the stream was already marked as served, its flag is
* cleared. It is illegal to call this function with a non-NULL strm->srv_conn.
*
+ * The queue is sorted by the composition of the priority_class, and the current
+ * timestamp offset by strm->priority_offset. The timestamp is in milliseconds
+ * and truncated to 20 bits, so will wrap every 17m28s575ms.
+ * The offset can be positive or negative, and an offset of 0 puts it in the
+ * middle of this range (~ 8 min). Note that this also means if the adjusted
+ * timestamp wraps around, the request will be misinterpreted as being of
+ * the highest priority for that priority class.
+ *
* This function must be called by the stream itself, so in the context of
* process_stream.
*/
@@ -223,7 +305,8 @@ struct pendconn *pendconn_add(struct stream *strm)
srv = objt_server(strm->target);
px = strm->be;

- p->node.key = 0;
+ p->node.key = ((u32)(strm->priority_class + 0x7ff) << 20) |
+ ((u32)(now_ms + strm->priority_offset) & 0xfffff);
p->srv = NULL;
p->px = px;
p->strm = strm;
@@ -303,7 +386,6 @@ int pendconn_redistribute(struct server *s)
int pendconn_grab_from_px(struct server *s)
{
struct pendconn *p;
- struct eb32_node *node;
int maxconn, xferred = 0;
int remote = 0;

@@ -312,10 +394,9 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- for (node = eb32_first(&s->proxy->pendconns);
- node;
- node = eb32_next(node)) {
- p = eb32_entry(node, struct pendconn, node);
+ for (p = pendconn_next(&s->proxy->pendconns, NULL);
+ p;
+ p = pendconn_next(&s->proxy->pendconns, p)) {
if (s->maxconn && s->served + xferred >= maxconn)
break;

--
2.16.3

From 7393339d8cd8028ed7881822cbbe0881a4564e27 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 6/6] WIP/config: warning on too high timeout queue

The purpose is to emit a warning if the user configures a larger queue
timeout than the highest we support in the tree.

[WT: I don't think it's a good idea as some people do use very large
timeouts in TCP. Maybe our values are too low in the end and we
should refine them or change the tree size]
---
src/cfgparse.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/src/cfgparse.c b/src/cfgparse.c
index f3c2be499..3d8ba1866 100644
--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -8244,6 +8244,21 @@ out_uri_auth_compat:
}
}

+ if ((curproxy->mode == PR_MODE_TCP || curproxy->mode == PR_MODE_HTTP) &&
+ (curproxy->cap & PR_CAP_BE) && (curproxy->srv) &&
+ (!curproxy->timeout.queue || curproxy->timeout.queue > (((TIMER_LOOK_BACK >> 12) & 0xfffff) / 2))) {
+ // Note that this warning isn't comprehensive.
+ // If the user specifies a set-priority-offset greater than 'timeout queue',
+ // wrapping may occur and requests may get de-queued out of order. But
+ // logging this every single time might be too noisy.
+ ha_warning("config : excessive timeout queue for backend '%s'.\n"
+ " | The 'timeout queue' setting is either missing, or exceeds the maximum\n"
+ " | priority offset. If a request sits in the queue for longer than the maximum\n"
+ " | priority offset, it may get de-queued out of order.\n",
+ curproxy->id);
+ err_code |= ERR_WARN;
+ }
+
if ((curproxy->options2 & PR_O2_CHK_ANY) == PR_O2_SSL3_CHK) {
curproxy->check_len = sizeof(sslv3_client_hello_pkt) - 1;
curproxy->check_req = malloc(curproxy->check_len);
--
2.16.3
Re-adding the mailing list.

On 2018/8/6 22:37, Willy Tarreau wrote:
> Hi Patrick,
>>> I *think* that the change made to stream_process_counters() is not needed,
>>> because stream_process_counters() is normally used to keep the stats up
>>> to date so that when refreshing the stats page they don't make huge jumps.
>>> If you faced any corner case that this part managed to address, I'm
>>> interested in taking a look.
>> If I recall correctly, the scenario this was meant to address was when a
>> connection was removed from the queue ungracefully (e.g. a timeout).
>> In process_stream(), s->do_log(s) is called, which needs the counters
>> updated. However in this flow, the counters wouldn't get updated until
>> stream_free(s) a few lines later. stream_process_counters() seemed to be
>> made to address this scenario, so I put the code there.
>> I'll review this again though.
> Now I remember I thought about this case initially and I think I figured
> a few rare cases where it could possibly not be enough (maybe for aborts
> processed in HTTP keep-alive but I'm not sure and I prefer not to say too
> much bullshit). I remember sometimes getting an incorrect queue length at
> log time. Thus I have added a few calls to pendconn_dequeue() before all
> calls to do_log() which in my opinion are more robust for the long term.
> Thus it probably means that we can safely get rid of the calls in
> stream_process_counters(), which would be nice.
>
>>> You will notice that the code was significantly simplified thanks to the
>>> most painful part you had to deal with : the possibility for the trylock
>>> to fail. Now there is no trylock anymore and it cannot fail, so the very
>>> complex pendconn_next() is not needed anymore and I could replace it with
>>> a much simpler pendconn_first() which simply gives you the first element
>>> that is expected to be dequeued from a queue.
>> Found one spot in the code that looks like it still falls into the scenario
>> where pendconn_next() is needed. This bit:
>>
>> int pendconn_redistribute(struct server *s)
>> ...
>> while ((p = pendconn_first(&s->pendconns))) {
>> if (p->strm_flags & SF_FORCE_PRST)
>> continue;
>>
>> This is going to infinitely loop.
> You're absolutely right, I didn't even notice this one! I think I was
> a bit too fast in replacing all loops with pendconn_first()!
>
>> An alternative to pendconn_next() would be, on SF_FORCE_PRST, to build a
>> new tree (or some sort of temporary list) and pop the entries off the
>> current tree.
> In fact I don't think it's an issue, because here we need to remove all
> of them from the tree except the ones with the flag, so we don't even
> need to have something as complex as pendconn_next() to pop them up in
> order. Instead we can safely walk over the whole tree and only ignore
> SF_FORCE_PRST as the original code used to do, but using eb32_next().
> In fact my use of eb32_first() in the "while" loop in the previous
> patch was already wrong due to the "continue". We should simply have :
>
> for (node = eb32_first(&s->pendconns); node; node = eb32_next(node)) {
> p = eb32_entry(node, struct pendconn, node);
> if (p->strm_flags & SF_FORCE_PRST)
> continue;
> ...
> }
>
> If you want I'll update the patches once I'm at the office. Thanks for
> spotting this one.
>
> Willy
So I went and did that. It looks to work fine.
I also went and removed the queue position counter code from
stream_process_counters(), and the logging still appears to work fine
(but I could have easily missed some potential scenarios).

In regards to the idea you added to the commit message on the queue index
> It could even be further improved to avoid manipulating the stream
> from the queue :
> - queue_idx = initial position when enqueuing
> - queue_idx = measured position when dequeuing

Because the stream can pass through multiple queues, we'd have to make
sure that every time we de-queue, the stream code pulls the value and
increments itself. However, looking at all the various places
pendconn_unlink() gets called, I think it would be difficult for the
stream code to know when the value needs to be pulled.

Anyway, attached is the latest patch set for review again.

-Patrick
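
For illustration, here is a minimal configuration sketch of how the new
actions might be used for the DOS-mitigation case discussed earlier,
assuming the patch set applies as posted (the cookie name, ACL file and
threshold values below are hypothetical):

    frontend www
        bind :80
        default_backend app

    backend app
        timeout queue 30s
        # valid application sessions jump ahead of anonymous traffic
        http-request set-priority-class int(-10) if { req.cook(APPSESS) -m found }
        # traffic from known cloud provider ranges goes to the back of the queue
        http-request set-priority-class int(10) if { src -f /etc/haproxy/cloud-ranges.lst }
        # within a class, deprioritize heavy report pages by two seconds
        http-request set-priority-offset int(2000) if { path_beg /reports }
        server app1 192.0.2.10:8080 maxconn 100

Keeping "timeout queue" well below the 8m44s offset budget ensures the
adjusted timestamps cannot wrap while a request is still queued.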
From e4acba5847fb11981c72af8430f6e6846211954f Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 6/7] MEDIUM: queue: adjust position based on priority-class
and priority-offset

The priority values are used when connections are queued to determine
which connections should be served first. The lowest priority class is
served first. When multiple requests from the same class are found, the
earliest (according to queue_time + offset) is served first. The queue
offsets can span over roughly 17 minutes after which the offsets will
wrap around. This allows up to 8 minutes spent in the queue with no
reordering.
---
src/queue.c | 105 +++++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 79 insertions(+), 26 deletions(-)

diff --git a/src/queue.c b/src/queue.c
index 12020a084..3e9371bfa 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -85,6 +85,12 @@
#include <proto/tcp_rules.h>


+#define NOW_OFFSET_BOUNDARY() ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xfffff)
+#define KEY_CLASS(key) ((u32)key & 0xfff00000)
+#define KEY_OFFSET(key) ((u32)key & 0x000fffff)
+#define KEY_CLASS_OFFSET_BOUNDARY(key) (KEY_CLASS(key) | NOW_OFFSET_BOUNDARY())
+#define MAKE_KEY(class, offset) (((u32)((class) + 0x7ff) << 20) | ((u32)(now_ms + (offset)) & 0xfffff))
+
struct pool_head *pool_head_pendconn;

/* perform minimal intializations, report 0 in case of error, 1 if OK. */
@@ -184,6 +190,33 @@ void pendconn_unlink(struct pendconn *p)
pendconn_queue_unlock(p);
}

+/* Retrieve the first pendconn from tree <pendconns>. Classes are always
+ * considered first, then the time offset. The time does wrap, so the
+ * lookup is performed twice, once to retrieve the first class and a second
+ * time to retrieve the earliest time in this class.
+ */
+static struct pendconn *pendconn_first(struct eb_root *pendconns)
+{
+ struct eb32_node *node, *node2 = NULL;
+ u32 key;
+
+ node = eb32_first(pendconns);
+ if (!node)
+ return NULL;
+
+ key = KEY_CLASS_OFFSET_BOUNDARY(node->key);
+ node2 = eb32_lookup_ge(pendconns, key);
+
+ if (!node2 ||
+ KEY_CLASS(node2->key) != KEY_CLASS(node->key)) {
+ /* no other key in the tree, or in this class */
+ return eb32_entry(node, struct pendconn, node);
+ }
+
+ /* found a better key */
+ return eb32_entry(node2, struct pendconn, node);
+}
+
/* Process the next pending connection from either a server or a proxy, and
* returns a strictly positive value on success (see below). If no pending
* connection is found, 0 is returned. Note that neither <srv> nor <px> may be
@@ -207,40 +240,54 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
struct pendconn *p = NULL;
struct pendconn *pp = NULL;
struct server *rsrv;
- struct eb32_node *node;
+ u32 pkey, ppkey;

rsrv = srv->track;
if (!rsrv)
rsrv = srv;

p = NULL;
- if (srv->nbpend) {
- node = eb32_first(&srv->pendconns);
- p = eb32_entry(node, struct pendconn, node);
- }
+ if (srv->nbpend)
+ p = pendconn_first(&srv->pendconns);

+ pp = NULL;
if (srv_currently_usable(rsrv) && px->nbpend &&
(!(srv->flags & SRV_F_BACKUP) ||
(!px->srv_act &&
- (srv == px->lbprm.fbck || (px->options & PR_O_USE_ALL_BK))))) {
- node = eb32_first(&px->pendconns);
- pp = eb32_entry(node, struct pendconn, node);
-
- /* If the server pendconn is older than the proxy one,
- * we process the server one.
- */
- if (p && !tv_islt(&pp->strm->logs.tv_request, &p->strm->logs.tv_request))
- goto pendconn_found;
-
- /* Let's switch from the server pendconn to the proxy pendconn */
- p = pp;
- goto pendconn_found;
- }
+ (srv == px->lbprm.fbck || (px->options & PR_O_USE_ALL_BK)))))
+ pp = pendconn_first(&px->pendconns);

- if (!p)
+ if (!p && !pp)
return 0;

- pendconn_found:
+ if (p && !pp)
+ goto use_p;
+
+ if (pp && !p)
+ goto use_pp;
+
+ if (KEY_CLASS(p->node.key) < KEY_CLASS(pp->node.key))
+ goto use_p;
+
+ if (KEY_CLASS(pp->node.key) < KEY_CLASS(p->node.key))
+ goto use_pp;
+
+ pkey = KEY_OFFSET(p->node.key);
+ ppkey = KEY_OFFSET(pp->node.key);
+
+ if (pkey < NOW_OFFSET_BOUNDARY())
+ pkey += 0x100000; // key in the future
+
+ if (ppkey < NOW_OFFSET_BOUNDARY())
+ ppkey += 0x100000; // key in the future
+
+ if (pkey <= ppkey)
+ goto use_p;
+
+ use_pp:
+ /* Let's switch from the server pendconn to the proxy pendconn */
+ p = pp;
+ use_p:
__pendconn_unlink(p);
p->strm_flags |= SF_ASSIGNED;
p->target = srv;
@@ -281,7 +328,7 @@ void process_srv_queue(struct server *s)
HA_SPIN_UNLOCK(PROXY_LOCK, &p->lock);
}

-/* Adds the stream <strm> to the pending connection list of server <strm>->srv
+/* Adds the stream <strm> to the pending connection queue of server <strm>->srv
* or to the one of <strm>->proxy if srv is NULL. All counters and back pointers
* are updated accordingly. Returns NULL if no memory is available, otherwise the
* pendconn itself. If the stream was already marked as served, its flag is
@@ -289,6 +336,14 @@ void process_srv_queue(struct server *s)
* The stream's queue position is counted with an offset of -1 because we want
* to make sure that being at the first position in the queue reports 1.
*
+ * The queue is sorted by the composition of the priority_class, and the current
+ * timestamp offset by strm->priority_offset. The timestamp is in milliseconds
+ * and truncated to 20 bits, so will wrap every 17m28s575ms.
+ * The offset can be positive or negative, and an offset of 0 puts it in the
+ * middle of this range (~ 8 min). Note that this also means if the adjusted
+ * timestamp wraps around, the request will be misinterpreted as being of
+ * the highest priority for that priority class.
+ *
* This function must be called by the stream itself, so in the context of
* process_stream.
*/
@@ -310,7 +365,7 @@ struct pendconn *pendconn_add(struct stream *strm)
px = strm->be;
p->target = NULL;
p->srv = srv;
- p->node.key = 0;
+ p->node.key = MAKE_KEY(strm->priority_class, strm->priority_offset);
p->px = px;
p->strm = strm;
p->strm_flags = strm->flags;
@@ -377,7 +432,6 @@ int pendconn_redistribute(struct server *s)
int pendconn_grab_from_px(struct server *s)
{
struct pendconn *p;
- struct eb32_node *node;
int maxconn, xferred = 0;

if (!srv_currently_usable(s))
@@ -394,8 +448,7 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- while ((node = eb32_first(&s->proxy->pendconns))) {
- p = eb32_entry(node, struct pendconn, node);
+ while ((p = pendconn_first(&s->proxy->pendconns))) {
if (s->maxconn && s->served + xferred >= maxconn)
break;

--
2.16.3
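
To make the key layout above easier to follow, here is a small standalone
C sketch (not part of the patch set) showing how MAKE_KEY packs the
priority class and the wrapping 20-bit timestamp, and how two offsets are
compared across the wrap; now_ms is faked here just for the example:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t u32;

    #define TIMER_LOOK_BACK (1U << 31)      /* value used by HAProxy's timers */

    static u32 now_ms;                      /* faked clock for the example */

    #define NOW_OFFSET_BOUNDARY() ((now_ms - (TIMER_LOOK_BACK >> 12)) & 0xfffff)
    #define KEY_CLASS(key)        ((u32)(key) & 0xfff00000)
    #define KEY_OFFSET(key)       ((u32)(key) & 0x000fffff)
    #define MAKE_KEY(class, offset) \
            (((u32)((class) + 0x7ff) << 20) | ((u32)(now_ms + (offset)) & 0xfffff))

    int main(void)
    {
            now_ms = 0xffff0000;            /* low 20 bits are close to wrapping */

            u32 a = MAKE_KEY(0, 0);         /* normal request, queued "now" */
            u32 b = MAKE_KEY(0, 120000);    /* same class, +2 min: low bits wrap */
            u32 c = MAKE_KEY(-1, 0);        /* lower class is served first */

            /* classes compare directly */
            printf("c served before a: %d\n", KEY_CLASS(c) < KEY_CLASS(a)); /* 1 */

            /* within a class, offsets below the boundary are "in the future"
             * and must be shifted up by 2^20 before comparing
             */
            u32 ka = KEY_OFFSET(a), kb = KEY_OFFSET(b);
            if (ka < NOW_OFFSET_BOUNDARY()) ka += 0x100000;
            if (kb < NOW_OFFSET_BOUNDARY()) kb += 0x100000;
            printf("a served before b: %d\n", ka <= kb); /* 1, despite b's smaller raw key */
            return 0;
    }

Even though b's raw key is numerically smaller than a's after the wrap, the
boundary shift keeps the dequeue order correct, which is exactly what
pendconn_first() and the pkey/ppkey comparison above rely on.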

From 62573be114e1ffc562cf65b37dce26ca9ccc4416 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 5/7] MEDIUM: add set-priority-class and set-priority-offset

This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content. At this point they are not used
yet, which is the purpose of the next commit, but all the logic to
set and clear the values is there.
---
doc/configuration.txt | 38 ++++++++++++++
doc/lua-api/index.rst | 18 +++++++
include/proto/queue.h | 19 +++++++
include/types/stream.h | 2 +
src/hlua.c | 53 +++++++++++++------
src/queue.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++
src/stream.c | 2 +
7 files changed, 254 insertions(+), 15 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 295e27c40..48b69a5bd 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -3917,6 +3917,7 @@ http-request { allow | auth [realm <realm>] | redirect <rule> | reject |
replace-value <name> <match-regex> <replace-fmt> |
set-method <fmt> | set-path <fmt> | set-query <fmt> |
set-uri <fmt> | set-tos <tos> | set-mark <mark> |
+ set-priority-class <expr> | set-priority-offset <expr>
add-acl(<file name>) <key fmt> |
del-acl(<file name>) <key fmt> |
del-map(<file name>) <key fmt> |
@@ -4113,6 +4114,24 @@ http-request { allow | auth [realm <realm>] | redirect <rule> | reject |
downloads). This works on Linux kernels 2.6.32 and above and requires
admin privileges.

+ - "set-priority-class" is used to set the queue priority class of the
+ current request. The value must be a sample expression which converts to
+ an integer in the range -2047..2047. Results outside this range will be
+ truncated. The priority class determines the order in which queued
+ requests are processed. Lower values have higher priority.
+
+ - "set-priority-offset" is used to set the queue priority timestamp offset
+ of the current request. The value must be a sample expression which
+ converts to an integer in the range -524287..524287. Results outside this
+ range will be truncated. When a request is queued, it is ordered first by
+ the priority class, then by the current timestamp adjusted by the given
+ offset in milliseconds. Lower values have higher priority.
+ Note that the resulting timestamp is only tracked with enough precision
+ for 524,287ms (8m44s287ms). If the request is queued long enough for the
+ adjusted timestamp to exceed this value, it will be misidentified as
+ highest priority. Thus it is important to set "timeout queue" to a value
+ which, when combined with the offset, does not exceed this limit.
+
- "add-acl" is used to add a new entry into an ACL. The ACL must be loaded
from a file (even a dummy empty file). The file name of the ACL to be
updated is passed between parentheses. It takes one argument: <key fmt>,
@@ -9452,6 +9471,7 @@ tcp-request content <action> [{if | unless} <condition>]
- accept : the request is accepted
- reject : the request is rejected and the connection is closed
- capture : the specified sample expression is captured
+ - set-priority-class <expr> | set-priority-offset <expr>
- { track-sc0 | track-sc1 | track-sc2 } <key> [table <table>]
- sc-inc-gpc0(<sc-id>)
- sc-inc-gpc1(<sc-id>)
@@ -9513,6 +9533,24 @@ tcp-request content <action> [{if | unless} <condition>]
The "unset-var" is used to unset a variable. See above for details about
<var-name>.

+ The "set-priority-class" is used to set the queue priority class of the
+ current request. The value must be a sample expression which converts to an
+ integer in the range -2047..2047. Results outside this range will be
+ truncated. The priority class determines the order in which queued requests
+ are processed. Lower values have higher priority.
+
+ The "set-priority-offset" is used to set the queue priority timestamp offset
+ of the current request. The value must be a sample expression which converts
+ to an integer in the range -524287..524287. Results outside this range will be
+ truncated. When a request is queued, it is ordered first by the priority
+ class, then by the current timestamp adjusted by the given offset in
+ milliseconds. Lower values have higher priority.
+ Note that the resulting timestamp is only tracked with enough precision for
+ 524,287ms (8m44s287ms). If the request is queued long enough for the
+ adjusted timestamp to exceed this value, it will be misidentified as highest
+ priority. Thus it is important to set "timeout queue" to a value which, when
+ combined with the offset, does not exceed this limit.
+
The "send-spoe-group" is used to trigger sending of a group of SPOE
messages. To do so, the SPOE engine used to send messages must be defined, as
well as the SPOE group to send. Of course, the SPOE engine must refer to an
diff --git a/doc/lua-api/index.rst b/doc/lua-api/index.rst
index ee9dab55c..0c79766eb 100644
--- a/doc/lua-api/index.rst
+++ b/doc/lua-api/index.rst
@@ -1769,6 +1769,24 @@ TXN class
:param class_txn txn: The class txn object containing the data.
:param integer mark: The mark value.

+.. js:function:: TXN.set_priority_class(txn, prio)
+
+ This function adjusts the priority class of the transaction. The value should
+ be within the range -2047..2047. Values outside this range will be
+ truncated.
+
+ See the HAProxy configuration.txt file keyword "http-request" action
+ "set-priority-class" for details.
+
+.. js:function:: TXN.set_priority_offset(txn, prio)
+
+ This function adjusts the priority offset of the transaction. The value
+ should be within the range -524287..524287. Values outside this range will be
+ truncated.
+
+ See the HAProxy configuration.txt file keyword "http-request" action
+ "set-priority-offset" for details.
+
.. _socket_class:

Socket class
diff --git a/include/proto/queue.h b/include/proto/queue.h
index 166da12f4..28709fa99 100644
--- a/include/proto/queue.h
+++ b/include/proto/queue.h
@@ -90,6 +90,25 @@ static inline int may_dequeue_tasks(const struct server *s, const struct proxy *
(!s->maxconn || s->cur_sess < srv_dynamic_maxconn(s)));
}

+static inline int queue_limit_class(int class)
+{
+ if (class < -0x7ff)
+ return -0x7ff;
+ if (class > 0x7ff)
+ return 0x7ff;
+ return class;
+}
+
+static inline int queue_limit_offset(int offset)
+{
+ if (offset < -0x7ffff)
+ return -0x7ffff;
+ if (offset > 0x7ffff)
+ return 0x7ffff;
+ return offset;
+}
+
+
#endif /* _PROTO_QUEUE_H */

/*
diff --git a/include/types/stream.h b/include/types/stream.h
index a3137bf31..feeb56b12 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -131,6 +131,8 @@ struct stream {
struct task *task; /* the task associated with this stream */
unsigned short pending_events; /* the pending events not yet processed by the stream.
* This is a bit field of TASK_WOKEN_* */
+ int16_t priority_class; /* priority class of the stream for the pending queue */
+ int32_t priority_offset; /* priority offset of the stream for the pending queue */

struct list list; /* position in global streams list */
struct list by_srv; /* position in server stream list */
diff --git a/src/hlua.c b/src/hlua.c
index bea9c4694..c29a7cc4e 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -44,6 +44,7 @@
#include <proto/hlua_fcn.h>
#include <proto/map.h>
#include <proto/obj_type.h>
+#include <proto/queue.h>
#include <proto/pattern.h>
#include <proto/payload.h>
#include <proto/proto_http.h>
@@ -5388,6 +5389,26 @@ __LJMP static int hlua_txn_set_mark(lua_State *L)
return 0;
}

+__LJMP static int hlua_txn_set_priority_class(lua_State *L)
+{
+ struct hlua_txn *htxn;
+
+ MAY_LJMP(check_args(L, 2, "set_priority_class"));
+ htxn = MAY_LJMP(hlua_checktxn(L, 1));
+ htxn->s->priority_class = queue_limit_class(MAY_LJMP(luaL_checkinteger(L, 2)));
+ return 0;
+}
+
+__LJMP static int hlua_txn_set_priority_offset(lua_State *L)
+{
+ struct hlua_txn *htxn;
+
+ MAY_LJMP(check_args(L, 2, "set_priority_offset"));
+ htxn = MAY_LJMP(hlua_checktxn(L, 1));
+ htxn->s->priority_offset = queue_limit_offset(MAY_LJMP(luaL_checkinteger(L, 2)));
+ return 0;
+}
+
/* This function is an Lua binding that send pending data
* to the client, and close the stream interface.
*/
@@ -7931,21 +7952,23 @@ void hlua_init(void)
lua_newtable(gL.T);

/* Register Lua functions. */
- hlua_class_function(gL.T, "set_priv", hlua_set_priv);
- hlua_class_function(gL.T, "get_priv", hlua_get_priv);
- hlua_class_function(gL.T, "set_var", hlua_set_var);
- hlua_class_function(gL.T, "unset_var", hlua_unset_var);
- hlua_class_function(gL.T, "get_var", hlua_get_var);
- hlua_class_function(gL.T, "done", hlua_txn_done);
- hlua_class_function(gL.T, "set_loglevel",hlua_txn_set_loglevel);
- hlua_class_function(gL.T, "set_tos", hlua_txn_set_tos);
- hlua_class_function(gL.T, "set_mark", hlua_txn_set_mark);
- hlua_class_function(gL.T, "deflog", hlua_txn_deflog);
- hlua_class_function(gL.T, "log", hlua_txn_log);
- hlua_class_function(gL.T, "Debug", hlua_txn_log_debug);
- hlua_class_function(gL.T, "Info", hlua_txn_log_info);
- hlua_class_function(gL.T, "Warning", hlua_txn_log_warning);
- hlua_class_function(gL.T, "Alert", hlua_txn_log_alert);
+ hlua_class_function(gL.T, "set_priv", hlua_set_priv);
+ hlua_class_function(gL.T, "get_priv", hlua_get_priv);
+ hlua_class_function(gL.T, "set_var", hlua_set_var);
+ hlua_class_function(gL.T, "unset_var", hlua_unset_var);
+ hlua_class_function(gL.T, "get_var", hlua_get_var);
+ hlua_class_function(gL.T, "done", hlua_txn_done);
+ hlua_class_function(gL.T, "set_loglevel", hlua_txn_set_loglevel);
+ hlua_class_function(gL.T, "set_tos", hlua_txn_set_tos);
+ hlua_class_function(gL.T, "set_mark", hlua_txn_set_mark);
+ hlua_class_function(gL.T, "set_priority_class", hlua_txn_set_priority_class);
+ hlua_class_function(gL.T, "set_priority_offset", hlua_txn_set_priority_offset);
+ hlua_class_function(gL.T, "deflog", hlua_txn_deflog);
+ hlua_class_function(gL.T, "log", hlua_txn_log);
+ hlua_class_function(gL.T, "Debug", hlua_txn_log_debug);
+ hlua_class_function(gL.T, "Info", hlua_txn_log_info);
+ hlua_class_function(gL.T, "Warning", hlua_txn_log_warning);
+ hlua_class_function(gL.T, "Alert", hlua_txn_log_alert);

lua_rawset(gL.T, -3);

diff --git a/src/queue.c b/src/queue.c
index 6bcaba5c1..12020a084 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -75,11 +75,14 @@
#include <common/hathreads.h>
#include <eb32tree.h>

+#include <proto/proto_http.h>
#include <proto/queue.h>
+#include <proto/sample.h>
#include <proto/server.h>
#include <proto/stream.h>
#include <proto/stream_interface.h>
#include <proto/task.h>
+#include <proto/tcp_rules.h>


struct pool_head *pool_head_pendconn;
@@ -456,6 +459,140 @@ int pendconn_dequeue(struct stream *strm)
return 0;
}

+static enum act_return action_set_priority_class(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ s->priority_class = queue_limit_class(smp->data.u.sint);
+ return ACT_RET_CONT;
+}
+
+static enum act_return action_set_priority_offset(struct act_rule *rule, struct proxy *px,
+ struct session *sess, struct stream *s, int flags)
+{
+ struct sample *smp;
+
+ smp = sample_fetch_as_type(px, sess, s, SMP_OPT_DIR_REQ|SMP_OPT_FINAL, rule->arg.expr, SMP_T_SINT);
+ if (!smp)
+ return ACT_RET_CONT;
+
+ s->priority_offset = queue_limit_offset(smp->data.u.sint);
+
+ return ACT_RET_CONT;
+}
+
+static enum act_parse_ret parse_set_priority_class(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_class;
+ return ACT_RET_PRS_OK;
+}
+
+static enum act_parse_ret parse_set_priority_offset(const char **args, int *arg, struct proxy *px,
+ struct act_rule *rule, char **err)
+{
+ unsigned int where = 0;
+
+ rule->arg.expr = sample_parse_expr((char **)args, arg, px->conf.args.file,
+ px->conf.args.line, err, &px->conf.args);
+ if (!rule->arg.expr)
+ return ACT_RET_PRS_ERR;
+
+ if (px->cap & PR_CAP_FE)
+ where |= SMP_VAL_FE_HRQ_HDR;
+ if (px->cap & PR_CAP_BE)
+ where |= SMP_VAL_BE_HRQ_HDR;
+
+ if (!(rule->arg.expr->fetch->val & where)) {
+ memprintf(err,
+ "fetch method '%s' extracts information from '%s', none of which is available here",
+ args[0], sample_src_names(rule->arg.expr->fetch->use));
+ free(rule->arg.expr);
+ return ACT_RET_PRS_ERR;
+ }
+
+ rule->action = ACT_CUSTOM;
+ rule->action_ptr = action_set_priority_offset;
+ return ACT_RET_PRS_OK;
+}
+
+static struct action_kw_list tcp_cont_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static struct action_kw_list http_req_kws = {ILH, {
+ { "set-priority-class", parse_set_priority_class },
+ { "set-priority-offset", parse_set_priority_offset },
+ { /* END */ }
+}};
+
+static int
+smp_fetch_priority_class(const struct arg *args, struct sample *smp, const char *kw, void *private)
+{
+ if (!smp->strm)
+ return 0;
+
+ smp->data.type = SMP_T_SINT;
+ smp->data.u.sint = smp->strm->priority_class;
+
+ return 1;
+}
+
+static int
+smp_fetch_priority_offset(const struct arg *args, struct sample *smp, const char *kw, void *private)
+{
+ if (!smp->strm)
+ return 0;
+
+ smp->data.type = SMP_T_SINT;
+ smp->data.u.sint = smp->strm->priority_offset;
+
+ return 1;
+}
+
+
+static struct sample_fetch_kw_list smp_kws = {ILH, {
+ { "prio_class", smp_fetch_priority_class, 0, NULL, SMP_T_SINT, SMP_USE_INTRN, },
+ { "prio_offset", smp_fetch_priority_offset, 0, NULL, SMP_T_SINT, SMP_USE_INTRN, },
+ { /* END */},
+}};
+
+__attribute__((constructor))
+static void __queue_init(void)
+{
+ tcp_req_cont_keywords_register(&tcp_cont_kws);
+ http_req_keywords_register(&http_req_kws);
+ sample_register_fetches(&smp_kws);
+}
+
/*
* Local variables:
* c-indent-level: 8
diff --git a/src/stream.c b/src/stream.c
index 50b0d6d51..9dbced0d3 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -221,6 +221,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->target = sess->listener ? sess->listener->default_target : NULL;

s->pend_pos = NULL;
+ s->priority_class = 0;
+ s->priority_offset = 0;

/* init store persistence */
s->store_count = 0;
--
2.16.3

From acfde25a97bc1706939b188b16454c7979d0925e Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 4/7] MINOR: queue: replace the linked list with a tree

We'll need trees to manage the queues by priorities. This change replaces
the list with a tree based on a single key. It's effectively a list but
allows us to get rid of the list management right now.
---
include/proto/queue.h | 2 +-
include/types/proxy.h | 2 +-
include/types/queue.h | 3 ++-
include/types/server.h | 2 +-
src/hlua.c | 4 ++--
src/proxy.c | 2 +-
src/queue.c | 35 ++++++++++++++++++++++-------------
src/server.c | 2 +-
8 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/include/proto/queue.h b/include/proto/queue.h
index 11696dbc4..166da12f4 100644
--- a/include/proto/queue.h
+++ b/include/proto/queue.h
@@ -52,7 +52,7 @@ void pendconn_unlink(struct pendconn *p);
*/
static inline void pendconn_cond_unlink(struct pendconn *p)
{
- if (p && !LIST_ISEMPTY(&p->list))
+ if (p && p->node.node.leaf_p)
pendconn_unlink(p);
}

diff --git a/include/types/proxy.h b/include/types/proxy.h
index 234a142c0..37a50609c 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -322,7 +322,7 @@ struct proxy {
int serverfin; /* timeout to apply to server half-closed connections */
} timeout;
char *id, *desc; /* proxy id (name) and description */
- struct list pendconns; /* pending connections with no server assigned yet */
+ struct eb_root pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
unsigned int queue_idx; /* number of pending connections which have been de-queued */
diff --git a/include/types/queue.h b/include/types/queue.h
index 575cc5929..ca7191c34 100644
--- a/include/types/queue.h
+++ b/include/types/queue.h
@@ -37,7 +37,8 @@ struct pendconn {
struct proxy *px;
struct server *srv; /* the server we are waiting for, may be NULL if don't care */
struct server *target; /* the server that was assigned, = srv except if srv==NULL */
- struct list list; /* next pendconn */
+ struct eb32_node node;
+ __decl_hathreads(HA_SPINLOCK_T lock);
};

#endif /* _TYPES_QUEUE_H */
diff --git a/include/types/server.h b/include/types/server.h
index 7d0ba4571..88259281d 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -215,7 +215,7 @@ struct server {
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */

- struct list pendconns; /* pending connections */
+ struct eb_root pendconns; /* pending connections */
struct list actconns; /* active connections */
struct list *priv_conns; /* private idle connections attached to stream interfaces */
struct list *idle_conns; /* sharable idle connections attached or not to a stream interface */
diff --git a/src/hlua.c b/src/hlua.c
index 4d42b1fec..bea9c4694 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -8007,7 +8007,7 @@ void hlua_init(void)
socket_tcp.proxy = &socket_proxy;
socket_tcp.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_tcp.actconns);
- LIST_INIT(&socket_tcp.pendconns);
+ socket_tcp.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
@@ -8053,7 +8053,7 @@ void hlua_init(void)
socket_ssl.proxy = &socket_proxy;
socket_ssl.obj_type = OBJ_TYPE_SERVER;
LIST_INIT(&socket_ssl.actconns);
- LIST_INIT(&socket_ssl.pendconns);
+ socket_ssl.pendconns = EB_ROOT;
socket_tcp.priv_conns = NULL;
socket_tcp.idle_conns = NULL;
socket_tcp.safe_conns = NULL;
diff --git a/src/proxy.c b/src/proxy.c
index bf87a2f9c..ff094d210 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -726,7 +726,7 @@ void init_new_proxy(struct proxy *p)
{
memset(p, 0, sizeof(struct proxy));
p->obj_type = OBJ_TYPE_PROXY;
- LIST_INIT(&p->pendconns);
+ p->pendconns = EB_ROOT;
LIST_INIT(&p->acl);
LIST_INIT(&p->http_req_rules);
LIST_INIT(&p->http_res_rules);
diff --git a/src/queue.c b/src/queue.c
index 4c8c4c9cd..6bcaba5c1 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -21,7 +21,7 @@
* A stream does not necessarily have such a pendconn. Thus the pendconn is
* designated by the stream->pend_pos pointer. This results in some properties :
* - pendconn->strm->pend_pos is never NULL for any valid pendconn
- * - if LIST_ISEMPTY(pendconn->list) is true, the element is unlinked,
+ * - if p->node.node.leaf_p is NULL, the element is unlinked,
* otherwise it necessarily belongs to one of the other lists ; this may
* not be atomically checked under threads though ;
* - pendconn->px is never NULL if pendconn->list is not empty
@@ -73,6 +73,7 @@
#include <common/memory.h>
#include <common/time.h>
#include <common/hathreads.h>
+#include <eb32tree.h>

#include <proto/queue.h>
#include <proto/server.h>
@@ -137,8 +138,7 @@ static void __pendconn_unlink(struct pendconn *p)
p->px->nbpend--;
}
HA_ATOMIC_SUB(&p->px->totpend, 1);
- LIST_DEL(&p->list);
- LIST_INIT(&p->list);
+ eb32_delete(&p->node);
}

/* Locks the queue the pendconn element belongs to. This relies on both p->px
@@ -204,20 +204,24 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
struct pendconn *p = NULL;
struct pendconn *pp = NULL;
struct server *rsrv;
+ struct eb32_node *node;

rsrv = srv->track;
if (!rsrv)
rsrv = srv;

p = NULL;
- if (srv->nbpend)
- p = LIST_ELEM(srv->pendconns.n, struct pendconn *, list);
+ if (srv->nbpend) {
+ node = eb32_first(&srv->pendconns);
+ p = eb32_entry(node, struct pendconn, node);
+ }

if (srv_currently_usable(rsrv) && px->nbpend &&
(!(srv->flags & SRV_F_BACKUP) ||
(!px->srv_act &&
(srv == px->lbprm.fbck || (px->options & PR_O_USE_ALL_BK))))) {
- pp = LIST_ELEM(px->pendconns.n, struct pendconn *, list);
+ node = eb32_first(&px->pendconns);
+ pp = eb32_entry(node, struct pendconn, node);

/* If the server pendconn is older than the proxy one,
* we process the server one.
@@ -303,6 +307,7 @@ struct pendconn *pendconn_add(struct stream *strm)
px = strm->be;
p->target = NULL;
p->srv = srv;
+ p->node.key = 0;
p->px = px;
p->strm = strm;
p->strm_flags = strm->flags;
@@ -314,14 +319,14 @@ struct pendconn *pendconn_add(struct stream *strm)
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
p->queue_idx = srv->queue_idx - 1; // for increment
- LIST_ADDQ(&srv->pendconns, &p->list);
+ eb32_insert(&srv->pendconns, &p->node);
}
else {
px->nbpend++;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
p->queue_idx = px->queue_idx - 1; // for increment
- LIST_ADDQ(&px->pendconns, &p->list);
+ eb32_insert(&px->pendconns, &p->node);
}
strm->pend_pos = p;

@@ -336,7 +341,8 @@ struct pendconn *pendconn_add(struct stream *strm)
*/
int pendconn_redistribute(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int xferred = 0;

/* The REDISP option was specified. We will ignore cookie and force to
@@ -345,7 +351,8 @@ int pendconn_redistribute(struct server *s)
return 0;

HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
- list_for_each_entry_safe(p, pback, &s->pendconns, list) {
+ for (node = eb32_first(&s->pendconns); node; node = eb32_next(node)) {
+ p = eb32_entry(node, struct pendconn, node);
if (p->strm_flags & SF_FORCE_PRST)
continue;

@@ -366,7 +373,8 @@ int pendconn_redistribute(struct server *s)
*/
int pendconn_grab_from_px(struct server *s)
{
- struct pendconn *p, *pback;
+ struct pendconn *p;
+ struct eb32_node *node;
int maxconn, xferred = 0;

if (!srv_currently_usable(s))
@@ -383,7 +391,8 @@ int pendconn_grab_from_px(struct server *s)

HA_SPIN_LOCK(PROXY_LOCK, &s->proxy->lock);
maxconn = srv_dynamic_maxconn(s);
- list_for_each_entry_safe(p, pback, &s->proxy->pendconns, list) {
+ while ((node = eb32_first(&s->proxy->pendconns))) {
+ p = eb32_entry(node, struct pendconn, node);
if (s->maxconn && s->served + xferred >= maxconn)
break;

@@ -428,7 +437,7 @@ int pendconn_dequeue(struct stream *strm)
* unlinked, these functions were completely done.
*/
pendconn_queue_lock(p);
- is_unlinked = LIST_ISEMPTY(&p->list);
+ is_unlinked = !p->node.node.leaf_p;
pendconn_queue_unlock(p);

if (!is_unlinked)
diff --git a/src/server.c b/src/server.c
index c885debca..1d7a5a771 100644
--- a/src/server.c
+++ b/src/server.c
@@ -1627,7 +1627,7 @@ static struct server *new_server(struct proxy *proxy)
srv->obj_type = OBJ_TYPE_SERVER;
srv->proxy = proxy;
LIST_INIT(&srv->actconns);
- LIST_INIT(&srv->pendconns);
+ srv->pendconns = EB_ROOT;

if ((srv->priv_conns = calloc(global.nbthread, sizeof(*srv->priv_conns))) == NULL)
goto free_srv;
--
2.16.3

From e1ef0051745ad16c80d299092958a615f2369478 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 3/7] MINOR: queue: store the queue index in the stream when
enqueuing

We store the queue index in the stream and check it on dequeueing to
figure how many entries were processed in between. This way we'll be
able to count the elements that may later be added before ours.

WARNING! a call to pendconn_free() and/or something similar is missing
in front of a number of calls to do_log() in order to report the correct
dequeue value, especially in case of abort or timeout.
---
include/types/proxy.h | 1 +
include/types/queue.h | 1 +
include/types/server.h | 1 +
src/queue.c | 21 +++++++++++++++------
src/stream.c | 1 +
5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/types/proxy.h b/include/types/proxy.h
index ec95286b6..234a142c0 100644
--- a/include/types/proxy.h
+++ b/include/types/proxy.h
@@ -325,6 +325,7 @@ struct proxy {
struct list pendconns; /* pending connections with no server assigned yet */
int nbpend; /* number of pending connections with no server assigned yet */
int totpend; /* total number of pending connections on this instance (for stats) */
+ unsigned int queue_idx; /* number of pending connections which have been de-queued */
unsigned int feconn, beconn; /* # of active frontend and backends streams */
struct freq_ctr fe_req_per_sec; /* HTTP requests per second on the frontend */
struct freq_ctr fe_conn_per_sec; /* received connections per second on the frontend */
diff --git a/include/types/queue.h b/include/types/queue.h
index c025b9ce6..575cc5929 100644
--- a/include/types/queue.h
+++ b/include/types/queue.h
@@ -32,6 +32,7 @@ struct stream;

struct pendconn {
int strm_flags; /* stream flags */
+ unsigned int queue_idx; /* value of proxy/server queue_idx at time of enqueue */
struct stream *strm;
struct proxy *px;
struct server *srv; /* the server we are waiting for, may be NULL if don't care */
diff --git a/include/types/server.h b/include/types/server.h
index 7c6d2257b..7d0ba4571 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -210,6 +210,7 @@ struct server {
int cur_sess; /* number of currently active sessions (including syn_sent) */
unsigned maxconn, minconn; /* max # of active sessions (0 = unlimited), min# for dynamic limit. */
int nbpend; /* number of pending connections */
+ unsigned int queue_idx; /* count of pending connections which have been de-queued */
int maxqueue; /* maximum number of pending connections allowed */
struct freq_ctr sess_per_sec; /* sessions per second on this server */
struct be_counters counters; /* statistics counters */
diff --git a/src/queue.c b/src/queue.c
index aa22256b1..4c8c4c9cd 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -129,10 +129,13 @@ unsigned int srv_dynamic_maxconn(const struct server *s)
*/
static void __pendconn_unlink(struct pendconn *p)
{
- if (p->srv)
+ if (p->srv) {
+ p->strm->logs.srv_queue_pos += p->srv->queue_idx - p->queue_idx;
p->srv->nbpend--;
- else
+ } else {
+ p->strm->logs.prx_queue_pos += p->px->queue_idx - p->queue_idx;
p->px->nbpend--;
+ }
HA_ATOMIC_SUB(&p->px->totpend, 1);
LIST_DEL(&p->list);
LIST_INIT(&p->list);
@@ -199,6 +202,7 @@ void pendconn_unlink(struct pendconn *p)
static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
{
struct pendconn *p = NULL;
+ struct pendconn *pp = NULL;
struct server *rsrv;

rsrv = srv->track;
@@ -213,8 +217,6 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
(!(srv->flags & SRV_F_BACKUP) ||
(!px->srv_act &&
(srv == px->lbprm.fbck || (px->options & PR_O_USE_ALL_BK))))) {
- struct pendconn *pp;
-
pp = LIST_ELEM(px->pendconns.n, struct pendconn *, list);

/* If the server pendconn is older than the proxy one,
@@ -236,6 +238,11 @@ static int pendconn_process_next_strm(struct server *srv, struct proxy *px)
p->strm_flags |= SF_ASSIGNED;
p->target = srv;

+ if (p != pp)
+ srv->queue_idx++;
+ else
+ px->queue_idx++;
+
HA_ATOMIC_ADD(&srv->served, 1);
HA_ATOMIC_ADD(&srv->proxy->served, 1);
if (px->lbprm.server_take_conn)
@@ -272,6 +279,8 @@ void process_srv_queue(struct server *s)
* are updated accordingly. Returns NULL if no memory is available, otherwise the
* pendconn itself. If the stream was already marked as served, its flag is
* cleared. It is illegal to call this function with a non-NULL strm->srv_conn.
+ * The stream's queue position is counted with an offset of -1 because we want
+ * to make sure that being at the first position in the queue reports 1.
*
* This function must be called by the stream itself, so in the context of
* process_stream.
@@ -302,16 +311,16 @@ struct pendconn *pendconn_add(struct stream *strm)

if (srv) {
srv->nbpend++;
- strm->logs.srv_queue_pos += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
+ p->queue_idx = srv->queue_idx - 1; // for increment
LIST_ADDQ(&srv->pendconns, &p->list);
}
else {
px->nbpend++;
- strm->logs.prx_queue_pos += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
+ p->queue_idx = px->queue_idx - 1; // for increment
LIST_ADDQ(&px->pendconns, &p->list);
}
strm->pend_pos = p;
diff --git a/src/stream.c b/src/stream.c
index d741a86b9..50b0d6d51 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -488,6 +488,7 @@ void stream_process_counters(struct stream *s)
void *ptr1,*ptr2;
struct stksess *ts;
int i;
+ unsigned int queue_idx;

bytes = s->req.total - s->logs.bytes_in;
s->logs.bytes_in = s->req.total;
--
2.16.3

From dc43cf091a950f08d152f0626a0fd2703bf9e3f3 Mon Sep 17 00:00:00 2001
From: Patrick Hemmer <[email protected]>
Date: Fri, 11 May 2018 12:52:31 -0400
Subject: [PATCH 2/7] MINOR: stream: rename {srv,prx}_queue_size to *_queue_pos

The current name is misleading as it implies a queue size, but the value
instead indicates a position in the queue.
The value is only the queue size at the exact moment the element is enqueued.
Soon we will gain the ability to insert anywhere into the queue, upon which
clarity of the name is more important.
---
include/types/stream.h | 4 ++--
src/log.c | 4 ++--
src/proto_http.c | 4 ++--
src/queue.c | 4 ++--
src/stream.c | 4 ++--
5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/types/stream.h b/include/types/stream.h
index 0dbc79f44..a3137bf31 100644
--- a/include/types/stream.h
+++ b/include/types/stream.h
@@ -105,8 +105,8 @@ struct strm_logs {
long t_connect; /* delay before the connect() to the server succeeds, -1 if never occurs */
long t_data; /* delay before the first data byte from the server ... */
unsigned long t_close; /* total stream duration */
- unsigned long srv_queue_size; /* number of streams waiting for a connect slot on this server at accept() time (in direct assignment) */
- unsigned long prx_queue_size; /* overall number of streams waiting for a connect slot on this instance at accept() time */
+ unsigned long srv_queue_pos; /* number of streams de-queued while waiting for a connection slot on this server */
+ unsigned long prx_queue_pos; /* number of streams de-queued while waiting for a connection slot on this instance */
long long bytes_in; /* number of bytes transferred from the client to the server */
long long bytes_out; /* number of bytes transferred from the server to the client */
};
diff --git a/src/log.c b/src/log.c
index e2ced80d8..dcb2ba069 100644
--- a/src/log.c
+++ b/src/log.c
@@ -2135,7 +2135,7 @@ int build_logline(struct stream *s, char *dst, size_t maxsize, struct list *list
break;

case LOG_FMT_SRVQUEUE: // %sq
- ret = ltoa_o(s->logs.srv_queue_size, tmplog, dst + maxsize - tmplog);
+ ret = ltoa_o(s->logs.srv_queue_pos, tmplog, dst + maxsize - tmplog);
if (ret == NULL)
goto out;
tmplog = ret;
@@ -2143,7 +2143,7 @@ int build_logline(struct stream *s, char *dst, size_t maxsize, struct list *list
break;

case LOG_FMT_BCKQUEUE: // %bq
- ret = ltoa_o(s->logs.prx_queue_size, tmplog, dst + maxsize - tmplog);
+ ret = ltoa_o(s->logs.prx_queue_pos, tmplog, dst + maxsize - tmplog);
if (ret == NULL)
goto out;
tmplog = ret;
diff --git a/src/proto_http.c b/src/proto_http.c
index 9cbf3edda..999c39ebf 100644
--- a/src/proto_http.c
+++ b/src/proto_http.c
@@ -4426,8 +4426,8 @@ void http_end_txn_clean_session(struct stream *s)
s->logs.t_connect = -1;
s->logs.t_data = -1;
s->logs.t_close = 0;
- s->logs.prx_queue_size = 0; /* we get the number of pending conns before us */
- s->logs.srv_queue_size = 0; /* we will get this number soon */
+ s->logs.prx_queue_pos = 0; /* we get the number of pending conns before us */
+ s->logs.srv_queue_pos = 0; /* we will get this number soon */

s->logs.bytes_in = s->req.total = ci_data(&s->req);
s->logs.bytes_out = s->res.total = ci_data(&s->res);
diff --git a/src/queue.c b/src/queue.c
index 57fd087bf..aa22256b1 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -302,14 +302,14 @@ struct pendconn *pendconn_add(struct stream *strm)

if (srv) {
srv->nbpend++;
- strm->logs.srv_queue_size += srv->nbpend;
+ strm->logs.srv_queue_pos += srv->nbpend;
if (srv->nbpend > srv->counters.nbpend_max)
srv->counters.nbpend_max = srv->nbpend;
LIST_ADDQ(&srv->pendconns, &p->list);
}
else {
px->nbpend++;
- strm->logs.prx_queue_size += px->nbpend;
+ strm->logs.prx_queue_pos += px->nbpend;
if (px->nbpend > px->be_counters.nbpend_max)
px->be_counters.nbpend_max = px->nbpend;
LIST_ADDQ(&px->pendconns, &p->list);
diff --git a/src/stream.c b/src/stream.c
index aa021b0b3..d741a86b9 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -125,8 +125,8 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
s->logs.t_data = -1;
s->logs.t_close = 0;
s->logs.bytes_in = s->logs.bytes_out = 0;
- s->logs.prx_queue_size = 0; /* we get the number of pending conns before us */
- s->logs.srv_queue_size = 0; /* we will get this number soon */
+ s->logs.prx_queue_pos = 0; /* we get the number of pending conns before us */
+ s->logs.srv_queue_pos = 0; /* we will get this number soon */

/* default logging function */
s->do_log = strm_log;
--
2.16.3

From 734fcb3408a21e909bbc487c8e7d66acaa78f377 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 25 Jul 2018 06:55:12 +0200
Subject: [PATCH 1/7] MINOR: queue: make sure the pendconn is released before
logging

We'll soon need to rely on the pendconn position at the time of dequeuing
to figure the position a stream took in the queue. Usually it's not a
problem since pendconn_free() is called once the connection starts, but
it will make a difference for failed dequeues (eg: queue timeout reached).
Thus it's important to call pendconn_free() before logging in cases we are
not certain whether it was already performed, and to call pendconn_unlink()
after we know the pendconn will not be used so that we collect the queue
state as accurately as possible. As a benefit it will also make the
server's and backend's queues count more accurate in these cases.
---
src/proto_http.c | 5 +++--
src/stream.c | 14 ++++++++++++++
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/proto_http.c b/src/proto_http.c
index 42c9d1a62..9cbf3edda 100644
--- a/src/proto_http.c
+++ b/src/proto_http.c
@@ -4403,6 +4403,9 @@ void http_end_txn_clean_session(struct stream *s)
s->logs.bytes_in -= ci_data(&s->req);
s->logs.bytes_out -= ci_data(&s->res);

+ /* we may need to know the position in the queue */
+ pendconn_free(s);
+
/* let's do a final log if we need it */
if (!LIST_ISEMPTY(&fe->logformat) && s->logs.logwait &&
!(s->flags & SF_MONITOR) &&
@@ -4429,8 +4432,6 @@ void http_end_txn_clean_session(struct stream *s)
s->logs.bytes_in = s->req.total = ci_data(&s->req);
s->logs.bytes_out = s->res.total = ci_data(&s->res);

- pendconn_free(s);
-
if (objt_server(s->target)) {
if (s->flags & SF_CURR_SESS) {
s->flags &= ~SF_CURR_SESS;
diff --git a/src/stream.c b/src/stream.c
index d794c28a9..aa021b0b3 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -799,6 +799,7 @@ static void sess_establish(struct stream *s)
/* if the user wants to log as soon as possible, without counting
* bytes from the server, then this is the right moment. */
if (!LIST_ISEMPTY(&strm_fe(s)->logformat) && !(s->logs.logwait & LW_BYTES)) {
+ /* note: no pend_pos here, session is established */
s->logs.t_close = s->logs.t_connect; /* to get a valid end date */
s->do_log(s);
}
@@ -910,6 +911,9 @@ static void sess_update_stream_int(struct stream *s)

s->logs.t_queue = tv_ms_elapsed(&s->logs.tv_accept, &now);

+ /* we may need to know the position in the queue for logging */
+ pendconn_cond_unlink(s->pend_pos);
+
/* no stream was ever accounted for this server */
si->state = SI_ST_CLO;
if (s->srv_error)
@@ -950,6 +954,10 @@ static void sess_update_stream_int(struct stream *s)
/* ... and timeout expired */
si->exp = TICK_ETERNITY;
s->logs.t_queue = tv_ms_elapsed(&s->logs.tv_accept, &now);
+
+ /* we may need to know the position in the queue for logging */
+ pendconn_cond_unlink(s->pend_pos);
+
if (srv)
HA_ATOMIC_ADD(&srv->counters.failed_conns, 1);
HA_ATOMIC_ADD(&s->be->be_counters.failed_conns, 1);
@@ -967,6 +975,10 @@ static void sess_update_stream_int(struct stream *s)
/* Connection remains in queue, check if we have to abort it */
if (check_req_may_abort(req, s)) {
s->logs.t_queue = tv_ms_elapsed(&s->logs.tv_accept, &now);
+
+ /* we may need to know the position in the queue for logging */
+ pendconn_cond_unlink(s->pend_pos);
+
si->err_type |= SI_ET_QUEUE_ABRT;
goto abort_connection;
}
@@ -2503,6 +2515,8 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
if (!LIST_ISEMPTY(&sess->fe->logformat) && s->logs.logwait &&
!(s->flags & SF_MONITOR) &&
(!(sess->fe->options & PR_O_NULLNOLOG) || req->total)) {
+ /* we may need to know the position in the queue */
+ pendconn_free(s);
s->do_log(s);
}

--
2.16.3
Hi Patrick,

On Thu, Aug 09, 2018 at 06:29:33PM -0400, Patrick Hemmer wrote:
> I also went and removed the queue position counter code from
> stream_process_counters(), and the logging still appears to work fine
> (but I could have easily missed some potential scenarios).

OK nice!

> In regards to the idea you added to the commit message on the queue index
> > It could even be further improved to avoid manipulating the stream
> > from the queue :
> > - queue_idx = initial position when enqueuing
> > - queue_idx = measured position when dequeuing
>
> Because the stream can pass through multiple queues, we'd have to make
> sure that every time we de-queue, that the stream code pulls the value
> and increments itself. However looking at all the various places
> pendconn_unlink() gets called, I think it would be difficult for the
> stream code to know when the value needs to be pulled.

In fact the only two ways to be queued multiple times are :
- get redispatched after a failed connection attempt to a server ; in
this case the pendconn was already dequeued to connect to the first
server so the state is unambiguous

- being redistributed when a pendconn was lying in a server's queue
and the server is marked down. In this case the pendconn is simply
unlinked, process_stream is woken up and I don't know if we collect
queue_idx during the call to process_stream before queuing again.

Anyway this is very minor, I'd even say cosmetic. I'm pretty fine with
the current state already!

> Anyway, attached is the latest patch set for review again.

I've applied it to a local temporary branch and am willing to merge it
if you're OK. I've only adjusted 3 minor things :
- removed the warning in the 3rd patch's commit message mentioning
the missing calls to pendconn_unlink() before logs, because these
ones were finally moved to patch 1 but we didn't edit this message ;

- removed the now unused queue_idx from stream_process_counters()
which emits a warning since it's unused

- removed the "__decl_hathreads(HA_SPINLOCK_T lock);" from struct
pendconn that is a leftover from the rebase onto the last queue
updates.

So please just let me know, I'm ready ;-)

Thanks,
Willy
On 2018/8/10 09:19, Willy Tarreau wrote:
> [...]
> So please just let me know, I'm ready ;-)
>
> Thanks,
> Willy
Those changes seem good to me.
Merge it!
Quite happy to see this done with. Any issues can be addressed as bug fixes.

Thanks

-Patrick
On Fri, Aug 10, 2018 at 09:54:38AM -0400, Patrick Hemmer wrote:
> Merge it!

Now done!

> Quite happy to see this done with. Any issues can be addressed as bug fixes.

Yes, same here. I've updated the ROADMAP file to remove this 8-year-old
entry :-)

Thanks!
Willy
Hi.

Am 10-08-2018 17:12, schrieb Willy Tarreau:
> On Fri, Aug 10, 2018 at 09:54:38AM -0400, Patrick Hemmer wrote:
>> Merge it!
>
> Now done!
>
>> Quite happy to see this done with. Any issues can be addressed as bug
>> fixes.
>
> Yes, same here. I've updated the ROADMAP file to remove this 8-year-old
> entry :-)

Great ;-)

As we now have this feature, Patrick, do you plan to create a blog post
or an example config for it?

Sorry to say this, but I haven't fully understood this feature yet.

I was going through the discussion and there are some mentions of
time-based priority and an example with dates.

I have found some information about priority-based load balancing, but
I'm not sure whether it matches the implementation in haproxy.

https://docs.citrix.com/en-us/netscaler/12-1/priority-load-balancing.html
https://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm-basics-12-1-0/4.html#unique_260681479
https://docs.microsoft.com/en-us/azure/traffic-manager/traffic-manager-routing-methods#priority

Let me explain how I understand this feature, and you can fill the gaps
in my understanding.

I think it is a requeue mechanism for the connection and not a selector
for a server or a backend, but I may be wrong.

###

listen|frontend test001
bind :8080
# some other options
http-request set-priority-class -10s if PAY_BUTTON
http-request set-priority-offset 5s if LOGO

# as a sample expression can also be this, right?
http-request set-priority-class \
%[req.fhdr(User-Agent),51d.single(DeviceType,IsMobile,IsTablet)]

# could this work ?
use_backend high_priority if priority-class > 5
###

The set-priority-offset is for separating requests within the *same*
class into a finer order, right?

request 500 -> class -4 -> set-priority-offset 0
request 600 -> class -4 -> offset -3

request 600 will be delivered before request 500, right?


I'm also willing to write such a blog post as soon as I have understood
the feature ;-), similar to the one about the SPOE feature.
https://www.me2digital.com/blog/2017/02/15-haproxy-1-7-feature-spoe/


> Thanks!
> Willy

Best regards
Aleks
Hi Aleks,

On Sat, Aug 11, 2018 at 10:27:42AM +0200, Aleksandar Lazic wrote:
> I think it is a requeue mechanism for the connection and not a selector
> for a server or a backend, but I may be wrong.

That's it. I'll try to explain shortly. You're probably aware how the
queuing mechanism works in haproxy when maxconn is set on a server :
once maxconn is reached, requests are queued either into the server's
queue, if something forces this request to use this specific server
(cookie, ...), or otherwise into the backend's queue. Till now, our queues
were very fair (it was already shown on some blog posts from various
users over the last decade). A server would always consult both its
own queue and the backend's queue and pick the oldest request, so that
the order of arrival was always preserved.
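As a minimal sketch, such a classic setup could look like this (the
addresses, maxconn values and timeout are made up for the example):

backend app
    balance roundrobin
    # wait at most 10s for a connection slot before giving up
    timeout queue 10s
    # past 20 concurrent connections per server, requests wait in the
    # server's queue or, if not pinned to a server, in the backend's one
    server s1 192.0.2.10:8080 maxconn 20
    server s2 192.0.2.11:8080 maxconn 20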

Now with this new feature, it is possible to change the dequeuing
order when a server picks a request from the queues. There are two
criteria for this :
- the request's class
- the position in the queue for a given class.

The server will always consult classes with lowest value first. And
within this class, it will pick the oldest waiting request. The
"set-priority-class" action allows you to set the class, and mark a
request as very important or unimportant. A good rule of thumb is to
consider that rare but important requests should be given a low class
number, that common requests should be given the default class number
(unchanged) and that expensive and painful requests which are not very
urgent should be given a high class number (dequeued only once no other
request competes with them). The "set-priority-offset" action allows you to
tweak the apparent age of the request in the queue so that a server can
pick a request before another one for a given class. This is not used
to deal with the request's importance, but with the preferred response
time. For example, if your site periodically refreshes the pages but
cannot start to render before all objects are loaded, it could happen
that the browser clears the page when reloading the HTML part, then
takes ages to fetch css, js, etc, making the page unreadable for several
seconds every minute for a reader. By applying a time offset to these
extra components, you will make the mandatory elements such as CSS or JS
load much faster than the HTML one, resulting in an extra delay before
starting to fetch the HTML part (hence the page is not yet cleared), and
a shorter delay for the CSS/JS parts (making it render faster once cleared).
Similarly you could be a hosting provider willing to offer certain
guarantees on the page load time to certain customers. With this you
could easily say that those paying for premium service are guaranteed
to see a 100 ms faster load time than those paying the entry level one,
simply by applying a -100ms offset on their requests.
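In a config this could look roughly like the sketch below; the ACLs and
values are hypothetical, and the argument syntax follows the examples
discussed in this thread:

backend app
    # mandatory rendering objects jump ahead of the HTML page
    http-request set-priority-offset -1s if { path_end .css .js }
    # premium customers get served ~100ms earlier under contention
    http-request set-priority-offset -100ms if { req.cook(tier) premium }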

Some people want to use such features the other way around. Certain
sites have low importance stuff like an avatar upload button, a profile
settings page, etc, on which people "waste their time" instead of making
use of the main site, while still consuming bandwidth and processing
power. Sometimes they simply want to say that such auxiliary requests are
applied an extra delay when there's competition to reach the servers. Just
by adding an extra 1 second offset to such requests, you'll slow them down
when the servers are highly loaded, and preserve the resources for the
other requests.
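As a one-line sketch, with purely hypothetical paths:

    http-request set-priority-offset 1s if { path_beg /avatar /profile }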

It's really impressive when you run two load generators in parallel. You
will typically see, say, 100 ms response time for both of them by default.
Then you apply +100ms to the requests coming from one of them, and then
you'll see a distribution like 66/166ms with the load of the slow one
dropping as its response time increases, giving more resources to the
other one, allowing its traffic to be processed faster.
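If you want to reproduce this, the experiment boils down to a single
rule; the generator's source address here is of course an assumption:

    # slow down everything coming from the second load generator
    http-request set-priority-offset 100ms if { src 192.0.2.50 }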

I hope I helped clear your doubts.

I'm having a few comments below :

> listen|frontend test001
> bind :8080
> # some other options
> http-request set-priority-class -10s if PAY_BUTTON

The class just uses integers, not delays. Also it's an interesting case
you have here, as many sites prefer to *increase* the delay on the pay
button or even once there's anything in your cart. There's a reason :
visitors who don't find what they're looking for are likely to quit the
site, so you need to improve their experience while they're searching.
Once their shopping cart is filled with a few articles, they're not
willing to spend as much time looking for them again on another site,
so they're very likely to accept to wait (even if that scares them).
Thus by delaying their traffic you can improve others' experience. Yes
I know it's dirty, but I don't operate a shopping site :-)

> http-request set-priority-offset 5s if LOGO

That's a typically good example, yes, as nobody cares about the logo
being loaded fast.

> # as a sample expression can also be this, right?
> http-request set-priority-class \
> %[req.fhdr(User-Agent),51d.single(DeviceType,IsMobile,IsTablet)]

No, it takes a constant expression. I'm not sure whether it would
really make sense to make it support sample expressions. I'm not
necessarily against it, it's just that I think it comes with extra
cost and complexity for very low value in the end. Very likely you
can do this using 2 or 3 rules only.
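For instance, the device-type classification above could probably be
approximated with a couple of plain rules; the User-Agent substrings
below are only a rough, illustrative heuristic:

    acl is_tablet req.fhdr(User-Agent) -m sub -i Tablet iPad
    acl is_mobile req.fhdr(User-Agent) -m sub -i Mobile
    http-request set-priority-class 1 if is_tablet
    http-request set-priority-class 2 if is_mobile !is_tablet
    # desktop traffic keeps the default class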

> # could this work ?
> use_backend high_priority if priority-class > 5

We don't have a sample fetch to retrieve the value, but it would be
trivial to implement as the value is present in the stream. Feel free
to take a look at how priority-class and priority-offset work for this.
Right now you can already achieve this by setting a variable anyway, but
I can see value in using the class to take a routing decision.
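A rough sketch of the variable workaround, with a hypothetical
condition and class value:

    # remember the class we assign, since no priority-class fetch exists
    acl vip req.cook(vip) -m found
    http-request set-var(req.prio) int(6) if vip
    http-request set-priority-class 6 if vip
    use_backend high_priority if { var(req.prio) -m int gt 5 }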

Cheers,
Willy