Stick table sync problems

Posted by Aaron van Meerten 
Aaron van Meerten
Stick table sync problems
March 16, 2017 04:30AM
Hi HAProxy List,

I’ve run into an issue with stick tables and peering that may be of interest to some of you.

I’ve got a fleet of 10 proxy servers peering with each other, fronting several backend servers. I have a very simple stick table setup which I’ve pasted examples of below. Basically I use a URL parameter to control server stickiness.

This works great, and is an amazing solution to a sticky problem for our BOSH-based XMPP messaging, as long as the stick table entries stay in sync. However, sometimes one HAProxy instance will lose one or more entries which are still present on the others.

This state can persist anywhere from minutes to hours, during which the out-of-sync server continues to receive updates on some entries but is missing others.

A restart of the server can resolve the issue by causing the table to refresh, but this is less than ideal.

When it occurs, it appears that all the other servers continue to update the “TTL” on the entry, but the errant server slowly allows the entry to expire and be removed.
I have developed a tool which pulls the stick table from each proxy and compares the entries. There’s obviously some room for expiry times to be different on each proxy, but I’d expect that entries which are regularly refreshed on all other peers should be propagated everywhere.
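
For anyone who wants to run the same kind of comparison, here is a minimal sketch of the idea (not the actual tool). It assumes each peer's dump was obtained with `show table nodes` on the stats socket and that data lines look like `0x...: key=<key> use=... exp=... server_id=...`. The function names are made up for illustration, and keys are assumed to contain no whitespace.

```python
import re

# One data line of `show table <name>` output, for example:
#   0x55d0a4: key=room42 use=0 exp=287211 server_id=10
ENTRY_RE = re.compile(r"^0x[0-9a-fA-F]+:\s+key=(?P<key>\S+)\s+(?P<fields>.*)$")


def parse_show_table(text):
    """Return {key: {field: value}} parsed from one peer's table dump."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # skips the "# table: ..." header and blank lines
        fields = dict(
            pair.split("=", 1)
            for pair in match.group("fields").split()
            if "=" in pair
        )
        entries[match.group("key")] = fields
    return entries


def missing_keys(tables):
    """Map each peer to the sorted keys that some other peer holds but it lacks."""
    all_keys = set()
    for entries in tables.values():
        all_keys.update(entries)
    return {peer: sorted(all_keys - set(entries)) for peer, entries in tables.items()}
```

Feeding it one dump per proxy and calling `missing_keys` gives, per peer, the entries that exist elsewhere in the fleet but not locally. Differences in `exp` are ignored on purpose, since expiry times are expected to drift a little between peers.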

I suspect either transient network connectivity problems between the peers or some other error, but I haven’t seen anything in the logs that seems relevant.

lsof analysis of open TCP sockets shows all peers connected on port 1024 as expected.

I wondered if this list would have any ideas on further avenues for analysis on this particular problem. I’ve seen this happen consistently on HAProxy 1.6 and 1.7 through several point releases of each. If anything it seems more frequent in 1.7.

Please let me know if you have any good ideas or if anyone has seen behavior like this before.

Thanks,

-Aaron van Meerten

Below is the example of my peer and stick table configuration, extracted from a larger haproxy.cfg
If there’s more info that’d help track this down, I’m happy to provide it.


peers mypeers
    peer hcv-chaos-haproxy-13056 XX.XX.130.56:1024
    peer hcv-chaos-haproxy-230228 XX.XX.230.228:1024
    peer hcv-chaos-haproxy-35147 XX.XX.35.147:1024
    peer hcv-chaos-haproxy-10660 10.186.3.137:1024
    peer hcv-chaos-haproxy-9682 XX.XX.96.82:1024
    peer hcv-chaos-haproxy-239179 XX.XX.239.179:1024
    peer hcv-chaos-haproxy-246171 XX.XX.246.171:1024
    peer hcv-chaos-haproxy-68128 XX.XX.68.128:1024
    peer hcv-chaos-haproxy-151101 XX.XX.151.101:1024
    peer hcv-chaos-haproxy-207217 XX.XX.207.217:1024


backend nodes
    redirect scheme https if !{ ssl_fc }

    # make sure we send the client's ip
    option forwardfor

    balance url_param room
    hash-type consistent
    stick-table type string len 128 size 20k peers mypeers expire 5m
    stick on url_param(room) table nodes

    # example server
    server chaos-us-east-1a-s0 XX.XX.XX.XXX:443 id 10 ssl verify none check port 8888 inter 5s fastinter 1s fall 2 rise 30
Aaron van Meerten
Re: Stick table sync problems
March 28, 2017 10:10PM
Hi HAProxy List,

I posted the following approximately 2 weeks ago and was hoping that someone else might have experienced these inconsistencies within the stick tables between peers. It seems to be an issue even in the latest release (HAProxy 1.7.4). I hope to get some guidance on what information I could collect which might be of interest to the developers or the community.

Would a tcpdump of the chatter between peers (on TCP port 1024) be of use? I cannot always predict when the stick table corruption will occur, but I can try to collect some data about the traffic between the peers once the corruption has happened.

Or is there anything else I could be doing to increase the logging with relation to peer connections and stick table updates? At the moment I don’t see anything in the HAProxy logs related to this feature.

Thanks again for this amazing product. I’m still a very happy user!

Cheers,

-Aaron


Emeric Brun
Re: Stick table sync problems
March 29, 2017 11:40AM
Hi Aaron,

On 03/28/2017 10:03 PM, Aaron van Meerten wrote:
> Hi HAProxy List,
>
> I posted the following approximately 2 weeks ago and was hoping that someone else might have experienced these inconsistencies within the stick tables between peers. It seems to be an issue even in the latest release (HAProxy 1.7.4). I hope to get some guidance on what information I could collect which might be of interest to the developers or the community.
>
> Would a tcpdump of the chatter between peers (on TCP port 1024) be of use? I cannot always predict when the stick table corruption will occur, but I can try to collect some data about the traffic between the peers once the corruption has happened.
>
> Or is there anything else I could be doing to increase the logging with relation to peer connections and stick table updates? At the moment I don’t see anything in the HAProxy logs related to this feature.
>
> Thanks again for this amazing, product, I’m still a very happy user!
>
> Cheers,
>
> -Aaron
>
>
>> On Mar 15, 2017, at 22:22, Aaron van Meerten <[email protected]> wrote:
>>
>> Hi HAProxy List,
>>
>> I’ve run into an issue with the stick tables/peering issue that may be of interest to some of you.
>>
>> I’ve got a fleet of 10 proxy servers peering with each other, fronting several backend servers. I have a very simple stick table setup which I’ve pasted examples of below. Basically I use a URL parameter to control server stickiness.
>>
>> This works great, and is an amazing solution to a sticky problem for our BOSH-based XMPP messaging, as long as the stick table entries stay in sync. However, sometimes one HAProxy instance will lose one or more entries which are still present on the others.

Is it always the same instance?

>> This state persists between minutes and hours, in which the out-of-sync server continues to receive updates on some entries but is missing others.

In the peers protocol, each peer is responsible for pushing its own local updates to the other peers. A peer won't 'forward' updates it received from another peer (except in response to a startup resync request).

So you could end up in this situation if communication failed between just two peers: the affected peer would still learn updates from every peer except one.
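
To make that concrete, here is a toy model of the push-only behaviour described above. It is only an illustration under simplified assumptions, not HAProxy's implementation: the `simulate` function, the peer names A/B/C and the `broken_links` set are all hypothetical, and expiry and resync are left out.

```python
def simulate(peers, broken_links, local_updates):
    """Model peers that push only their own local updates, directly to each other peer.

    broken_links: set of (sender, receiver) pairs whose session delivers nothing.
    local_updates: list of (origin_peer, key) pairs, applied in order.
    Returns {peer: set of keys present in that peer's table}.
    """
    tables = {peer: set() for peer in peers}
    for origin, key in local_updates:
        tables[origin].add(key)  # the origin always stores its own update
        for receiver in peers:
            if receiver == origin or (origin, receiver) in broken_links:
                continue
            tables[receiver].add(key)  # direct push only; receivers never relay it
    return tables


peers = ["A", "B", "C"]
tables = simulate(
    peers,
    broken_links={("A", "C")},  # only the A -> C session is failing
    local_updates=[("A", "room1"), ("B", "room2")],
)
# C still learns room2 from B, but never sees room1, because only A pushes room1.
```

With a single failing session, C keeps receiving some updates while silently missing others, which matches the symptom reported: partially out of sync rather than fully disconnected.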

>> A restart of the server can resolve the issue by causing the table to refresh, but this is less than ideal.

On restart, the node requests a resync from any available peer.
>>
>> When it occurs, it appears that all the other servers continue to update the “TTL” on the entry, but the errant server slowly allows the entry to expire and be removed.
>> I have developed a tool which pulls the stick table from each proxy and compares the entries. There’s obviously some room for expiry times to be different on each proxy, but I’d expect that entries which are regularly refreshed on all other peers should be propagated everywhere.
>>
>> I suspect somehow either ephemeral network connectivity between the peers or some other error, but I haven’t seen anything in the logs that seem relevant.
>>
>> lsof analysis of open TCP sockets shows all peers connected on 1024 as expected.
>>

When you are facing the issue, could you run a tcpdump between this instance and ALL the other peers, to check whether they are exchanging any data?
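
One possible way to set up that capture is to build a single tcpdump invocation covering every peer. The sketch below only assembles the command. The helper name, the output file name and the example addresses (documentation-range IPs) are assumptions for illustration, and port 1024 is taken from the posted peers section.

```python
import shlex


def tcpdump_command(peer_ips, port=1024, interface="any", outfile="peers.pcap"):
    """Build a tcpdump argument list capturing peer traffic to and from all given hosts."""
    hosts = " or ".join(f"host {ip}" for ip in peer_ips)
    capture_filter = f"tcp port {port} and ({hosts})"
    # -i: capture interface, -n: no name resolution, -w: write raw packets to a file
    return ["tcpdump", "-i", interface, "-n", "-w", outfile, capture_filter]


# Hypothetical peer addresses; substitute the real ones from the peers section.
command = tcpdump_command(["192.0.2.10", "192.0.2.11"])
printable = shlex.join(command)  # shell-quoted form, convenient for copy and paste
```

Running the resulting command on the out-of-sync instance (typically as root) should show whether any peer session is idle while the others keep exchanging data.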

>> I wondered if this list would have any ideas on further avenues for analysis on this particular problem. I’ve seen this happen consistently on HAProxy 1.6 and 1.7 through several point releases of each. If anything it seems more frequent in 1.7.
>>
>> Please let me know if you have any good ideas or if anyone has seen behavior like this before.
>>
>> Thanks,
>>
>> -Aaron van Meerten

R,
Emeric