Issue with parsing DNS from AWS

Posted by Jim Deville 
Jim Deville
Issue with parsing DNS from AWS
June 20, 2018 12:10AM
We have a setup with ECS and AWS's Service Discovery being load balanced by HAProxy in order to support sticky sessions for WebSocket handshakes, and we're working on making it more efficient by upgrading to 1.8.9 and taking advantage of seamless reloads and DNS service discovery. We have a solution almost working; however, we're seeing an issue during scaling when the DNS response crosses a certain size.


We're using the following config (anonymized): https://gist.github.com/jredville/523de951d5ab6b60a0d345516bcf46d4

What we're seeing is:
* if we bring up 3 target servers, they come up as healthy, and traffic is routed appropriately. If we restart haproxy, it comes up healthy
* if we then scale to 4 or more servers, the 4th and any additional servers are never recognized; however, the first 3 stay healthy
* if we restart haproxy with 4 or more servers, no servers come up healthy

We've attempted to modify the init-addr setting, accepted_payload_size, and check options, and we've tried with and without a server-template; this is the behavior we consistently see. If we run strace over haproxy, we see it making the DNS requests but never updating the state of the servers. At this point we're not sure if we have something wrong in our config or if there is a bug in how haproxy parses responses from AWS. Jonathan (cc'd) has pcaps if that would be helpful as well.

Thanks,
Jim
Jim Deville
Re: Issue with parsing DNS from AWS
June 21, 2018 12:40AM
Attaching an anonymized PCAP from yesterday. The first two packets are the request and response for 4 servers, the second pair is the request and response for 3. The 3-server response parses successfully, and Jonathan was able to find that the 4-server response ends up hitting here https://github.com/haproxy/haproxy/blob/master/src/dns.c#L425.


I'd be happy for any workaround or explanation of what we could do differently, and happy to help get more info, or to try out a patch in our environment to confirm a fix if this is a bug as it seems.


Jim

Attachments:
haproxy-dns-srv.pcap (1.1 KB)
Baptiste
Re: Issue with parsing DNS from AWS
June 21, 2018 05:00PM
Hi guys,

Thanks for the report and the troubleshooting already done.
Being able to reproduce the issue would help me a lot.
There are 2 options from here: either you provide the smallest terraform script
that reproduces the platform, or you give me access to a temporary
platform so I can troubleshoot live.
(We can carry on this conversation off list, of course.)

Baptiste
Baptiste
Re: Issue with parsing DNS from AWS
June 21, 2018 05:10PM
And by the way, I had a quick look at the pcap file and could not find
anything weird.
The function you're pointing to seems to say there is not enough space to store
a server's DNS name, but the allocated space is larger than your current
records.

Baptiste
Jim Deville
Re: Issue with parsing DNS from AWS
June 21, 2018 07:10PM
Thanks for the reply. We were able to extract a minimal repro to demonstrate the problem: https://github.com/jgworks/haproxy-servicediscovery



The docker folder contains a version of the config we're using and a startup script to determine the local private DNS zone (AWS puts it at the subnet's +2).
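
For readers who want to see the "+2" rule concretely, here is a minimal Python sketch. It is not the startup script from the repo; the function name and the example CIDR blocks are made up for illustration, and it only assumes, as described above, that the resolver sits at the base address of the network range plus two.

```python
import ipaddress


def base_plus_two(cidr: str) -> str:
    """Return the base address of the given CIDR block plus two.

    Hypothetical helper: it only applies the "+2" rule described in the
    post and does not query AWS in any way.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return str(network.network_address + 2)


if __name__ == "__main__":
    # Example CIDR blocks are invented for illustration.
    for cidr in ("10.0.0.0/16", "172.31.16.0/20"):
        print(cidr, "->", base_plus_two(cidr))
```

The resulting address is what would go in place of NAMESERVER on the "nameserver" line of a resolvers section.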


Jim

Jim Deville
Re: Issue with parsing DNS from AWS
June 25, 2018 11:00PM
Hi Baptiste,


I just wanted to follow up to see if you were able to repro and perhaps had a patch we could try?


Jim

Baptiste
Re: Issue with parsing DNS from AWS
July 03, 2018 11:50AM
Hi Jim,

Sorry for the long pause :)
I was dealing with some travel, conferences and catching up on my backlog.
So, the good news is that this issue is now my priority :)

I'll first try to reproduce it and get back to you if I have any issue
during that step.
(By the way, thanks for the github repo, which will help me speed up that step.)

Baptiste




Baptiste
Re: Issue with parsing DNS from AWS
July 03, 2018 01:10PM
Hi Jim,

I think I have something running...
At least, terraform did not complain and I can see "stuff" in my AWS
dashboard.
Now, I have no idea how I can connect to my running HAProxy
container, nor how I can troubleshoot what's happening :)

Any help would be (again) appreciated.

Baptiste



Baptiste
Re: Issue with parsing DNS from AWS
July 03, 2018 01:30PM
Answering myself... I found my way through the menus to open port
9000 so I can read the stats page, and to find the public IP associated with my
"app".
That said, I still can't get a shell on the running container, but I think
I found an AWS documentation page for this purpose.

I'll keep you updated.

Baptiste
Re: Issue with parsing DNS from AWS
July 03, 2018 03:30PM
Well, I can partially reproduce the issue you're facing and I can see some
weird behavior of AWS's DNS servers.

First, by default, HAProxy only supports DNS over UDP and accepts up to
512 bytes of payload in the DNS response.
DNS over TCP is not yet available, but the accepted payload size can be
increased using the EDNS0 extension.
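
To illustrate what "increasing the accepted payload size using EDNS0" means on the wire, here is a minimal Python sketch. It is not HAProxy's code: the function names, the query name and the query ID are hypothetical. It builds an SRV query and, when a payload size is given, appends the EDNS0 OPT pseudo-record whose CLASS field advertises the UDP payload size the client accepts.

```python
import struct

QTYPE_SRV = 33   # SRV record type
QCLASS_IN = 1    # Internet class
TYPE_OPT = 41    # EDNS0 OPT pseudo-record type


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending with a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"


def build_srv_query(name: str, query_id: int, payload_size: int = 0) -> bytes:
    """Build a DNS SRV query; add an EDNS0 OPT record when payload_size > 0."""
    arcount = 1 if payload_size else 0
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", QTYPE_SRV, QCLASS_IN)
    if not payload_size:
        return header + question
    # OPT record: root name, TYPE=41, CLASS=payload size, TTL=0, RDLENGTH=0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, payload_size, 0, 0)
    return header + question + opt


if __name__ == "__main__":
    plain = build_srv_query("_app._tcp.example.local", 0x1234)
    edns = build_srv_query("_app._tcp.example.local", 0x1234, payload_size=8192)
    print(len(plain), len(edns))
```

Without the OPT record, a server has to assume the classic 512-byte UDP limit; with it, the server may send a UDP response up to the advertised size.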

There is a "magic" number of SRV records with AWS and HAProxy's default
accepted payload size: at around 4 SRV records, the response payload may be
bigger than 512 bytes.
When that happens, the AWS DNS server does not return any data; it simply
returns an empty response with the TRUNCATED flag set.
In such a case, a client is supposed to replay the request over TCP...
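
As a rough illustration of how a client can recognize such a response, the sketch below parses the fixed 12-byte DNS header and checks the truncation (TC) bit. This is plain Python written for this thread, not HAProxy's parser; the function name and the sample header values are hypothetical.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit DNS header flags field


def parse_header(message: bytes) -> dict:
    """Parse the fixed 12-byte DNS header and report the truncation flag."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    query_id, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!HHHHHH", message[:12]
    )
    return {
        "id": query_id,
        "truncated": bool(flags & TC_FLAG),
        "questions": qdcount,
        "answers": ancount,
        "authority": nscount,
        "additional": arcount,
    }


if __name__ == "__main__":
    # Invented sample: a response (QR) with TC, RD and RA set and no answers.
    sample = struct.pack("!HHHHHH", 7, 0x8380, 1, 0, 0, 0)
    print(parse_header(sample))
```

A response with "truncated" set and zero answers matches the empty, truncated reply described above.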

Another magic value with AWS DNS servers is that they won't return more than
8 SRV records, even if you have 10 servers in your service (even over TCP).
AWS DNS servers will simply return a round-robin list of the records: some
will disappear, some will reappear at some point in time.


In conclusion, to make HAProxy work in such an environment, you want to configure
it this way:

resolvers awsdns
    nameserver dns0 NAMESERVER:53  # <=== please remove the double quotes
    accepted_payload_size 8192     # <=== workaround for too short accepted payload
    hold obsolete 30s              # <=== workaround for limited number of records returned by AWS

You may want to read the documentation of HAProxy's resolvers. There are a
few other timeouts / hold periods you could tune.

With the configuration above, I could easily scale from 2 to 10, back to 2,
passing through 4, 8, etc... successfully and without any server flapping.
I did not try to go higher than 10. Bear in mind the "hold obsolete" period
is the period during which HAProxy considers a server as available even if
the DNS server did not return it in the SRV record list.
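
The effect of the "hold obsolete" period described above can be sketched with a small Python model. This is not HAProxy's implementation. It assumes a simple rotation in which each response returns the next 8 names of a 10-server list, and the function name, server names and 1-second resolution interval are all made up for illustration.

```python
def simulate(servers, per_response, rounds, interval, hold_obsolete):
    """Return the smallest number of servers still considered available.

    Each round models one DNS response that lists `per_response` servers
    in a rotating window (an assumed model of the round-robin behaviour).
    A server stays available while it was last seen no more than
    `hold_obsolete` seconds ago.
    """
    last_seen = {}
    smallest = len(servers)
    for i in range(rounds):
        now = i * interval
        start = (i * per_response) % len(servers)
        response = [servers[(start + k) % len(servers)] for k in range(per_response)]
        for name in response:
            last_seen[name] = now
        active = {n for n, seen in last_seen.items() if now - seen <= hold_obsolete}
        if i > 0:  # skip the first round, when not every server has been seen yet
            smallest = min(smallest, len(active))
    return smallest


if __name__ == "__main__":
    farm = [f"srv{n}" for n in range(10)]
    print("hold obsolete 0s :", simulate(farm, 8, 20, 1.0, 0.0))
    print("hold obsolete 30s:", simulate(farm, 8, 20, 1.0, 30.0))
```

In this model, a hold period of 0 means only the servers listed in the latest response count as available, while a hold period longer than the gap between two appearances of a record keeps the whole farm in place.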

Baptiste







Baptiste
Re: Issue with parsing DNS from AWS
July 03, 2018 03:30PM
Ah yes, I also added the "init-addr none" statement on the
server-template line.
This prevents HAProxy from using libc resolvers, which might end up in
unpredictable behavior in that environment....

Baptiste

Jim Deville
Re: Issue with parsing DNS from AWS
July 05, 2018 04:00PM
Hi Baptiste,


I appreciate you taking time for this. We had tried increasing the response size, but I believe we left hold obsolete at its default, and that probably led to flapping. How often does HAProxy re-poll DNS for this? I'm curious what limits this really sets on how many servers we can scale to. Also, will DNS over TCP help at all? It seems like it would still need roughly the same settings given the round-robin responses.


In the meantime, we will look into these settings to see if we can make them work as well.


Jim

Baptiste
Re: Issue with parsing DNS from AWS
July 12, 2018 03:10PM
Hi Jim,

"hold obsolete" defaults to 0, so basically, HAProxy may evict servers
from your backend quite frequently (the bigger the farm, the greater the
chance it happens).
Furthermore, most of those changes are "false positives" (since the server
may still be healthy).
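
To put a rough number on "the bigger the farm", here is a back-of-the-envelope Python sketch. It rests on a strong assumption that is not confirmed anywhere in this thread: that each response lists a uniformly random, independent subset of 8 records. The function name is hypothetical and the real AWS rotation may behave differently.

```python
def miss_probability(total: int, returned: int = 8, consecutive: int = 1) -> float:
    """Probability that one given server is absent from `consecutive` responses.

    Assumes each response lists `returned` servers picked uniformly at
    random and independently of earlier responses (an illustrative model,
    not measured AWS behaviour).
    """
    if total <= returned:
        return 0.0
    return ((total - returned) / total) ** consecutive


if __name__ == "__main__":
    for farm_size in (8, 10, 16, 32):
        print(farm_size, miss_probability(farm_size))
```

Under this model, a given server in a 10-server farm is missing from one response in five, and the share grows with the farm size, which is why a "hold obsolete" of 0 leads to frequent evictions.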

DNS over TCP won't help.
As I stated in my previous mail, AWS DNS servers only return 8 records per
response (they are "round-robined"), even over TCP (I did try with the "drill"
DNS client).
So, your only way to go is to use the "hold obsolete" timer.


On Thu, Jul 5, 2018 at 3:49 PM, Jim Deville <[email protected]>
wrote:

> Hi Baptiste,
>
>
> I appreciate you taking time for this, we had tried increasing the
> response size, but I believe we left hold obsolete at defaults and that
> probably lead to flapping. How often does HAProxy re-poll DNS for this? I'm
> curious what limits this really sets for how many servers we can scale to
> with this. Also, will DNS over TCP help any? Seems like it still needs
> roughly the same settings given the round-robin responses.
>
>
> In the meantime, we will look into these settings to see if we can make
> them work as well.
>
>
> Jim
> ------------------------------
> *From:* Baptiste <[email protected]>
> *Sent:* Tuesday, July 3, 2018 9:20:53 AM
>
> *To:* Jim Deville
> *Cc:* haproxy@formilux.org; Jonathan Works
> *Subject:* Re: Issue with parsing DNS from AWS
>
> Ah yes, I also added the following "init-addr none" statement on the
> server-template line.
> This prevents HAProxy from using libc resolvers, which might end up in
> unpredictible behavior in that enviroment....
>
> Baptiste
>
> On Tue, Jul 3, 2018 at 3:18 PM, Baptiste <[email protected]> wrote:
>
> Well, I can partially reproduce the issue you're facing and I can see some
> weird behavior of AWS's DNS servers.
>
> First, by default, HAProxy only supports DNS over UDP and can accept up to
> 512 bytes of payload in the DNS response.
> DNS over TCP is not yet available, and the accepted payload size can be
> increased using the EDNS0 extension.
>
> There is a "magic" number of SRV records with AWS and the default HAProxy
> accepted payload size: at around 4 SRV records, the response payload may be
> bigger than 512 bytes.
> When that happens, the AWS DNS server does not return any data; it simply
> returns an empty response with the TRUNCATED flag set.
> In such a case, a client is supposed to retry the request over TCP...
>
> Another magic value with AWS DNS servers is that they won't return more
> than 8 SRV records, even if you have 10 servers in your service (even over
> TCP).
> AWS DNS servers will simply return a round-robin list of the records: some
> will disappear, some will reappear at some point in time.
>
>
> Conclusion: to make HAProxy work in such an environment, you want to
> configure it this way:
> resolvers awsdns
>     nameserver dns0 NAMESERVER:53   # <=== please remove the double quotes
>     accepted_payload_size 8192      # <=== workaround for too short accepted payload
>     hold obsolete 30s               # <=== workaround for limited number of records returned by AWS
>
> You may want to read the documentation of HAProxy's resolvers. There are a
> few other timeouts / hold periods you could tune.
>
> With the configuration above, I could easily scale from 2 to 10, back to
> 2, passing through 4, 8, etc... successfully and without any server
> flapping.
> I did not try to go higher than 10. Bear in mind the "hold obsolete"
> period is the period during which HAProxy considers a server as available
> even if the DNS server did not return it in the SRV record list.
>
> Baptiste
>
>
>
>
>
>
>
> On Tue, Jul 3, 2018 at 1:26 PM, Baptiste <[email protected]> wrote:
>
> Answering myself... I found my way through the menus to allow port
> 9000 so I can read the stats page, and to find the public IP associated
> with my "app".
> That said, I still can't get a shell on the running container, but I think
> I found an AWS documentation page for this purpose.
>
> I'll keep you updated.
>
> On Tue, Jul 3, 2018 at 1:06 PM, Baptiste <[email protected]> wrote:
>
> Hi Jim,
>
> I think I have something running...
> At least, terraform did not complain and I can see "stuff" in my AWS
> dashboard.
> Now, I have no idea how to connect to my running HAProxy
> container, nor how to troubleshoot what's happening :)
>
> Any help would be (again) appreciated.
>
> Baptiste
>
>
>
> On Tue, Jul 3, 2018 at 11:39 AM, Baptiste <[email protected]> wrote:
>
> Hi Jim,
>
> Sorry for the long pause :)
> I was dealing with some travel, conferences and catching up on my backlog.
> So the good news is that this issue is now my priority :)
>
> I'll try to first reproduce it and come back to you if I have any issue
> during that step.
> (by the way, thanks for the github repo to help me speed up in that step).
>
> Baptiste
>
>
>
>
> On Mon, Jun 25, 2018 at 10:54 PM, Jim Deville <[email protected]>
> wrote:
>
> Hi Baptiste,
>
>
> I just wanted to follow up to see if you were able to repro and perhaps
> had a patch we could try?
>
>
> Jim
> ------------------------------
> *From:* Jim Deville
> *Sent:* Thursday, June 21, 2018 1:05:49 PM
> *To:* Baptiste
> *Cc:* haproxy@formilux.org; Jonathan Works
> *Subject:* Re: Issue with parsing DNS from AWS
>
>
> Thanks for the reply, we were able to extract a minimal repro to
> demonstrate the problem: https://github.com/jgworks/haproxy-servicediscovery
>
>
> The docker folder contains a version of the config we're using and a
> startup script to determine the local private DNS zone (AWS puts it at the
> subnet's +2).
>
>
> Jim
> ------------------------------
> *From:* Baptiste <[email protected]>
> *Sent:* Thursday, June 21, 2018 11:02:26 AM
> *To:* Jim Deville
> *Cc:* haproxy@formilux.org; Jonathan Works
> *Subject:* Re: Issue with parsing DNS from AWS
>
> and by the way, I had a quick look at the pcap file and could not find
> anything weird.
> The function you're pointing to seems to say there is not enough space to
> store a server's DNS name, but the allocated space is larger than your
> current records.
>
> Baptiste
>
>
>
>
>
>
>
Jim Deville
Re: Issue with parsing DNS from AWS
July 12, 2018 04:40PM
Thanks for the update. We will see what we can do, and I appreciate your help!


Jim
