Conversation
Right now we process messages with a queue, and discard any that don't match the id that we're waiting for. This means that messages can't be processed out of order, which can happen with a multithreaded architecture. I've refactored it to use a synchronized hash rather than a queue, which allows us to retrieve messages in any order.
|
@iainbeeston yea the code is tricky. let me explain why it's written like this. |
|
@iainbeeston I started to write it down but it looks like it's more complicated than I expected to find words, I think I'll come up with some well thought explanation to the protocol and current acrhitecture. Considering the issue with CI, yes I was scratching the head too why so many intermittent issues while I have none on my machine or mac. I just a few minutes ago came up with a solution as it looks like I found the real issue, though it's not concurrency issue :).
That's just another type of issue. It's specific Chromium version which for some reason doesn't crash the browser I updated Chromium to the latest version (there's link in README) and it went away. |
|
@route If there's a good reason for using the queue, then please don't feel the need to explain it! I will close this PR |
|
@iainbeeston I opened #36 let's see how it goes there. I think everyone will benefit if I explain current arch and maybe propose a better solution I don't feel confident enough. |
ISSUE
While trying to understand why there are so many intermittent build errors, I started reading the multithreading code in the gem, and I noticed some odd code in
Capybara::Cuprite::Browser::Client. This class is responsible for sending and receiving messages from the web-socket to Chrome.This is how I understand the current implementation (in master). When a message is sent to chrome, it is sent over the websocket, get an id back, then wait for a response (over the websocket) that has the same id. Messages are received asynchronously via a thread, which puts messages onto a queue. Then the queue is read by popping each item off until a message with the correct id is found, at which point the caller is returned the message. If there are no messages on the queue, the caller waits (with a timeout) until there is one, then pop it, and return it if it has the correct id.
However, I think that there are several issues with this:
In addition, the way the code is implemented is that it iterates through the queue by raising an error every time the id of the message received does not match the id of the message being waited for. The error is caught and the method re-run. The code is using errors for control flow, which is a bit of an antipattern, and is also slow in Ruby.
SOLUTION
I've re-implemented this algorithm to use a hash rather than a queue. Now, when messages are received they are put into a hash keyed by the message id. When the caller is waiting for a message with a specific id, that key is deleted from the hash and returned it if it exists, which removes the need for iteration and only the message that the caller needs is removed (leaving other messages unmodified for other callers to receive). If there is nothing in the hash with that key then the code waits. Every time a new message is received, all waiting threads are woken up (using a condition variable) and re-check for the message that they need. If nothing is found within the timeout then an error is raised.
I've made an assumption here, which is that messages always have unique ids. So far as I can tell that's true.
Unfortunately this hasn't made a big difference to the number of passing tests 😞 but I thought it was worth sharing the code anyway, because I think it should be more reliable. Also, the code I have here doesn't seem to be able to distinguish between a browser crash and a timeout due to no response from Chrome, whereas the old code could. I've tried to debug it but I'm not even sure how the old code could tell the difference! Näively I'd expect the websocket to close on a browser crash but perhaps not.
I hope this is helpful!