Friday, August 1, 2014

Lessons, Surprises, Mistakes I. - Internet Connection

This is the first of several posts about my mistakes, surprises and lessons learned. I'm always eager to read about others' experiences so I can learn from them. Here are mine so you can learn from me.

Going Live
The morning of April 1st. It kind of works. Most kiosks are online and I see signed-in cashiers and some transactions. Some kiosks are offline. I guessed internet connection problems.

It was very similar during the next couple of days. There were rare moments when all kiosks were online. Most of the time at least one kiosk was offline. We discussed the cause of the issues and it was clear that the culprit was the internet connection. The mobile internet is ***. It disconnects every now and then, it drops packets, it is sloooooow. Especially in the kiosk which is a big metal box. I had to mitigate the problems.

Status emails
The first step was to get better visibility of the problems. Having a status page with an overview of the kiosks was nice but it was not enough - nobody checked it regularly. That's why I implemented status emails. AMS offers scheduled jobs and I use them to periodically check the kiosk statuses and send a warning email listing the problematic kiosks. Initially, the check ran every 30 minutes but there were too many emails and nobody cared about them. Currently, the status is checked once per hour during opening hours and a warning is sent only when a kiosk has not pinged for more than one hour.
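A minimal sketch of the hourly check described above, not the actual scheduled job. The kiosk record shape ({ name, lastPingAt }), the function names, and the opening hours passed in are all assumptions made for illustration. Sending the email itself is left out.

```javascript
// Hypothetical kiosk record shape: { name: string, lastPingAt: Date }.
const ONE_HOUR_MS = 60 * 60 * 1000;

// True when `now` falls inside the opening hours [openHour, closeHour), local time.
function isWithinOpeningHours(now, openHour, closeHour) {
  const hour = now.getHours();
  return hour >= openHour && hour < closeHour;
}

// Returns the kiosks whose last ping is older than `thresholdMs`.
function findSilentKiosks(kiosks, now, thresholdMs = ONE_HOUR_MS) {
  return kiosks.filter(
    (kiosk) => now.getTime() - kiosk.lastPingAt.getTime() > thresholdMs
  );
}

// Builds the warning text, or returns null when no email should be sent
// (outside opening hours, or every kiosk has pinged recently).
function buildWarning(kiosks, now, openHour, closeHour) {
  if (!isWithinOpeningHours(now, openHour, closeHour)) return null;
  const silent = findSilentKiosks(kiosks, now);
  if (silent.length === 0) return null;
  const names = silent.map((kiosk) => kiosk.name).join(", ");
  return "Kiosks not pinging for more than one hour: " + names;
}
```

Returning null when there is nothing to report is what keeps the inbox quiet: no email goes out unless at least one kiosk is past the threshold.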

Automatic router restarts
We discovered that the biggest issue was the routers. They disconnect from the internet quite often. Luckily for us, the routers have a monitoring function. They can ping specified servers and if no reply comes back, they restart themselves. Once this was turned on (with a 30-minute threshold), internet reliability increased dramatically.

Inability to sign in/out
Pages in Durandal are stored in separate files. That means there has to be an internet connection to successfully navigate to another page (e.g. when starting or ending a shift). I assumed (but did not test) that the browser would cache all pages. I was wrong, and with such a bad internet connection some cashiers could not sign in or out - the screen just turned white because the target page could not be loaded.

The solution is to build the whole app into three files (index.html, index.css, and main.js) which are loaded at once during POS initialization. To do that, I use the grunt-durandal and grunt-uncss tasks when creating the release package. After that, the only communication is sending data to the server, and that data is backed up in local storage.
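Backing up outgoing data in local storage could look roughly like the sketch below. This is an illustration under assumptions, not the POS code: the key name "pos.outbox", the record shape with an id field, and the send function are all hypothetical. The storage argument only needs getItem and setItem, so window.localStorage fits in the browser, and a small in-memory stand-in is included so the sketch also runs elsewhere.

```javascript
// In-memory stand-in for window.localStorage so the sketch runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => {
      data.set(key, String(value));
    },
  };
}

const OUTBOX_KEY = "pos.outbox"; // hypothetical key name

// Reads the list of records that still wait for a server confirmation.
function readOutbox(storage) {
  const raw = storage.getItem(OUTBOX_KEY);
  return raw === null ? [] : JSON.parse(raw);
}

// Persists the record before any network attempt, so a dropped
// connection cannot lose it. Records are assumed to carry a unique `id`.
function enqueue(storage, record) {
  const outbox = readOutbox(storage);
  outbox.push(record);
  storage.setItem(OUTBOX_KEY, JSON.stringify(outbox));
}

// Tries to send every queued record in order. `send` is a hypothetical
// function returning a promise that resolves once the server confirms.
// Returns the number of records still waiting afterwards.
async function flush(storage, send) {
  for (const record of readOutbox(storage)) {
    try {
      await send(record);
    } catch (err) {
      continue; // stays in storage for the next attempt
    }
    // Re-read before writing so records queued during the await are kept.
    const rest = readOutbox(storage).filter((r) => r.id !== record.id);
    storage.setItem(OUTBOX_KEY, JSON.stringify(rest));
  }
  return readOutbox(storage).length;
}
```

A record is removed only after the server confirms it, so delivery is at-least-once: if the confirmation gets lost, the same record is sent again and the server has to cope with the repeat.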

Double transactions
Another sign of a bad internet connection is double transactions. The POS sends data, the data is written into the database, but the confirmation is lost on the way back and the POS sends the data again. I have actually seen several triple transactions in the database. To solve this issue, I added a check before inserting new data into the database.

That did not fully fix the issue though. I'm still seeing some duplicate transactions. AMS provides a createdAt column and according to this column the duplicates are usually milliseconds apart or even have the same timestamp. I have no idea how these duplicate transactions are created. It could be strange network behavior, a bug in AMS…
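One common way to make such an insert safe to repeat is to have the POS attach its own unique id to each transaction and to let the server refuse a second row with the same id. The sketch below illustrates that idea with an in-memory Map. It is an assumption made for illustration, not the check I actually added, and the names createTransactionStore and insertOnce are hypothetical. Duplicates that are only milliseconds apart would be consistent with two requests passing a separate "does it exist?" query at the same moment, although that is only a guess. With a real database, a unique constraint on the id column makes the check and the insert a single step.

```javascript
// In-memory stand-in for the transactions table, keyed by a client-generated id.
// With a real database, a unique constraint on the id column would play this role.
function createTransactionStore() {
  const rows = new Map();
  return {
    // Stores the transaction once. A repeated id is reported as a
    // duplicate and the originally stored row is returned instead.
    insertOnce(tx) {
      if (rows.has(tx.id)) {
        return { inserted: false, row: rows.get(tx.id) };
      }
      const row = { ...tx, createdAt: new Date() };
      rows.set(tx.id, row);
      return { inserted: true, row };
    },
    // Number of stored transactions.
    count() {
      return rows.size;
    },
  };
}
```

When the POS resends after a lost confirmation, it gets the already stored row back and can treat the operation as successful, so a retry never creates a second row.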

Success transaction visibility
This is a small tweak. The POS displays whether the last server operation was successful or not. It's not for cashiers because they don't care. It's mainly for company employees, so that when they check a kiosk on site they can see that everything is OK.

Summary
After implementing the above fixes, the reliability is quite good. It is rare to see an offline kiosk. There is still some work I'd like to do though. For example, I do not properly handle authentication token expiration, so somebody has to manually log the kiosks back in every ~30 days, otherwise they go offline because of missing authentication.
