Spam: Critical Path Learning

*Please note that the Critical Path trial has now ended, so the information below should be read in context.*

If you're reading this blog post then there's a good chance that you've already seen the Service Status announcement that's been published about the work we'll be doing on our email platform next week. For those that haven't, you can see a basic overview of the work here. For those about to continue reading, be warned: this is a fairly lengthy post and not for the faint of heart (although hopefully you'll find the information it contains useful)!

Since the Webmail Incident we've been working hard to improve our spam detection capabilities. There's been the new Manage My Mail API, the ability to turn off email to virtual domains, improved spam detection rates, and more intuitive handling of spam messages at the server level, to name but a few. Whilst these things have certainly helped, they all still tie up resources across our mail delivery platform. We routinely see problems with email delays, and more often than not they stem from the sheer amount of (often junk) email our mail servers are having to process and deliver. We've attempted ACL blocking, made a multitude of Exim configuration changes and altered/upgraded our spam/virus processing. We've been fighting with the mail platform for too long now and we're only too aware of the negative impact the ensuing problems are having on our customers.

Spam isn't going to stop. In fact, far from it: it's going to get worse. If previous years are anything to go by, then as we approach Christmas things are going to get particularly nasty. We're already seeing a significant rise in the volumes of spam reported and we absolutely must take proactive steps to avoid the worst happening.

As has been mentioned in the Planned Maintenance announcement, we're going to be re-deploying the Critical Path appliances in front of the customer mail platform next week.
This will form part of a trial that is expected to last at least three weeks if successful. In addition to re-trialling Critical Path, we're also continuing to look at alternative/additional solutions. Whilst Critical Path may well become a permanent fixture, this does not mean we are bound to using Critical Path exclusively for spam protection, nor does it deter us from the work we're doing elsewhere.

Now it's no secret that we have twice before attempted to introduce the Critical Path anti-abuse appliances in front of the customer mail platform, and on both occasions our efforts have resulted in negative repercussions for our customers. The first time we ended up losing emails and the second time we were chastised for poor advance communication and the subsequent email delays that arose. It's very important to note that the problems we encountered back then were mainly caused by the interaction between Critical Path's equipment and ours, a failure to follow procedural guidelines and a poorly defined set of roll-back criteria.

We've been working very hard over the last month alongside Critical Path's most senior technical staff and we're now confident that we have fully addressed and overcompensated for the things that bit us last time. We've very much got the customer at the centre of all of this and we'll be rolling any changes back at the first hint of any trouble.

So what exactly happened last time?

OK, it makes sense at this point to elaborate on what caused the problems last time. This will help you understand what we've done to safeguard against similar things happening again. The main problems with the previous implementations can be summarised as follows:

  • The PlusNet Mail servers, tuned for Internet access, were tar pitting the Critical Path servers.
  • Emails our servers were detecting as spam were being bounced back to the Critical Path boxes. These messages began queuing on the Critical Path appliances which made it very hard to diagnose issues as they were reported.
  • The PlusNet mail servers were incorrectly handling connection limiting between ourselves and authorised hosts; i.e. the Critical Path devices.
  • The Critical Path server, when presented with a large number of available connections, did not scale out sideways as well as expected.
  • Even though we were pipelining emails between the servers, whenever the PlusNet servers detected an email as spam and returned a 550, the Critical Path machine tore down the connection, which then took several seconds to re-establish.
  • We made a change on-the-fly to address the spam rejection issues that resulted in customers' emails getting inadvertently deleted.
Critical Path were on site during the last trial and they saw the pain that was born from the problems that were encountered. They left that day with a conviction to help us resolve what had gone wrong, and as has already been mentioned we've been working closely alongside their most senior platform architects ever since.

How are we going to make sure it doesn't happen again?

We've been careful to ensure that all of the above points have been addressed as follows:
  • Made some configuration changes to optimise the handling of connections by both the load balancers and the mail delivery servers.
  • Fixed the rejection of spam messages by handing these off to isolated relay servers that will manage the failed delivery reports, allow us to monitor the queue more carefully and more importantly keep it separate from the Critical Path boxes.
  • Configured Critical Path as an authorised host to prevent the tar pitting problems.
  • Tested all fixes using one of the Critical Path appliances and a single mx.core mail delivery server in an isolated environment.
  • Prepared a full roll-out plan detailing decision points and criteria to influence the decision to roll-back.
  • Reinforced a strict change control policy preventing unplanned remedial work from being carried out on the platform. A roll-back will be favoured in this situation.
The above changes have been tested by both ourselves and Critical Path and both parties are confident that the issues have been resolved.

Last week we also performed a full stress-test on a single sunmxcore mail server in an isolated environment. During this test 750,000 emails were successfully processed during a three hour period. None of the aforementioned issues were encountered. On average a single sunmxcore server in its present state will process approximately 1.2 million emails a day. If you consider what we achieved during the above test then you should have an idea as to why we're so eager for this to work.

During testing, we also managed to max the CPU on the sunmxcore (there was still plenty of processing potential remaining on the Critical Path appliance). We managed 240 concurrent connections. We only managed 8 the last time we implemented these changes, so this is a good indication that there are no longer issues feeding messages from the CP appliances to our platform.

The roll-out

The roll-out is currently scheduled for Tuesday next week (30th October) and will last for several days dependent on whether or not certain success criteria are met. We will start by replacing one mx.core with a Critical Path device. All traffic from this device will be routed to the removed mx.core server which will then handle the final delivery.

After the first server goes live the platform will be closely monitored. Graphs showing the latency and queues on the Critical Path devices alongside the queues on the sunmxcores will be made available to customers via an isolated portal page that will be visible here following the roll-out.

If all success criteria are met and no problems are encountered then we will introduce a second server on Wednesday, a third server on Thursday and a fourth on Friday. Once we have reached this point, a decision will be made regarding our deployment to the remaining servers the following week (there are 22 servers in total).
No more servers will be added over the weekend and there will be a dedicated resource monitoring the platform throughout this time. There will be a Critical Path employee on site throughout the trial, and we will also be in contact with a further two senior engineers based in Germany and Ireland.

Roll-back

A decision to roll back will be made should any of the following criteria be met:
  • The average latency for an email within the Critical Path appliances is greater than 1 minute over a ten minute period.
  • The pending queue on the Critical Path appliance is greater than 10,000, and increases by more than 1,000 every 5 minutes.
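For anyone who likes to see this sort of thing spelled out, the two criteria above could be checked along the following lines. This is only a rough sketch and not the monitoring we actually run: the `Sample` type, the function names and the idea of taking one reading per minute are all assumptions made for illustration, and averaging the per-minute latency readings is just one way of interpreting "average latency over a ten minute period".

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Sample:
    """One hypothetical per-minute reading from a Critical Path appliance."""
    latency_seconds: float  # average time an email spent in the appliance
    pending_queue: int      # emails waiting in the pending queue


def latency_breached(samples: List[Sample],
                     window_minutes: int = 10,
                     limit_seconds: float = 60.0) -> bool:
    """Average latency greater than 1 minute over a ten minute period."""
    if len(samples) < window_minutes:
        return False  # not enough history to judge yet
    recent = samples[-window_minutes:]
    return mean(s.latency_seconds for s in recent) > limit_seconds


def queue_breached(samples: List[Sample],
                   interval_minutes: int = 5,
                   queue_limit: int = 10_000,
                   growth_limit: int = 1_000) -> bool:
    """Queue above 10,000 AND grown by more than 1,000 in the last 5 minutes."""
    if len(samples) <= interval_minutes:
        return False  # need a reading from 5 minutes ago to compare against
    now = samples[-1].pending_queue
    earlier = samples[-1 - interval_minutes].pending_queue
    return now > queue_limit and (now - earlier) > growth_limit


def should_roll_back(samples: List[Sample]) -> bool:
    """Either criterion on its own is enough to trigger the decision."""
    return latency_breached(samples) or queue_breached(samples)
```

Note that under this reading a large but stable queue does not trip the second criterion by itself; it is the combination of size and continued growth that counts.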
The proposed maintenance work that will be carried out should any of these conditions be met is as follows:
  • Drop the maximum number of connections to the MAA through the load balancer by increments of 50. If this gets as low as 200, and the problem still exists after 20 minutes then a full roll-back will be initiated.
  • Before rolling back drop to 100 connections.*
  • Drop, or rise by an increment of 50 after 5 minutes.*
  • Effect complete roll-back after 15 minutes
  • Configure the Critical Path boxes to drain via the mx.last servers to ensure any queues are dissipated as quickly as possible.
* These steps are to allow for the collation of statistics for post roll-out analysis.

The values above are not arbitrary, as it took just one hour for a single Critical Path appliance to accumulate a queue of 100,000 emails the last time we rolled it to the live platform. By taking such a cautious, staged approach we're hoping to protect customers.

What will the Critical Path boxes do?

There are a number of things the Critical Path boxes will do once they are live in front of the mail delivery servers:
  • Perform sender-verify checks. Sender-verify is currently the duty of the sunmxcores. It involves checking the envelope sender address of each email that is received and ensuring that there are valid mail exchanger records associated with that domain. If none are found then the email is rejected. Having Critical Path deal with this aspect of the mail transaction takes the load of all those DNS lookups off our mail servers; it has long been suspected that these lookups cause the occasional email delay.
  • Any email that the Critical Path boxes identify as spam will be handled in accordance with customers' existing anti-spam preferences. It will be either deleted at source, tagged as [-SPAM-] and delivered to customers' mailboxes, tagged as [-SPAM-] and delivered to customers' 'Spam' folders or not tagged at all. Customers can check their anti-spam preferences using the Manage My Mail tool found in the Member Centre.
  • The headers of the email will show that it has come via the Critical Path appliance.
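To make the first two points a little more concrete, here's a minimal sketch of the logic involved. It is not how the Critical Path appliances (or our Exim servers) actually implement any of this. The `lookup_mx` resolver is a hypothetical function you would supply yourself, the `SpamPreference` names are made up to mirror the four options described above, and putting the [-SPAM-] tag at the front of the subject line is an assumption.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple

SPAM_TAG = "[-SPAM-]"


class SpamPreference(Enum):
    """Hypothetical names for the four anti-spam options described above."""
    DELETE = "delete at source"
    TAG_TO_INBOX = "tag and deliver to mailbox"
    TAG_TO_SPAM_FOLDER = "tag and deliver to Spam folder"
    NO_TAG = "do not tag"


def sender_verify(envelope_sender: str,
                  lookup_mx: Callable[[str], List[str]]) -> bool:
    """Accept the sender only if its domain has at least one MX record.

    lookup_mx is an assumed resolver: it takes a domain name and returns
    the list of mail exchanger hostnames (empty if there are none).
    The empty bounce sender is not covered by the post, so this sketch
    simply treats any address without a domain part as unverifiable.
    """
    _, at_sign, domain = envelope_sender.rpartition("@")
    if not at_sign or not domain:
        return False
    return len(lookup_mx(domain.lower())) > 0


def handle_spam(subject: str,
                preference: SpamPreference) -> Optional[Tuple[str, str]]:
    """Return (folder, subject) for delivery, or None if deleted at source."""
    if preference is SpamPreference.DELETE:
        return None
    if preference is SpamPreference.TAG_TO_INBOX:
        return ("Inbox", f"{SPAM_TAG} {subject}")
    if preference is SpamPreference.TAG_TO_SPAM_FOLDER:
        return ("Spam", f"{SPAM_TAG} {subject}")
    return ("Inbox", subject)  # NO_TAG: deliver untouched
```

Passing the resolver in as a parameter keeps the sketch free of any particular DNS library, and it also shows where the cost lies: every inbound message triggers at least one lookup, which is the work being moved off the sunmxcores.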
Risks?

There are two risks associated with this work that are worth mentioning. These are what we based our roll-back criteria on and are the reason we've allowed for tweaking of the connection limit in the load balancer as part of the test plan.
  • During the stress-testing we were deleting messages as opposed to delivering them. This removes the load that the mx.core platform would normally come under, so the test differs from how things would be in the live environment. We did this to push the servers to their maximum and allow the CPU to hit 100%. It's worth noting that we would only expect there to be 40% of the messages we processed during the test when in the live environment.
  • The way the load balancers work is to look for a sunmxcore server (there are 22) with available sessions. The Critical Path devices are much better equipped to handle incoming connections so there is a risk that the appliance may get swamped with connection requests from the load balancers - It's at this point that we would start tweaking the connection limit in the load balancers.
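To put the stress-test figures into perspective, the numbers quoted in this post can be lined up with a bit of simple arithmetic. Treat this as a back-of-the-envelope sketch only: it assumes the three-hour test rate could be sustained around the clock, and it reads the 40% figure as a fraction of the test rate, which is just one interpretation. The function and key names are made up for illustration.

```python
def throughput_summary() -> dict:
    """Compare the stress-test rate with a sunmxcore's present daily average."""
    test_messages = 750_000       # emails processed during the stress test
    test_hours = 3                # length of the stress test
    current_daily = 1_200_000     # approximate present daily average per server
    live_percent = 40             # expected live volume relative to the test

    test_hourly = test_messages // test_hours          # rate seen in the test
    test_daily_equivalent = test_hourly * 24           # ASSUMES a sustained rate
    current_hourly = current_daily // 24               # present average rate
    speed_up = test_hourly / current_hourly            # test rate vs present rate
    live_daily = test_daily_equivalent * live_percent // 100

    return {
        "test_hourly": test_hourly,
        "test_daily_equivalent": test_daily_equivalent,
        "current_hourly": current_hourly,
        "speed_up": speed_up,
        "live_daily_at_40_percent": live_daily,
    }


if __name__ == "__main__":
    for name, value in throughput_summary().items():
        print(f"{name}: {value:,}")
```

On these assumptions the test ran at 250,000 emails an hour against a present average of 50,000 an hour, and even 40% of the test rate would come to 2.4 million emails a day, twice what a single sunmxcore handles today.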
What next?

As previously mentioned, we're still exploring the possibility of using other vendors/suppliers. We've been working with a number of other third parties and hope to announce details regarding future trials before long.

We're all hoping for a successful roll-out next week and are confident we've done all we can to safeguard our customers from any potential upset. Ultimately we hope this work proves to be a large step towards overcoming the problems spam email causes us and stabilising the platform for our customers once more.

Any questions, feedback or concerns regarding this work are welcomed as always over on our Community Site discussion forums.

Regards,
Bob Pullen.
