The following is a statement received from PlusNet regarding the issues and problems they experienced last week.
Overview of network issues
On behalf of the management team at PlusNet, I would like to sincerely
apologise for the inconvenience caused by problems experienced on
our broadband network last week.
During last weekend and again later in the week, we saw intermittent
issues with one of our network suppliers. This may have caused
difficulties in accessing some Internet sites. Additionally, we experienced a
major network failure on the morning of Tuesday 7th March. This caused
loss of connectivity for customers over several hours and led to very
high call volumes to our customer service centre, which left some
customers unable to make contact with us.
Please accept our apologies and be assured that our team is working
tirelessly to protect your customer experience. Through careful planning
and investments in our network, we know that we can quickly improve on
the problems experienced this week and ensure that they do not occur
again. In the meantime, we have included below further information about
what went wrong last week, how we fixed it and what we’re doing to
minimise the chances of future related issues.
Head of Customer Services
PlusNet – The Smarter Way to Broadband
Service Details – Network issues 3rd – 10th March
Tuesday 7th March – RADIUS Problems
Early last Tuesday morning, we completed routine upgrades on our core
network and started the process of re-authenticating all our customers
onto our Broadband platform.
When we perform these types of software upgrades, we need to ensure our
entire network behaves consistently following the work. This
occasionally means we have no option but to affect all of our customers who are
connected at that time. This is a procedure that has been carried out
regularly in the past without issues.
The re-authentication process involves disconnecting all customers
connected through each of our pipes in a staggered manner, waiting for the
customers from one pipe to reconnect to the network, and then starting
the next. Connection attempts are monitored and authorised by our
RADIUS platform. The problem occurred during the re-authentication of
customers from the final BT Central pipe, which is currently operating as
part of a BT Wholesale trial and has more customers connected than all
other similar pieces of equipment on our network.
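As an illustration only, the staggered procedure described above can be
sketched as a simple loop. This is not PlusNet's tooling: the function
name, the `disconnect` and `sessions_online` callbacks, the 95%
threshold and the poll limit are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


def staggered_reauth(
    pipes: Dict[str, int],                   # pipe name -> sessions expected back (hypothetical)
    disconnect: Callable[[str], None],       # hypothetical hook: drop every session on a pipe
    sessions_online: Callable[[str], int],   # hypothetical hook: count reconnected sessions
    threshold: float = 0.95,                 # assumed "enough customers are back" fraction
    max_polls: int = 100,
    poll_interval: float = 0.0,
) -> List[str]:
    """Disconnect one pipe at a time and wait for it to recover before the next."""
    completed: List[str] = []
    for name, expected in pipes.items():
        disconnect(name)
        for _ in range(max_polls):
            if sessions_online(name) >= threshold * expected:
                completed.append(name)
                break
            time.sleep(poll_interval)
        else:
            # The pipe never recovered: stop here rather than pile more
            # reconnections onto an authentication platform in trouble.
            break
    return completed
```

The point of waiting on each pipe is that the authentication platform
only ever sees one pipe's worth of reconnections at a time, which is why
the most heavily subscribed pipe is the riskiest step.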
We operate two distinct RADIUS platforms, each one made up of numerous
servers, some of which are dedicated to different tasks (Database,
Front-end, Accounting etc). When we went to reconnect the customers on our
most heavily subscribed Central Pipe (at that time in the morning, this
was approaching 30,000 live user sessions), the authentication
front-end server ‘core dumped’ (crashed) and stopped responding to new
connection requests. Although the secondary RADIUS platform did not crash, the
extra load generated caused a severe performance issue, due to the
sheer volume of connection requests. The problem was exacerbated by the way
many customers’ routers are configured to constantly retry should the
connection attempt fail. Ultimately it was these retried attempts that
kept the platform under severe load and prevented us from successfully
bringing back the primary server. This in turn led to more failed
connection attempts, forming a loop that was difficult to break.
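The retry loop described above can be reproduced with a toy simulation.
Everything in it is an assumption made for illustration: the overload
model (a server that answers nobody once offered load passes a limit),
the numbers, and the randomised exponential backoff, which is a common
client-side mitigation and not something the statement says customers'
routers do.

```python
import random
from typing import Optional


def ticks_to_recover(clients: int, capacity: int, overload_limit: int,
                     backoff: bool, max_ticks: int = 1000,
                     seed: int = 0) -> Optional[int]:
    """Return the number of ticks until every client is online, or None if never."""
    rng = random.Random(seed)
    next_try = [0] * clients      # tick at which each client next attempts
    failures = [0] * clients      # failed attempts so far, per client
    offline = list(range(clients))
    for tick in range(max_ticks):
        attempting = [c for c in offline if next_try[c] <= tick]
        if len(attempting) <= overload_limit:
            accepted = set(attempting[:capacity])
        else:
            accepted = set()      # toy model: an overloaded server answers nobody
        offline = [c for c in offline if c not in accepted]
        if not offline:
            return tick + 1
        for c in attempting:
            if c in accepted:
                continue
            failures[c] += 1
            delay = 1             # constant retry: try again on the next tick
            if backoff:
                # Randomised exponential backoff spreads the retries out
                delay += rng.randrange(2 ** min(failures[c], 6))
            next_try[c] = tick + delay
    return None
```

With 300 clients, a capacity of 50 per tick and an overload limit of
100, the constant-retry case never recovers in this model because every
client retries on every tick, while the backoff case does once the
retries have spread out.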
In order to resolve the problem, our engineers had to re-assign
additional network resources from other areas of our network, and two new
servers were built and permanently added to the RADIUS platform. As a
result of the problem, around a third of our customers were unable to
connect to the Internet until late morning, and a handful of customers were
unable to connect until the early afternoon.
We do understand and acknowledge the obvious frustrations that this
incident caused for many people, and we are now putting extra steps in
place to ensure that an incident like this does not occur again. This will
include further backup authentication servers to improve the resiliency
of our network. Additionally, improved flow control for RADIUS
packets will ensure that the RADIUS platform itself cannot become overloaded
due to too many connection attempts at once.
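The statement does not say how the flow control will be implemented. One
common way to cap the rate of incoming requests is a token bucket, and
the sketch below assumes that technique purely for illustration; the
class name, parameters and figures are hypothetical.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may be processed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # over the limit: drop or queue the request
```

Requests refused by `allow` would be dropped or queued before reaching
the authentication servers, so a burst of reconnections slows customers
down without taking the platform itself offline.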
We do take this type of incident very seriously and work is now in
progress to implement revisions to our maintenance strategy in order to
ensure that any chance of customer-impacting problems is kept to an
absolute minimum. In this instance we did not provide the right level of
customer experience, and as well as the technical issues, we recognise
that the response from our customer support team also failed to meet our
Customer Support Centre Response
During the problems last Tuesday morning, we believe approximately
15,000 calls to our support centre were attempted. This is vastly more than
we have ever received over such a short period of time. As a result of
this very high volume, a number of these calls will have been met with
engaged tones. We had already identified this as a limitation of our
current telephone system, and can confirm that a new and much improved
system is already being installed within our Customer Support Centre.
There was also additional impact on support response times during the
following few days because of the high number of customer fault tickets
raised during the incident.
As well as reviewing our network resiliency since the event, we are now
also carrying out a review of our Service Status communication
mechanisms, including the positioning of recordings on our phone system and the
functions of our web-based service status posting tool. We are also
conscious of the need for more detailed planned maintenance notices that
give ample information about the work taking place and any possible
risks that might be involved.
Details of the intermittent issues when accessing some web sites
Since Saturday 4th March some customers have experienced intermittent
issues when trying to access some Internet sites. This has been caused
by problems at Abovenet, one of our primary network
connectivity suppliers. Their major network failure and subsequent performance
issues were initially diagnosed as a DNS issue, but later identified as
failures within the routing of multiple core routers, and it was noted
that this problem impacted general Internet routing.
The problem affected many UK and US sites, and other UK ISPs who are
customers of Abovenet also reported problems. Although we were able to
manually alter our routing tables in order to alleviate the issues caused
to some customers, the intermittent nature of the Abovenet fault has
continued to cause problems throughout the week. We are working with
Abovenet and our other Internet connectivity suppliers to ensure this is
fully resolved and that there is no potential within their network for
similar issues to recur.