Friday 17 June 2011

Theories of Scaling

Mike and I went to a .NET user group meeting on Monday called "The Scaling Habits of ASP.NET Applications", given by the excellent Richard Campbell (of DotNetRocks, Strangeloop etc.).


I highly recommend watching a screencast of his presentation (pptx), but this post tries to summarise his key points.

He took us through the evolution of a website:


Version 1 - internal project knocked up to serve a very defined purpose:


10-50 requests/second
5-15 users
15 active sessions at peak

Problems and solutions:
• Multi-user issues → fix multi-user data access
• Complex input screens → get user feedback
• Reports

    Version 2 - rolled out to a few more teams
    • Bug fixing, UI/App rethink, user base diversifies
    50-100 requests/second
    15-50 users (including some remote)
    30 active sessions at peak

    Reviewed what 'server failure' typically looks like 
    • Memory consumption above 80%
    • Processor consumption at 100%
    • Request queues start to grow out of hand
    • Page timeouts
    • Sessions get lost
    • People cannot access the site / lose their work!
      • When IIS is having memory issues, it typically performs an "emergency garbage collection" and clears out the cache.

    Problems and solutions:
    • Fights with IT over remote access → dedicated web server
    • Reaching the single-server limit → separate shared SQL Server; fix querying and page size


    Version 3 - general rollout, formal budget and IT

    300 to 1000 requests/second
    100 to 500 users
    300 active sessions at peak



    Problems and solutions:
    • Bad performance is now a business expense → move to multiple web servers (load balancing)
    • Consequences of downtime are now significant (they cost money!) → more bandwidth (hosted solution)

    Also:
    • Get methodical, use profiling
    • Get the facts on the problem areas of the application
    • Work methodically, and for the business, on addressing the slowest lines of code
    • Focus on understanding what the right architecture is, rather than on ad-hoc architecting
    • Let the caching begin!
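As a concrete illustration of that last point (my own sketch, not from the talk), the simplest place to start caching in ASP.NET is the page-level OutputCache directive, which caches the rendered HTML on the server for a fixed period:

```
<%-- Cache the rendered output of this page for 60 seconds and serve the
     same copy to every user. VaryByParam="none" means the cache entry is
     shared regardless of query-string values; the numbers here are just
     example settings. --%>
<%@ OutputCache Duration="60" VaryByParam="none" %>
```

This goes at the top of an .aspx page; for anything user-specific you would need VaryByParam (or data-level caching) instead, so treat this as the starting point rather than the whole answer.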


    Version N - business success

    IT costs now outweigh software development costs
    Shipping features takes months, or shortcuts are taken (not uncommon)
    IT and dev process is a focus: tech politics.

    500+ requests/second

    5000+ users
    3000+ active sessions at peak


    Problems and solutions:
    • Running out of memory with in-proc sessions → cache, cache and more cache
    • Worker process recycling → start to think about cloud computing
    • Cache coherency → architecture is now hardware and software:
      • Use third-party accelerators
      • Create a performance team and focus on best practices
      • Use content routing
      • Separate and pre-generate all static resources
    • Session management → out-of-process session managers



      More than four web servers in a cluster causes scaling issues - separate load balancers are needed.
      • Now the problem is that scale and performance are intertwined
        • A new class of ‘timing’ problem shows up under load
          • And are almost impossible to reproduce outside of production
        • Caches are flushed more than expected
          • And every time it happens, performance plummets
      Summary

      • Focus on actual user performance problems
        • What is reality?
      • Start with low hanging fruit
      • Use methodical, empirical performance improvement
      • At large scale, the network is the computer
      • Don't do things unless they need doing - it all costs money and introduces complexity.
      • Ask your client to put the following in order of priority:
        • Accuracy
        • Reliability
        • Performance
        • Cost
      They must put them in order! For instance, if the order is as above, then you would reduce the amount of data stored in cache and increase server capacity, so that all data is shown in real time.
      • Work out the cost of downtime vs cost of reliability (99% vs 100% uptime)
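That downtime-versus-reliability trade-off is easy to put into numbers. A minimal sketch, where every figure (hourly cost of downtime, infrastructure costs per uptime level) is an invented assumption purely for illustration:

```python
# Compare the yearly cost of downtime at different uptime levels against the
# cost of the infrastructure needed to achieve them.
# All monetary figures are illustrative assumptions, not from the talk.
HOURS_PER_YEAR = 24 * 365

def downtime_hours(uptime_pct):
    """Hours of downtime per year at a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

cost_per_downtime_hour = 1000  # assumed business cost of an hour offline

for uptime, infra_cost in [(99.0, 10_000), (99.9, 50_000), (99.99, 250_000)]:
    hours = downtime_hours(uptime)
    total = hours * cost_per_downtime_hour + infra_cost
    print(f"{uptime}% uptime: {hours:.1f}h down/year, total cost {total:,.0f}")
```

The point of running numbers like these is that "more nines" is not automatically the right answer: past some uptime level, the extra infrastructure costs more than the downtime it prevents.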

      My Thoughts
      1. We need to work out where we are with the various applications, and what will happen when we upgrade them. I think we are at around version 3 (with some hangovers from 1 and 2) - when, if ever, will we get to version N?
      2. Start measuring and reporting more (see below for things to measure). Performance is a feature, so if you can prove that you have decreased the time to log in, or whatever, then it is worth shouting about. Additionally, if you can see that times are increasing, then you can start to look into why that is.
      3. Work on providing data on how much bad performance costs. In our application, every minute that a user spends waiting to log in, or for a price, or for a page to load, costs the business (or the clients) in salary and productivity. We need to start calculating that cost, to work out when it is worth improving the site's performance.
      4. Start to include performance sprints in our programmes, to fix the low-hanging fruit first, and to introduce performance practices in the team. Don't be obsessive, as performance coding means less maintainable coding.
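Point 3 above is simple to sketch in code. Every input value below (user count, wait frequency, salary rate) is an invented example, not real data from our applications:

```python
# Rough yearly cost of time users spend waiting on the application.
# All input figures are assumed example values for illustration only.
users = 300                # active users of the application
waits_per_day = 20         # slow operations each user hits per day
avg_wait_seconds = 8       # average wait per slow operation
working_days = 230         # working days per year
hourly_salary = 25.0       # assumed fully-loaded cost per user-hour

wasted_hours = users * waits_per_day * avg_wait_seconds * working_days / 3600
yearly_cost = wasted_hours * hourly_salary
print(f"~{wasted_hours:,.0f} user-hours lost per year, "
      f"costing ~{yearly_cost:,.0f} in salary")
```

Even with modest assumptions the figure comes out in the thousands of user-hours, which is the kind of number that makes a business case for a performance sprint.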

      Further Info
      When network guys and development guys get together to fix a scaling/performance issue, it can very easily turn into an "it's the code"/"it's the server" argument. This can be fixed with a face-to-face meeting over pizza.

      At that point, the development team needs to go through, in much more detail, things like web.config, web services, and anything else that gives the network guys a fuller overview of how the system is put together (then they can see how easy it is for us to point the code at a different DB server, or that we are bypassing AD security for forms authentication). The dev guys also need to explain where the bottlenecks in the system are.

      The network team needs to bring a network diagram to give a better understanding of how users access the developers' site, and to explain the redundancy model. It would also be helpful if development were given access to production log files (or sent them, or shown where the backups are). He mentioned that Webtrends is good for analysing these files.

      Session Management
      Session Management solutions:

      • In-process session management - the usual .NET session management. Uses up process memory, and can get broken by an emergency GC.
      • SQL session management - the next easiest to implement. More scalable, but slow, and not really what SQL Server is designed for.
      • Third-party out-of-process session managers - cost money, but bring scalability and performance benefits.
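For reference, the first two options (and the built-in state server, which is the simplest out-of-process choice) map onto the sessionState element in web.config. A sketch, with one mode active and the others commented out; the server names and connection strings are placeholders, not real values:

```xml
<!-- web.config: pick ONE sessionState mode. Host names and connection
     strings below are placeholder assumptions. -->
<system.web>

  <!-- 1. In-process (the default): fastest, but lost on app-pool recycle -->
  <sessionState mode="InProc" timeout="20" />

  <!-- 2. SQL Server: survives recycles, shared across web servers, slower
  <sessionState mode="SQLServer"
                sqlConnectionString="Data Source=SQLHOST;Integrated Security=SSPI"
                timeout="20" />
  -->

  <!-- 3. Out-of-process state server (third-party managers plug in similarly)
  <sessionState mode="StateServer"
                stateConnectionString="tcpip=STATEHOST:42424"
                timeout="20" />
  -->

</system.web>
```

Note that anything stored in session must be serializable before modes 2 and 3 will work, which is often the painful part of the migration.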


      Site metrics
      Metrics for measuring site capability are varied:
      • Capacity
        • Number of Users
        • Number of Active Users
        • Number of Concurrent Users  (most useful, and typically a lot lower than expected)
      • Throughput
        • Page Views Per Minute
        • Requests Per Second 
        • Transactions per second 
      • Performance
        • Loading time 
        • Time to First Byte / Time to Last Byte (TTLB more realistic - users don't care about TTFB!)

      Use perfmon etc. to measure:
      • Requests per second
      • Requests queued (typically seen with SQL calls when the garbage collector is in overdrive)
      • CPU
      • Global .NET heap
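Those counters can be turned into a crude automated health check against the 'server failure' symptoms listed earlier. A minimal sketch, assuming you already have the counter readings from perfmon; the threshold values are my own guesses, not figures from the talk:

```python
# Crude health check against the server-failure symptoms described above.
# Threshold values are illustrative assumptions.
def failure_warnings(memory_pct, cpu_pct, requests_queued, queue_limit=100):
    """Return warnings based on perfmon-style counter readings."""
    warnings = []
    if memory_pct > 80:
        warnings.append("memory consumption above 80% - emergency GC risk")
    if cpu_pct >= 100:
        warnings.append("processor pegged at 100%")
    if requests_queued > queue_limit:
        warnings.append("request queue growing out of hand")
    return warnings

print(failure_warnings(memory_pct=85, cpu_pct=60, requests_queued=10))
```

In practice you would feed this from logged counter samples and alert on trends (a queue that keeps growing) rather than single readings.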


      If a performance boost is required, before immediately going for the server-side code, it is worth carrying out the following formula on the problem page (it's not as hard as it looks!):








      R = (Payload / Bandwidth) + (AppTurns × RTT / Concurrent Requests) + Cs + Cc

      R: Response time (in seconds)
      Payload: Total number of bytes being transmitted
      Bandwidth: The transfer rate available
      RTT: Round trip time
      AppTurns: Number of requests that make up the web page
      Concurrent Requests: How many requests will be run simultaneously to build the page
      Cs: Compute time on the server
      Cc: Compute time on the client
      (see http://www.campbellassociates.ca/blog/PermaLink,guid,849dac76-8899-424b-b514-e29ed93e0b21.aspx for more information on this bit)


      What this equation may show is that the server compute time (Cs) makes up only a small percentage of the overall response time. It may be that reducing the number of .js, .css and image files on the page, to reduce the AppTurns, would provide an easier win in the first instance.
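Plugging some example numbers into the formula makes the point concrete. Every input value below is invented for illustration:

```python
# Response time model: R = Payload/Bandwidth + AppTurns*RTT/Concurrency + Cs + Cc
# All input values are made-up examples, not measurements.
def response_time(payload_bytes, bandwidth_bps, rtt, app_turns,
                  concurrent_requests, cs, cc):
    return (payload_bytes / bandwidth_bps
            + app_turns * rtt / concurrent_requests
            + cs + cc)

# A page of 500 KB over a 1 Mbps-equivalent link, 40 requests, 100 ms RTT,
# browser fetching 2 resources at a time, 200 ms server + 100 ms client compute:
base = response_time(payload_bytes=500_000, bandwidth_bps=1_000_000,
                     rtt=0.1, app_turns=40, concurrent_requests=2,
                     cs=0.2, cc=0.1)
# Same page after combining scripts/stylesheets/images down to 10 requests:
fewer_turns = response_time(500_000, 1_000_000, 0.1, 10, 2, 0.2, 0.1)
print(f"R = {base:.2f}s; with AppTurns cut to 10: {fewer_turns:.2f}s")
print(f"server compute is only {0.2 / base:.0%} of the total")
```

With these (assumed) numbers, the AppTurns term dominates and server compute is well under a tenth of the response time, which is exactly the "easier win" the paragraph above describes.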













