I highly recommend watching a screencast of his presentation (pptx), but this post tries to summarise his key points.
He took us through the evolution of a website:
Version 1 - internal project knocked up to serve a very defined purpose:
10-50 requests/second
5-15 users
15 active sessions at peak
| Problems | Solutions |
| --- | --- |
| Multi-user issues | Fix multi-user data access |
| Complex input screens | Get user feedback |
| Reports | |
Version 2 - rolled out to a few more teams
- Bug fixing, UI/App rethink, user base diversifies
15-50 users (including some remote)
30 active sessions at peak
He reviewed what 'server failure' typically looks like:
- Memory consumption above 80%
- Processor consumption at 100%
- Request queues start to grow out of hand
- Page timeouts
- Sessions get lost
- People cannot access the site / lose their work!
- When IIS is having memory issues, it typically starts to do "Emergency Garbage Collection" and clears out the cache.
| Problems | Solutions |
| --- | --- |
| Fights with IT over remote access | Dedicated web server |
| Reaching the single-server limit | Separate shared SQL Server |
| | Fix querying and page size |
Version 3 - general rollout, formal budget and IT
300 to 1000 requests/second
100 to 500 users
300 active sessions at peak
| Problems | Solutions |
| --- | --- |
| Bad performance is now a business expense | Move to multiple web servers (load balancing) |
| Consequences of downtime are now significant (they cost money!) | More bandwidth (hosted solution) |
| | Get methodical, use profiling |
| | Get the facts on the problem areas of the application |
| | Work methodically and for the business on addressing the slowest lines of code |
| | Focus on understanding what the right architecture is rather than on ad-hoc architecting |
| | Let the caching begin! |
Version N - business success
IT costs now outweigh software development
Shipping features takes months, or shortcuts are taken (not uncommon)
IT and Dev process is a focus: Tech politics.
500+ request/second
5000+ users
3000+ active sessions at peak
| Problems | Solutions |
| --- | --- |
| Running out of memory with in-proc sessions | Cache, cache and more cache (see the sketch below) |
| Worker process recycling | Start to think about cloud computing |
| Cache coherency | Architecture is now hardware and software: use third-party accelerators; create a performance team and focus on best practices; use content routing; separate and pre-generate all static resources; use out-of-process session managers |
| Session management | |
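As a minimal sketch of the "cache, cache and more cache" advice, here is the read-through pattern using .NET's built-in System.Runtime.Caching. The cache key scheme and the `LoadPriceFromDatabase` helper are hypothetical stand-ins for whatever expensive call your application makes:

```csharp
using System;
using System.Runtime.Caching;

static class PriceCache
{
    // MemoryCache.Default is the process-wide cache built into .NET 4+.
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static decimal GetPrice(string symbol)
    {
        string key = "price:" + symbol;  // hypothetical key scheme
        object cached = Cache.Get(key);
        if (cached != null)
            return (decimal)cached;

        // The expensive call we want to avoid repeating.
        decimal price = LoadPriceFromDatabase(symbol);

        Cache.Set(key, price, new CacheItemPolicy
        {
            // Short absolute expiry: slightly stale prices for 30s,
            // in exchange for far fewer database hits.
            AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(30)
        });
        return price;
    }

    private static decimal LoadPriceFromDatabase(string symbol)
    {
        // Placeholder for the real data access code.
        return 0m;
    }
}
```

Note that an in-process cache like this suffers the same "emergency garbage collection" flushes described earlier; on a web farm the same read-through pattern is usually pointed at a shared, out-of-process cache instead.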
More than four web servers in a cluster causes scaling issues - separate load balancers are needed.
- Now the problem is that scale and performance are intertwined
- A new class of 'timing' problem shows up under load
  - And these are almost impossible to reproduce outside of production
- Caches are flushed more than expected
  - And every time it happens, performance plummets
Summary
- Focus on actual user performance problems
  - What is reality?
- Start with low hanging fruit
- Use methodical, empirical performance improvement
- At large scale, the network is the computer
- Don't do things unless they need doing - it all costs money and introduces complexity.
- Ask your client to put the following in order of priority:
  - Accuracy
  - Reliability
  - Performance
  - Cost
- Work out the cost of downtime vs the cost of reliability (99% vs 100% uptime - 99% uptime still allows about 3.65 days of downtime a year)
My Thoughts
- We need to work out where we are with the various applications, and what will happen when we upgrade them - I think we are at around version 3 (with some hangovers from 1 and 2) - when/will we get to version N?
- Start measuring and reporting more (see below for things to measure). Performance is a feature, so if you can prove that you have decreased the time to log in (or whatever), it is worth shouting about. Additionally, if you can see that times are increasing, then you can start to look into why.
- Work on providing data on how much bad performance costs. In our application, every minute that a user waits to log in, or for a price, or for a page to load, costs the business (or the clients) in salary and productivity. We need to start calculating that cost, to work out when it is worth improving the site's performance.
- Start to include performance sprints in our programs, firstly to fix low-hanging fruit, and then to introduce performance practices in the team. Don't be obsessive, as performance-focused code tends to be less maintainable.
Further Info
When network guys and development guys get together to fix a scaling/performance issue, it can very easily turn into an "It's the code"/"It's the server" argument. This can be fixed with a face-to-face meeting over pizza.
At that point, the development team needs to go through in much more detail things like web.config, web services, and anything else that gives the network guys a fuller overview of how the system is put together (then they can see how easy it is for us to point the code at a different DB server, or that we are bypassing AD security for forms authentication). The dev guys also need to explain where the bottlenecks in the system are.
The network team needs to bring a network diagram to give a better understanding of how users reach the developers' site, and to explain the redundancy model. It would also be helpful if development were given access to the production log files (or sent them, or shown where the backups are). He mentioned that Webtrends is good for analysing these files.
Session Management
Session Management solutions:
- In-process session management - the usual .NET session management. Uses up process memory, and can get broken by emergency GC.
- SQL session management - the next easiest to implement. More scalable, but slow, and not really what SQL is designed for.
- Third-party out-of-process session managers - cost money, but bring scalability and performance benefits.
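For reference, the switch between these options lives in ASP.NET's `<sessionState>` element in web.config. A minimal sketch, assuming hypothetical server names and connection strings (a real config contains exactly one of these):

```xml
<!-- In-process (the default): fastest, but state dies with the worker process. -->
<sessionState mode="InProc" timeout="20" />

<!-- SQL Server-backed: survives recycles and is shared across a farm, but slower. -->
<sessionState mode="SQLServer"
              sqlConnectionString="Data Source=SQLBOX01;Integrated Security=SSPI"
              timeout="20" />

<!-- Out-of-process state server (or swap in a third-party/distributed provider). -->
<sessionState mode="StateServer"
              stateConnectionString="tcpip=stateserver01:42424"
              timeout="20" />
```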
Site metrics
Metrics for measuring site capability are varied:
- Capacity
  - Number of users
  - Number of active users
  - Number of concurrent users (the most useful, and typically a lot lower than expected)
- Throughput
  - Page views per minute
  - Requests per second
  - Transactions per second
- Performance
  - Loading time
  - Time to first byte / time to last byte (TTLB is more realistic - users don't care about TTFB!)
Use perfmon etc. to measure:
- Requests per second
- Requests queued (you typically see this with SQL calls when the garbage collector is in overdrive)
- CPU
- Global .NET heap
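A rough sketch of reading those same counters programmatically, assuming it runs on the web server with permission to read performance counters (these are the standard ASP.NET/CLR counter names that perfmon shows; rate counters return zero on the first sample, hence the loop):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class CounterWatcher
{
    static void Main()
    {
        // The same counters you would add in perfmon.
        var rps    = new PerformanceCounter("ASP.NET Applications", "Requests/Sec", "__Total__");
        var queued = new PerformanceCounter("ASP.NET", "Requests Queued");
        var cpu    = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        var heap   = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", "_Global_");

        while (true)
        {
            // Rate counters need two samples before NextValue() is meaningful.
            Console.WriteLine("req/s={0:F1} queued={1} cpu={2:F0}% heap={3:N0} bytes",
                rps.NextValue(), queued.NextValue(), cpu.NextValue(), heap.NextValue());
            Thread.Sleep(1000);
        }
    }
}
```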
If a performance boost is required, before immediately going for the server-side code, it is worth applying the following formula to the problem page (it's not as hard as it looks! - see http://www.campbellassociates.ca/blog/PermaLink,guid,849dac76-8899-424b-b514-e29ed93e0b21.aspx for more information on this bit):
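Reconstructed from the definitions below, the formula is approximately:

    R = Payload / Bandwidth + (AppTurns × RTT) / Concurrent Requests + Cs + Cc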
| Term | Meaning |
| --- | --- |
| R | Response time (in seconds) |
| Payload | Total number of bytes being transmitted |
| Bandwidth | The transfer rate available |
| RTT | Round-trip time |
| AppTurns | Number of requests that make up the web page |
| Concurrent Requests | How many requests will be run simultaneously to build the page |
| Cs | Compute time on the server |
| Cc | Compute time on the client |
What this equation may show is that the server compute time (Cs) accounts for only a small percentage of the overall response time. It may be that reducing the number of .js, .css and image files on the page, to reduce the AppTurns, would provide an easier win in the first instance.
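To make that concrete, here is a back-of-the-envelope sketch with entirely made-up numbers (500 KB payload on a 2 Mbit/s link, 40 requests, 50 ms RTT, two concurrent connections, 200 ms server compute, 100 ms client compute):

```csharp
using System;

class ResponseTimeEstimate
{
    static void Main()
    {
        // All figures below are invented purely for illustration.
        double payloadBits        = 500 * 1024 * 8;  // 500 KB page weight
        double bandwidthBitsPerS  = 2000000;         // 2 Mbit/s link
        double rttSeconds         = 0.05;            // 50 ms round trip
        double appTurns           = 40;              // 40 requests (.js, .css, images...)
        double concurrentRequests = 2;               // browser parallelism
        double cs                 = 0.2;             // server compute time
        double cc                 = 0.1;             // client compute time

        double r = payloadBits / bandwidthBitsPerS
                 + (appTurns * rttSeconds) / concurrentRequests
                 + cs + cc;

        Console.WriteLine("R = {0:F2}s, of which server time is {1:P0}", r, cs / r);
        // Prints roughly: R = 3.35s, of which server time is 6%
    }
}
```

With these numbers, even halving Cs would save only a tenth of a second, while cutting AppTurns attacks the full one-second term - which is why the static-resource fixes often win first.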