Friday, 17 June 2011

Theories of Scaling

On Monday, Mike and I went to a .NET user group meeting, "The Scaling Habits of ASP.NET Applications", presented by the excellent Richard Campbell (of DotNetRocks radio, Strangeloop etc.).


I highly recommend watching a screencast of his presentation (pptx), but this post tries to summarise his key points.

He took us through the evolution of a website:


Version 1 - internal project knocked up to serve a very defined purpose:


10-50 requests/second
5-15 users
15 active sessions at peak

Problems
• Multi-user issues
• Complex input screens
• Reports

Solutions
• Fix multi-user data access
• Get user feedback


    Version 2 - rolled out to a few more teams
    • Bug fixing, UI/App rethink, user base diversifies
    50-100 requests/second
    15-50 users (including some remote)
    30 active sessions at peak

    He then reviewed what 'server failure' typically looks like:
    • Memory consumption above 80%
    • Processor consumption at 100%
    • Request queues start to grow out of hand
    • Page timeouts
    • Sessions get lost
    • People cannot access the site / lose their work!
      • When IIS is having memory issues, typically it starts to do "Emergency Garbage Collection", and clears out the cache.
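The warning signs above can be turned into a simple check against sampled counters. This is a minimal sketch in Python, not something shown in the talk: the `Sample` type, the `failure_signs` function and the rule that "growing out of hand" means the queue rose in each of the last few readings are all assumptions made for illustration. Only the 80% memory and 100% processor figures come from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One reading of the server counters (hypothetical structure)."""
    memory_pct: float      # memory in use, 0-100
    cpu_pct: float         # processor in use, 0-100
    requests_queued: int   # length of the request queue


def failure_signs(samples: List[Sample], queue_window: int = 3) -> List[str]:
    """Return the warning signs present in the most recent samples."""
    if not samples:
        return []
    signs = []
    latest = samples[-1]
    if latest.memory_pct > 80:
        signs.append("memory above 80%")
    if latest.cpu_pct >= 100:
        signs.append("processor at 100%")
    # Assumption: "growing out of hand" means the queue rose in each of
    # the last `queue_window` consecutive readings.
    recent = [s.requests_queued for s in samples[-(queue_window + 1):]]
    if len(recent) == queue_window + 1 and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    ):
        signs.append("request queue growing")
    return signs
```

Feeding it readings taken from perfmon (or any other monitor) every few seconds would flag trouble before pages start timing out, which is the point at which users begin to lose work.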

    Problems
    • Fights with IT over remote access
    • Reaching the single server limit

    Solutions
    • Dedicated web server
    • Separate shared SQL Server
    • Fix querying and page size


    Version 3 - general rollout, formal budget and IT

    300 to 1000 requests/second
    100 to 500 users
    300 active sessions at peak



    Problems
    • Bad performance is now a business expense
    • Consequences of downtime are now significant (they cost money!)

    Solutions
    • Move to multiple web servers (load balancing)
    • More bandwidth (hosted solution)
    • Get methodical, use profiling
    • Get the facts on the problem areas of the application
    • Work methodically and for the business on addressing the slowest lines of code
    • Focus on understanding what the right architecture is rather than on ad-hoc architecting
    • Let the caching begin!


    Version N - business success

    IT costs now out-weigh software development
    Shipping features takes months, or shortcuts are taken (not uncommon)
    IT and Dev process is a focus: Tech politics.

    500+ requests/second
    5000+ users
    3000+ active sessions at peak


    Problems
    • Running out of memory with inproc sessions
    • Worker process recycling
    • Cache coherency
    • Session management

    Solutions
    • Cache, cache and more cache
    • Start to think about cloud computing
    • Architecture is now hardware and software:
      • Use third-party accelerators
      • Create a performance team and focus on best practices
      • Use content routing
      • Separate and pre-generate all static resources
    • Out-of-process session managers



      More than four web servers in a cluster causes scaling issues; at that point separate load balancers are needed.
      • Now the problem is that scale and performance are intertwined
        • A new class of ‘timing’ problem shows up under load
          • And are almost impossible to reproduce outside of production
        • Caches are flushed more than expected
          • And every time it happens, performance plummets
      Summary

      • Focus on actual user performance problems?
        • What is reality?
      • Start with low hanging fruit
      • Use methodical, empirical performance improvement
      • At large scale, the network is the computer
      • Don't do things unless they need doing - it all costs money and introduces complexity.
      •  Ask your client to put the following in order of priority:
        • Accuracy
        • Reliability
        • Performance
        • Cost
      They must put them in order! For instance, if the order is as above, then you would reduce the amount of data stored in cache and increase server capacity, so that all data is shown in real time.
      • Work out the cost of downtime vs cost of reliability (99% vs 100% uptime)
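The uptime comparison in the last point is easy to put numbers on. A minimal sketch, assuming a flat hourly cost of downtime (the `cost_per_hour` figure is something you would have to estimate for your own business, and both function names are made up for this example):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_pct: float) -> float:
    """Hours of downtime per year permitted by an uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_pct) / 100


def yearly_downtime_cost(uptime_pct: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a flat hourly cost (assumption)."""
    return downtime_hours_per_year(uptime_pct) * cost_per_hour
```

At 99% uptime this allows 87.6 hours of downtime a year. Comparing `yearly_downtime_cost(99, cost_per_hour)` with the price of the extra hardware and process needed to get closer to 100% gives a rough answer to whether the extra reliability is worth paying for.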

      My Thoughts
      1. We need to work out where we are with the various applications, and what will happen when we upgrade them - I think we are at around version 3 (with some hangovers from 1 and 2) - when/will we get to version N?
      2. Start measuring and reporting more (see below for things to measure). Performance is a feature, so if you can prove that you have decreased the time to log in or whatever, then it is worth shouting about. Additionally, if you can see that times are increasing, then you can start to look into why that is.
      3. Work on providing data on how much bad performance costs. In our application, every minute that a user is waiting to log in, or for a price, or for a page to load, costs the business (or the clients) in salary and productivity. We need to start calculating the cost, to work out when it is worth improving the site's performance.
      4. Start to include performance sprints in our programs, to fix the low hanging fruit first, and to introduce performance practices in the team. Don't be obsessive, as performance coding means less maintainable coding.
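The cost of waiting mentioned in point 3 can be estimated with very simple arithmetic. This is a rough sketch only: the function name and every input (user count, waits per day, seconds per wait, hourly salary) are hypothetical and would need replacing with measured figures.

```python
def daily_wait_cost(users: int, waits_per_user_per_day: int,
                    seconds_per_wait: float, hourly_salary: float) -> float:
    """Salary cost per day of users waiting on the site.

    Assumes every second spent waiting is lost working time, which
    overstates the cost if people do something else while they wait.
    """
    hours_waiting = users * waits_per_user_per_day * seconds_per_wait / 3600
    return hours_waiting * hourly_salary


def daily_saving(users: int, waits_per_user_per_day: int,
                 seconds_before: float, seconds_after: float,
                 hourly_salary: float) -> float:
    """Daily saving from cutting the wait from seconds_before to seconds_after."""
    before = daily_wait_cost(users, waits_per_user_per_day, seconds_before, hourly_salary)
    after = daily_wait_cost(users, waits_per_user_per_day, seconds_after, hourly_salary)
    return before - after
```

For example, 500 users each waiting 20 times a day for 6 seconds is 60,000 seconds of waiting, and `daily_wait_cost(500, 20, 6, 30)` prices that at about 500 per day for an hourly salary of 30. Setting the saving from a proposed fix against the cost of a performance sprint shows when the sprint pays for itself.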

      Further Info
      When Network guys and Development guys get together to fix a scaling/performance issue, it can very easily turn into a "It's the code"/"It's the server" argument. This can be fixed with a face to face meeting over pizza.

      At that point, the development team need to go through in much more detail things like web.config, webservices, and any other stuff that gives the Network guys a fuller overview of how the system is put together (then they can see how easy it is for us to point the code at a different DB server, or that we are bypassing AD security for forms authentication). Dev guys also need to explain where the bottlenecks in the system are.

      The network team needs to bring a network diagram to give a better understanding of how the user accesses the developers' site, and explain the redundancy model. It would also be helpful if development was given access to production log files (or sent them, or shown where the backups are). He mentioned that Webtrends is good for analysing these files.

      Session Management
      Session Management solutions:

      • In-process session management - the usual .NET session management. Uses up process memory, and can get broken by emergency GC.
      • SQL session management - the next easiest to implement. More scalable, but slow, and not really what SQL is designed for.
      • Third-party out-of-process session managers - cost money, but bring scalability and performance benefits.


      Site metrics
      Metrics for measuring site capability are varied:
      • Capacity
        • Number of Users
        • Number of Active Users
        • Number of Concurrent Users  (most useful, and typically a lot lower than expected)
      • Throughput
        • Page Views Per Minute
        • Requests Per Second 
        • Transactions per second 
      • Performance
        • Loading time 
        • Time to First Byte / Time to Last Byte (TTLB more realistic - users don't care about TTFB!)
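Concurrent users is the capacity figure that is hardest to guess, but it can be derived from session start and end times pulled out of the log files. A minimal sketch, assuming the sessions have already been extracted as `(start, end)` timestamp pairs; the function name and the input format are assumptions for illustration, not something from the talk:

```python
from typing import Iterable, Tuple


def peak_concurrent(sessions: Iterable[Tuple[float, float]]) -> int:
    """Peak number of sessions open at the same moment.

    Each session is a (start, end) pair of timestamps with start <= end.
    """
    events = []
    for start, end in sessions:
        events.append((start, 1))    # session opens
        events.append((end, -1))     # session closes
    # Sort by time; at equal times closes (-1) sort before opens (+1),
    # so a session ending at t does not overlap one starting at t.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

With four sessions such as `[(0, 10), (5, 15), (12, 20), (30, 40)]` the peak is only 2, which illustrates why the concurrent figure tends to be much lower than the total number of users.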

      Use perfmon etc. to measure:
      • Requests Per Second
      • Requests Queued (typically see this with SQL calls when the Garbage Collector is in overdrive)
      • CPU
      • Global .NET heap


      If a performance boost is required, before immediately going for the server-side code, it is worth working through the following formula for the problem page (it's not as hard as it looks!):

      R ≈ Payload / Bandwidth + (AppTurns × RTT) / Concurrent Requests + Cs + Cc

      R - Response time (in seconds)
      Payload - Total number of bytes being transmitted
      Bandwidth - The transfer rate available
      RTT - Round Trip Time
      AppTurns - Number of requests that make up the web page
      Concurrent Requests - How many requests will be run simultaneously to build the page
      Cs - Compute time on the server
      Cc - Compute time on the client

      (see http://www.campbellassociates.ca/blog/PermaLink,guid,849dac76-8899-424b-b514-e29ed93e0b21.aspx for more information on this bit)


      What this equation may show is that the server compute time (Cs) is only a small percentage of the overall response time. It may be that reducing the number of .js, .css and image files on the page, to reduce the AppTurns, would provide an easier win in the first instance.
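To see how the terms compare for a given page, the calculation can be sketched in a few lines of Python. This assumes the commonly cited form of the response-time formula, R ≈ Payload / Bandwidth + (AppTurns × RTT) / Concurrent Requests + Cs + Cc, using the terms defined above. The function name and all the example figures are made up for illustration.

```python
from typing import Dict, Tuple


def response_time(payload_bytes: float, bandwidth_bytes_per_s: float,
                  rtt_s: float, app_turns: int, concurrent_requests: int,
                  cs: float, cc: float) -> Tuple[float, Dict[str, float]]:
    """Estimate response time R in seconds and its per-term breakdown."""
    parts = {
        # Time to push the bytes down the wire
        "transfer": payload_bytes / bandwidth_bytes_per_s,
        # Time lost to round trips, reduced by requests running in parallel
        "turns": app_turns * rtt_s / concurrent_requests,
        "server": cs,   # compute time on the server (Cs)
        "client": cc,   # compute time on the client (Cc)
    }
    return sum(parts.values()), parts


# Hypothetical page: 500 KB over 250 KB/s, 40 requests, 100 ms RTT,
# 2 concurrent requests, 0.5 s each of server and client compute.
total, parts = response_time(500_000, 250_000, 0.1, 40, 2, 0.5, 0.5)
server_share = parts["server"] / total
```

For these made-up numbers the page takes about 5 seconds, of which server compute is 0.5 seconds (10%). Halving the AppTurns from 40 to 20 saves a full second, whereas removing the server compute time entirely would only save half a second.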













