Changes between Initial Version and Version 1 of DiagnosingSluggishness


Timestamp:
Jun 1, 2011, 6:30:33 PM (13 years ago)
Author:
Daniel Kahn Gillmor
Comment:

initial import from http://cmrg.fifthhorseman.net/wiki/DiagnosingSluggishness

  • DiagnosingSluggishness

    v1 v1  
     1[[PageOutline]]
     2= Why is my computer so slow? =
     3
     4Many folks get frustrated with their computer taking "too long" to respond to them, or to do things that they want it to do.  While some of these problems can be fixed with better software implementations, some of them are related fundamentally to underlying resource exhaustion which no software fix can address.  Even if you think that a software fix is possible, it's good to think about what resources are at their limit (the "bottleneck") so you can focus your software development energies in the right direction.
     5
     6So how do you know where the bottleneck actually is?  Which resources might be causing problems that are noticeable to the user of a given machine?  If you're using [wiki:DiagnosingSluggishness/Windows Windows, i can't help you much].  If your operating system uses a Linux kernel, a good starting point is `vmstat`.  When invoked as:
{{{
vmstat 1 5
}}}
     10it will produce a series of 5 rows, one per second, each of which tells you a lot about the state of the system.  The first row tells you about the aggregated state of the system since it booted, and each row thereafter shows you values for the system during the last interval.  Specific details about the numbers can be found in [Man:vmstat the man page].
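
If you'd rather inspect those rows from a script than by eye, here is a minimal Python sketch.  The function names `parse_vmstat` and `sample_vmstat` are made up for this page; the sketch only assumes that `vmstat` prints one banner line, one line of column names, and then purely numeric rows, as in the transcripts on this page.

```python
import subprocess


def parse_vmstat(text):
    """Turn vmstat's table output into a list of {column_name: int} dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # lines[0] is the group banner ("procs ---memory--- ..."),
    # lines[1] names the individual columns (r, b, swpd, ...).
    columns = lines[1].split()
    rows = []
    for ln in lines[2:]:
        values = [int(field) for field in ln.split()]
        rows.append(dict(zip(columns, values)))
    return rows


def sample_vmstat(interval=1, count=5):
    """Run `vmstat interval count` and parse it (assumes vmstat is installed)."""
    result = subprocess.run(
        ["vmstat", str(interval), str(count)],
        capture_output=True, text=True, check=True,
    )
    return parse_vmstat(result.stdout)
```

Calling `sample_vmstat()` returns five dicts; by the description above, the first one is the since-boot aggregate and the rest cover one interval each, so `rows[1:]` is usually what you want to look at.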
     11
     12On a typical modern system there are 4 main categories of resources whose exhaustion causes user-noticeable lag (please give a shout if there are other categories i'm ignoring):
     13
     14== CPU Cycles (aka "the processor") ==
     15
     16Your basic computer internally
     17can really only be doing one thing at a time.  Even more modern
     18computers with multi-core CPUs and/or multiple processors can only
     19be doing a handful of things at once.  The illusion of
     20"multitasking" comes from the fact that the operating system forces the
     21CPU to switch contexts very rapidly between all the outstanding
     22tasks that the user has instructed it to work on.  If you've
     23instructed your computer to do more work than it has time to get
     24to, you'll perceive it as sluggishness.
     25
     26Some example ways to exhaust your CPU: complicated statistical
     27analysis (e.g. seti@home), heavy-duty cryptanalysis (e.g. password
     28cracking), algorithmically-intensive data transformations
     29(e.g. transcoding video), excessive javascript in pages you've
     30viewed (e.g. the countdown clock that used to be on ussf2007.org),
     31etc.
     32
     33Symptoms that might mean you've got a CPU bottleneck:
     34
     35 * run "vmstat 1": do you see the "us" and "sy" (user and kernel) columns in the "cpu" section dominating the "id" and "wa" (idle and I/O wait) columns?
     36 * are all the fans on your system running full blast, and the computer is churning out a lot of heat?  Processors under heavy load get hot and need to purge their heat somewhere.
     37
     38Here's `vmstat` of a computer under heavy CPU load (note that the first line just shows that the CPU on `monkey` has been idle for 79% of the time since it booted):
{{{
[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  1 131464   5440   5532 102616    1    1    20    44  170  289 18  2 79  1
 3  1 131464   6308   5532 100744    0    0     0    44  272 2593 25 75  0  0
 3  1 131464   5776   5532 100184    0    0     4     0  270 2560 25 75  0  0
 3  0 131464   6248   5548  98192    0    0     4  5188  303 2125 22 78  0  0
 4  1 131464   6056   5548  97668    0    0     0     0  452 3884 39 61  0  0
[0 dkg@monkey ~]$
}}}
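
The "do `us` and `sy` dominate `id` and `wa`?" test can be written down as a rough Python sketch.  The function name `looks_cpu_bound` and the 90% threshold are assumptions chosen for illustration, not anything `vmstat` defines; the sample values are the `us`, `sy`, `id` and `wa` columns copied from the transcript above.

```python
def looks_cpu_bound(row, busy_threshold=90):
    """Heuristic: user + kernel time dominates idle + I/O wait.

    busy_threshold (percent) is an arbitrary cut-off for this sketch.
    """
    busy = row["us"] + row["sy"]
    waiting = row["id"] + row["wa"]
    return busy >= busy_threshold and busy > waiting


# First row of the transcript: averages since boot.
since_boot = {"us": 18, "sy": 2, "id": 79, "wa": 1}

# Remaining rows: one per one-second interval.
intervals = [
    {"us": 25, "sy": 75, "id": 0, "wa": 0},
    {"us": 25, "sy": 75, "id": 0, "wa": 0},
    {"us": 22, "sy": 78, "id": 0, "wa": 0},
    {"us": 39, "sy": 61, "id": 0, "wa": 0},
]

cpu_bound_now = all(looks_cpu_bound(row) for row in intervals)
cpu_bound_since_boot = looks_cpu_bound(since_boot)
```

With these numbers every interval row counts as CPU-bound while the since-boot row does not, which is why it pays to skip the first row when judging what the machine is doing right now.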
     50
     51
     52== RAM (aka "memory") ==
     53
     54The memory is the "working set" of information
     55that the computer can access relatively fast.  Every tab you have
     56open in your web browser (even if it's not foregrounded and no
     57javascript is running) will set aside some RAM to keep track of it.
     58Every word processing document you're writing is loaded into RAM
     59while you're writing it.  Pretty pictures on your desktop require a chunk of RAM.
     60
     61Modern computers are clever enough to use
     62swap (aka "virtual memory") when you ask them to hold more things
     63in RAM than they physically have -- this just means that they
     64substitute the slower (but much larger) hard disk for RAM when things get tight.  A common principle here is to eject the LRU (Least Recently Used) block of RAM, writing it out to a special place on disk (the "swap file"), and loading in new data requested by an active process.
     65
     66Symptoms that might mean you've got a RAM bottleneck:
     67
     68 * frequent, heavy disk activity when you're not trying to write out or copy large files usually means that you're swapping.  Since you only swap when you've run out of RAM, that's a bad sign.  If you're lucky enough to have a machine with a functional disk activity light, keep an eye on it.  If you don't have a disk activity light, listen with your ears: unless you've got a solid-state disk (as of 2008, if you aren't sure whether you have a solid state disk, you ''probably don't have one''), disk activity like this is actually audible as whirring and clicking.
     69 * Applications shutting down without warning could mean hitting a hard wall on RAM.  If the computer has ''X'' amount of RAM, and you've instructed the operating system to set aside ''Y'' amount of swap space, and then you ask the computer to do tasks that consume more than ''X + Y'' memory in aggregate, your computer has to decide what to do:  it's probably going to start by killing off some of the more offensive memory hogs to get the system back into a normal state.  On Linux-based systems, this job is performed by a kernel subsystem called the [http://linux-mm.org/OOM_Killer oom-killer] (out of memory killer).  It's kind of a black art and you really don't want to get to the point where you need it.
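
To get a rough idea of where that ''X + Y'' wall sits on a Linux machine, you can read `/proc/meminfo`.  The sketch below is only an approximation (the kernel keeps some memory for itself, and overcommit settings change when allocations actually fail), and the helper names `parse_meminfo`, `read_meminfo` and `memory_ceiling_kib` are invented for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style "Key:   value kB" lines into a dict of ints."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])  # kB for the entries used here
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def memory_ceiling_kib(info):
    """X + Y from the text: physical RAM plus configured swap, in kB."""
    return info["MemTotal"] + info["SwapTotal"]


def swap_in_use_kib(info):
    """How much of the swap space is currently occupied, in kB."""
    return info["SwapTotal"] - info["SwapFree"]
```

On a live system, `memory_ceiling_kib(read_meminfo())` gives the approximate ceiling, and a `swap_in_use_kib` value that keeps creeping toward `SwapTotal` suggests you are heading for oom-killer territory.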
     70
     71Here's `vmstat` of a computer that has exhausted its RAM and is chewing into swap:
{{{
[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0 131548   5880   4068 121408    1    1    19    48  171  290 18  2 79  1
 1  1 131628   4920   4448  84212    0   80    12    80  116  313  2 16 75  7
 6  4 143132   5124    256  50252    0 11496     0 11648  615 1296  1 69  0 30
 2  1 158556   4856    172  43008   96 15404   100 15404  745  721  3 70  0 27
 7  7 181712   4732    192  38280  528 22192   556 22276  983  954  4 84  0 12
[0 dkg@monkey ~]$
}}}
     83
     84Things to note:
     85
     86 * the `swap` columnset is active in both `si` (swap in, meaning bringing a page of RAM back in from disk) and `so` (swap out, meaning writing a page of RAM out to disk)
     87 * the `cpu` spends a good chunk of its time in the `wa` (I/O wait) state, meaning that it is otherwise idle, but waiting for some sort of disk access.
     88 * the amount of swap space in use (`swpd`) is increasing
     89 * the amount of `memory` allocated to buffers (`buff`) and `cache` drops precipitously -- buffering and caching are two performance-optimizing ways that the kernel makes use of memory that is otherwise unallocated.  They speed up your use of the machine without you asking them to do anything concretely, but they are not strictly required to make the computer work correctly.  So when memory gets tight, the kernel reclaims the RAM it was using for buffers and caches to try to accommodate the new requirements coming from the user.
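
The first and third observations are easy to check mechanically.  Here is a small Python sketch; the function name `swap_pressure` is invented, and the sample values are the `swpd`, `si` and `so` columns of the four interval rows in the transcript above (the since-boot row is left out).

```python
def swap_pressure(rows):
    """Return (is_swapping, swpd_growth) for a list of vmstat interval rows.

    is_swapping: any swap-in or swap-out traffic during the sample.
    swpd_growth: change in the swpd column between first and last row.
    """
    is_swapping = any(row["si"] > 0 or row["so"] > 0 for row in rows)
    swpd_growth = rows[-1]["swpd"] - rows[0]["swpd"]
    return is_swapping, swpd_growth


intervals = [
    {"swpd": 131628, "si": 0, "so": 80},
    {"swpd": 143132, "si": 0, "so": 11496},
    {"swpd": 158556, "si": 96, "so": 15404},
    {"swpd": 181712, "si": 528, "so": 22192},
]

is_swapping, swpd_growth = swap_pressure(intervals)
```

For this transcript `is_swapping` is true and `swpd` grows by 50084 over four seconds, which is about as clear a sign of RAM exhaustion as you're likely to see.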
     90
     91== Disks (aka "I/O") ==
     92
     93There are really two kinds of resources you can exhaust related to disks, but only one of them typically results in the sluggishness this page attempts to diagnose.  I'll get the other one out of the way first:
     94
     95=== Disk Space (capacity) ===
     96
     97This is the form of disk resource people are most used to seeing exhausted.  You get messages like "cannot save file, disk is full" from your programs, or you get weird misbehaviors or system failures -- services being unable to log, mail transfer agents bouncing mail, etc.
     98
     99The quickest way on a reasonable system to get a sense of how full your disks are is just `df`.  Using the `-h` flag shows you the numbers in "human-readable" format:
     100
{{{
0 ape:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_ape0-root
                     1008M  802M  156M  84% /
tmpfs                  64M     0   64M   0% /lib/init/rw
udev                   10M   44K   10M   1% /dev
tmpfs                  64M     0   64M   0% /dev/shm
/dev/sda1             228M  139M   78M  65% /boot
0 ape:~#
}}}
     112You see here that `ape` only has 156MB available on its root filesystem.  This is pretty tight: any time you get close to 90% full on a filesystem, the kernel has to do a lot more work to decide how to place the files you want to store.  With a near-full filesystem, storing a larger file can take more time because it often needs to be broken up into smaller chunks and distributed across the disk.  Having files scattered across the disk ("fragmented") means more work for the moving parts within the disk when you want to access that file.  Moving parts are slow compared to electronics.
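
If you want a script to warn you before a filesystem gets that tight, Python's standard `shutil.disk_usage` reports the same kind of numbers.  This is a sketch only: the helper names and the 90% warning level are assumptions taken from the rule of thumb above, and the percentage can differ a little from `df`'s `Use%` column because `df` accounts for blocks reserved for root differently.

```python
import shutil


def percent_used(total, available):
    """Share of the filesystem that is not available to ordinary users."""
    return 100.0 * (total - available) / total


def check_filesystem(path="/", warn_at=90.0):
    """Return (percent_used, is_tight) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.free)
    return pct, pct >= warn_at
```

For example, `check_filesystem("/")` returns the current fill level of the root filesystem and whether it has crossed the warning level.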
     113
     114But the modern operating system kernel hides almost all of these shenanigans from the end user pretty well, so you usually won't notice performance degradation from full filesystems until they're actually full, at which point you'll get the nasty hard errors mentioned above.  This brings me to the flavor of disk-related resource that often ''does'' cause perceptible performance problems...
     115
     116=== Disk Throughput ("I/O" or "bandwidth") ===
     117
     118Where disks are most likely to cause user-visible sluggishness is in their ''throughput'', not their ''capacity''.  Accessing data off of a disk (or writing data to a disk) is extremely slow when compared to other parts of a modern computer.  If the user has to experience this behavior directly, they'll likely feel like the computer is sluggish.
     119
     120A good kernel can deal with this for disk ''writes'' transparently, as long as there is enough RAM: when the user says "save this file to disk", the kernel just says, "ok, fine", caches the data first in (fast) RAM, returns control to the user, and then (while the user is otherwise idle) dribbles the data out to (slow) disk in little chunks when the opportunity presents itself.  If you've ever tried to save a file to a floppy disk in Windows 98, you know the annoyance that comes from an OS ''not'' doing this sort of "write caching": writes to floppy disks under that OS were synchronous (they had to happen exactly when the user requested them) -- so the whole machine would lock up while the file was actually being saved.
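
You can see the difference between cached and synchronous writes for yourself with a few lines of Python.  This is a sketch, with a made-up helper name `timed_write`: an ordinary write returns once the data has been handed to the kernel, while adding `os.fsync` makes the call wait until the kernel reports the data as written to the device.

```python
import os
import time


def timed_write(path, data, sync):
    """Write `data` to `path` and return the elapsed time in seconds.

    With sync=False the kernel is free to keep the data in its RAM cache
    and write it out later; with sync=True we wait for the disk.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        if sync:
            f.flush()             # push Python's own buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it to the disk
    return time.perf_counter() - start
```

Comparing `timed_write(path, data, sync=False)` with `timed_write(path, data, sync=True)` for a few hundred megabytes on a spinning disk usually shows the synchronous version taking noticeably longer, though the exact numbers depend on your hardware and how much RAM is free.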
     121
     122Modern kernels also do similar sleights-of-hand on disk ''reads'', though it is harder to do because predicting what the user is going to want to read next is an imprecise art.  So if some subsystem in your computer is accessing the disk a lot, then other programs which need to access the disk will be noticeably slower, as they wait for their turn at the limited bandwidth available to the disk. 
     123
     124Here's a common situation: a program starting up needs to read its executable binary (and all linked libraries) from your disk into RAM.  If you've already opened the program previously (and you have enough RAM), it's likely that the computer will have that copy of it in RAM already, so you won't see the sluggishness related to pulling it from disk.  But if you're low on RAM already (so the cached data has been ejected from RAM), or you've just never opened this program before, it won't be cached.  If another process is hammering the disk (e.g. say you're making two copies of a multi-gigabyte DVD image), then the disk accesses the program needs to actually start up will be interleaved with the other requests made of the disk.  This will manifest itself as a slow program startup.
     125
     126Note also that it starts getting tricky here: when you [#RAMakamemory run out of RAM], your computer often exhausts the throughput to your disks because it's swapping.  So sometimes when you see that the disk activity is excessive, it could be due fundamentally to RAM exhaustion: so check there first.
     127
     128Symptoms that might mean you have a disk I/O bottleneck:
     129
     130 * constant hard disk activity -- if your machine is physically nearby, look for the disk activity lights, or listen for the whirring, clicking sound that an active disk makes.  If your machine is remote, use `vmstat` to look for high levels of `bi` and `bo` in the `io` grouping, or high levels of `wa` in the `cpu` grouping.
     131 * launching programs for the first time since boot, or opening files for the first time since boot takes much longer than it should.
     132
     133Here's `vmstat` of a system just coming under heavy I/O load:
{{{
0 dkg@squash:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0     56   6944 238140 106932    0    0    15    44  260   54  4  1 93  2
 1  0     56   4896 252776  94608    0    0  7916     0  253  204  9 58 24  9
 1  1     56   4644 253984  93612    0    0 10624  2800  253  264 12 80  0  8
 1  0     56   4632 252984  93612    0    0  5376  5488  253  261  3 52  0 45
 1  0     56   5548 252528  93416    0    0  4224  4144  253  263  3 44  0 53
0 dkg@squash:~$
}}}
     145
     146Note that once the load kicks in (the last three rows) there are no idle CPU cycles (`id` in the `cpu` section), but a significant number of cycles in I/O wait (`wa` in `cpu`).  The I/O wait column is new as of Linux kernel 2.6 -- if you're using a 2.4 series kernel, you won't be able to see this.  CPU cycles counted in this last column are cycles in which the CPU is idle, but there is an outstanding request to the disks ([https://www.debian-administration.org/users/dkg/weblog/15 or does other I/O count?]).  This indicates that if the I/O had completed, more activity could happen (because the CPU is otherwise idle), so large values in that column are good indicators of a disk throughput problem.
     147
     148Note also that `bi` and `bo` are large values (in `io`), while there is no actual swap activity.  This is helpful to distinguish from an out-of-RAM state.
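
Putting those two notes together, a rough Python check for "disk-bound but not swapping" might look like this.  The function name `looks_io_bound` and the 25% threshold for `wa` are assumptions made for this sketch; the sample rows reuse values from the transcripts on this page.

```python
def looks_io_bound(row, wa_threshold=25):
    """Heuristic: lots of I/O wait, real block traffic, and no swap traffic.

    wa_threshold (percent) is an arbitrary cut-off for this sketch.
    """
    not_swapping = row["si"] == 0 and row["so"] == 0
    moving_blocks = row["bi"] + row["bo"] > 0
    return not_swapping and moving_blocks and row["wa"] >= wa_threshold


# Last row of the I/O-load transcript above.
disk_heavy = {"si": 0, "so": 0, "bi": 4224, "bo": 4144, "wa": 53}

# A row from the out-of-RAM transcript: high wa, but caused by swapping.
out_of_ram = {"si": 0, "so": 11496, "bi": 0, "bo": 11648, "wa": 30}

disk_heavy_result = looks_io_bound(disk_heavy)
out_of_ram_result = looks_io_bound(out_of_ram)
```

The first row is flagged as I/O-bound and the second is not, because its disk traffic is a side effect of swapping and the real fix lies with RAM.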
     149
     150== Network ==
     151
     152Diagnosing network resource exhaustion is tougher than the other forms of resource exhaustion because it's often caused by external systems.  So saying "it's a network problem" is sometimes the answer of last resort when none of the other resources are anything close to exhausted.
     153
     154''FIXME: more to write here''
     155
     156If you suspect that your upstream connection is clogged and you control the router/gateway, you might try using [wiki:iftop] to figure out which particular client is hogging the pipe.
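
If you just want to know whether a Linux machine's own network interface is busy (rather than which client is responsible, which is what `iftop` is for), you can sample the byte counters in `/proc/net/dev` twice and take the difference.  This is a sketch with invented helper names; it assumes the usual layout of that file, with two header lines followed by one line per interface carrying 8 receive and 8 transmit counters.

```python
def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) >= 16:
            # field 0 is received bytes, field 8 is transmitted bytes
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_rates(before, after, seconds):
    """Per-interface (rx, tx) bytes per second between two samples."""
    return {
        iface: ((after[iface][0] - rx) / seconds,
                (after[iface][1] - tx) / seconds)
        for iface, (rx, tx) in before.items()
        if iface in after
    }
```

To use it, read `/proc/net/dev`, sleep for a known number of seconds, read it again, and pass both parsed samples to `byte_rates`; compare the result against what your link can actually carry.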