---
title: Troubleshooting Cloud Foundry
owner:
- CAPI
- Release Integration
---
This guide provides help with diagnosing and resolving issues encountered when
installing and running Cloud Foundry.
This topic assumes you are using [BOSH CLI v2](https://bosh.io/docs/cli-v2.html).
##<a id="issues-installing"></a>Troubleshoot Cloud Foundry installation issues
An installation or update can fail for reasons that have nothing to do with the
software that you are installing.
If an installation or update fails once, start over and try again.
If it fails a second time, use the following information to troubleshoot the
issue.
###<a id="bound-timeout"></a>Timeouts in "creating bound missing vms" phase
When deploying Cloud Foundry with BOSH, the "creating bound missing vms" phase
occurs after package compilation.
A process in the "creating bound missing vms" phase can time out if a BOSH Agent
fails to start correctly or if the Agent cannot connect to the NATS message bus.
<pre class="terminal">
. . .
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started creating bound missing vms
Started creating bound missing vms > api_worker_z1/0. Failed: Timed out pinging to 1da06ba3de2f after 600 seconds (00:12:45)
Error 450002: Timed out pinging to 1da06ba3de2f after 600 seconds
</pre>
Perform the following steps to determine the cause of the timeout:
* [Use your IaaS console](#boot-check) to make sure the timed-out virtual machine (VM) is booting correctly
* [Check the agent log](#handshake) on the VM that timed out for a handshake connection
between the BOSH Director and the Agent
* [Use the Netcat networking utility](#netcat) to test for routing
issues to the NATS IP and port from the VM that timed out
####<a id="boot-check"></a>Use your IaaS console to make sure the timed-out VM is booting correctly
For details on how to use your IaaS console to make sure the timed-out VM is
booting correctly, see your IaaS documentation.
####<a id="handshake"></a>Check the agent log on the VM that timed out for a handshake connection between the BOSH Director and the agent
1. Use your IaaS virtualization console to open a terminal window on the VM that
timed out and log in as root.
1. Open the `/var/vcap/bosh/log/current` log file in a text editor.
1. Search the log file for a "handshake" between the BOSH Director and the BOSH
Agent.
This connection is represented in the log as a `ping` and a `pong`:
<pre class="terminal">
. . .
2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"<strong>ping</strong>", "arguments"=>[], "reply_to"=>"director.b668-1660944090e4"}
2013-10-03_14:35:48.60182 #[608] INFO: reply_to:director.b668-1660944090e4: payload: {:value=>"<strong>pong</strong>"}
</pre>
1. If the handshake does not complete, the Agent cannot communicate with the
Director.
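The log search in the steps above can also be scripted. The following is a minimal sketch, assuming the agent log uses the `ping` and `pong` message format shown in the sample output. The function name `check_handshake` is hypothetical and is not part of BOSH.

```shell
# Hypothetical helper, not part of BOSH: reports whether an agent log
# contains both the Director's ping and the Agent's pong reply.
# The grep patterns assume the log format shown in the sample above.
check_handshake() {
  log_file="${1:-/var/vcap/bosh/log/current}"
  if grep -q '"method"=>"ping"' "$log_file" &&
     grep -q ':value=>"pong"' "$log_file"; then
    echo "handshake found"
  else
    echo "handshake missing"
    return 1
  fi
}
```

Run `check_handshake` as root on the VM that timed out. With no argument, it reads `/var/vcap/bosh/log/current`.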
####<a id="netcat"></a>Use Netcat to test for routing issues to the NATS IP
1. Use your IaaS virtualization console to open a terminal window on the VM that
timed out and log in as root.
1. Open the `/var/vcap/bosh/log/current` log file in a text editor.
1. Search the beginning of the log file for a line labeled <strong>INFO: loaded
new infrastructure settings</strong>.
This line contains a JSON blob of key/value pairs representing the expected
infrastructure for the BOSH Agent.
In this line, locate the IP address and port following
<strong>nats://nats:nats@</strong>.
<pre class="terminal">
. . .
2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings: {"vm"=>{"name"=>"vm-4d80ede4-b0a5", "id"=>"vm-360"}, {"user"=>"agent", "password"=>"agent"}}, "mbus"=>"nats://nats:nats@<strong>192.0.2.17:4222</strong>", "env"=>{"bosh"}}}}}
</pre>
1. Run the [Netcat](http://nmap.org/ncat/) command `nc -v IP-ADDRESS PORT` to
determine whether it is possible to establish a connection between the NATS
message bus and this VM.
<pre class="terminal">
$ nc -v 192.0.2.17 4222
Connection to 192.0.2.17 4222 port [tcp/*] succeeded!
</pre>
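If you prefer not to copy the address by hand, the lookup can be sketched as a small shell function. This assumes the `mbus` value has the `nats://USER:PASSWORD@HOST:PORT` form shown above with an IPv4 host. The name `nats_endpoint` is hypothetical and is not part of BOSH.

```shell
# Hypothetical helper, not part of BOSH: prints "HOST PORT" taken from the
# mbus URL in the agent's "loaded new infrastructure settings" log line.
# Assumes an IPv4 host and the nats://USER:PASSWORD@HOST:PORT form shown above.
nats_endpoint() {
  log_file="${1:-/var/vcap/bosh/log/current}"
  grep -m 1 'loaded new infrastructure settings' "$log_file" |
    sed -n 's|.*nats://[^@]*@\([0-9.]*\):\([0-9]*\).*|\1 \2|p'
}
```

For example, `set -- $(nats_endpoint) && nc -v "$1" "$2"` runs the connection test against the extracted address. If the function prints nothing, the settings line was not found in the log.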
###<a id="out-of-space"></a>Out of Disk Space error
If, during installation, log files fill all available disk space on a computer,
the operating system reports an "Out of Disk Space" error.
To resolve this issue, delete these log files from the `tmp` directory of the
affected computer, either manually or by rebooting it.
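Before deleting anything, it can help to see which files take up the most space. The following is a rough sketch, not an official procedure. The function name `largest_files` is hypothetical, and the default directory `/tmp` and the default of 10 entries are assumptions.

```shell
# Hypothetical helper: lists the largest files under a directory, biggest
# first, with sizes in KB. Defaults to /tmp and 10 entries (both assumptions).
largest_files() {
  dir="${1:-/tmp}"
  count="${2:-10}"
  # -xdev keeps find on the same filesystem as the starting directory.
  find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null |
    sort -rn |
    head -n "$count"
}
```

For example, `largest_files /tmp 5` prints the five largest files under `/tmp`.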
##<a id="issues-running"></a>Troubleshoot Cloud Foundry operation issues
If you have not done so already, set an alias for the BOSH Director of the environment you are
troubleshooting. You need this alias to run the BOSH CLI commands in this section.
Run the following command to create a local alias for the BOSH Director using the BOSH CLI:
<pre>bosh alias-env MY-ENV -e DIRECTOR-IP-ADDRESS --ca-cert PATH/TO/CERT</pre>
Replace the placeholder text with the following:
* `MY-ENV`: Enter an alias for the BOSH Director, such as `dev` or `prod`.
* `DIRECTOR-IP-ADDRESS`: Enter the IP address of your BOSH Director VM.
* `PATH/TO/CERT`: Enter the path to the Director CA certificate.
For example:
<pre class="terminal">
$ bosh alias-env dev -e 10.0.0.3 --ca-cert /var/workspaces/default/ca_certificate
</pre>
###<a id="bosh-cli"></a>Use the BOSH CLI for troubleshooting
1. Run `bosh -e MY-ENV login` and provide your credentials to log in to the BOSH Director.
Replace `MY-ENV` with the alias you set for your BOSH Director.
<pre class="terminal">
$ bosh -e dev login
User ():
Password ():
</pre>
1. Use the following BOSH commands to troubleshoot your deployment:
* [VMs](#vms): Lists all VMs in a deployment
* [Cloud check](#cloudcheck): Cloud consistency check and interactive repair
* [SSH](#ssh): Start an interactive session or execute commands with a VM
####<a id="vms"></a>BOSH VMs
`bosh vms` provides an overview of the virtual machines (VMs) BOSH is managing as part of the current deployment.
To use this command, run `bosh -e MY-ENV -d MY-DEPLOYMENT vms`.
Replace `MY-ENV` with your environment, and `MY-DEPLOYMENT` with a deployment. `-d MY-DEPLOYMENT` is optional.
<pre class="terminal">
$ bosh -e prod -d cf vms
</pre>
`bosh vms` may show a VM in an <strong>unknown</strong> state.
Run `bosh cloud-check` on VMs in an unknown state to have BOSH attempt to
diagnose the problem.
You can also use `bosh vms` to identify VMs in your deployment, then use
`bosh ssh` to SSH into an identified VM for further troubleshooting.
`bosh vms` supports the following arguments:
* `--vitals`: Overview also includes load, CPU, memory usage,
swap usage, system disk usage, ephemeral disk usage, and persistent disk usage
for each VM
* `--dns`: Overview also includes the DNS A record for each VM
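In a large deployment, it can be convenient to filter the `bosh vms` output down to the VMs that need attention. The following is a minimal sketch. It assumes that piped `bosh vms` output is tab-separated, with the instance in the first column and the process state in the second. Verify this against your CLI version. The name `non_running_vms` is hypothetical and is not part of BOSH.

```shell
# Hypothetical helper, not part of BOSH: reads `bosh vms` output on stdin and
# prints every instance whose process state is not "running".
# Assumes tab-separated rows: instance in column 1, process state in column 2.
non_running_vms() {
  awk -F '\t' 'NF >= 2 {
    name = $1; state = $2
    gsub(/^ +| +$/, "", name)    # trim padding around the instance name
    gsub(/^ +| +$/, "", state)   # trim padding around the state
    if (state != "running") print name ": " state
  }'
}
```

For example, run `bosh -e MY-ENV -d MY-DEPLOYMENT vms | non_running_vms`, then run `bosh cloud-check` against the deployment if any VMs are listed.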
####<a id="cloudcheck"></a>BOSH cloud check
`bosh cloud-check` attempts to detect differences between the VM state database
that the BOSH Director maintains and the actual state of the VMs.
To use this command, run `bosh -e MY-ENV -d MY-DEPLOYMENT cloud-check`.
You can also use the alias `bosh cck`.
Replace `MY-ENV` with your environment, and `MY-DEPLOYMENT` with a deployment. `-d MY-DEPLOYMENT` is optional.
<pre class="terminal">
$ bosh -e dev -d mysql cck
</pre>
For each difference detected, `bosh cloud-check` offers repair options:
* `Reboot VM`: Instructs BOSH to reboot a VM. Rebooting can resolve many
transient errors.
* `Ignore problem`: Instructs `bosh cloud-check` to do nothing.
You might want to instruct `bosh cloud-check` to ignore a problem in order to run
`bosh ssh` and attempt troubleshooting directly on the machine.
* `Reassociate VM with corresponding instance`: Updates the BOSH Director state
database.
Use this option if you believe that the BOSH Director state database is in error
and that a VM is correctly associated with a job.
* `Recreate VM using last known apply spec`: Instructs BOSH to destroy a VM and
recreate it from the deployment manifest the installer provides.
Use this option if a VM is corrupted.
* `Delete VM reference`: Instructs BOSH to delete a VM reference in the Director
state database.
If a VM reference exists in the state database, BOSH expects to find an agent
running on the VM.
Select this option only if you know this reference is in error.
Once you delete the VM reference, BOSH can no longer control the VM.
##### Example scenarios
<strong>Unresponsive Agent</strong>
<pre class="terminal">
$ bosh -e prod -d example-deployment cck
ccdb/0 (vm-3e37133c-bc33-450e-98b1-f86d5b63502a) is not responding:
- Ignore problem
- Reboot VM
- Recreate VM using last known apply spec
- Delete VM reference (DANGEROUS!)
</pre>
<strong>Missing VM</strong>
<pre class="terminal">
$ bosh -e prod -d example-deployment cck
VM with cloud ID `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' missing:
- Ignore problem
- Recreate VM using last known apply spec
- Delete VM reference (DANGEROUS!)
</pre>
<strong>Unbound Instance VM</strong>
<pre class="terminal">
$ bosh -e prod -d example-deployment cck
VM `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' reports itself as `ccdb/0' but does not have a bound instance:
- Ignore problem
- Delete VM (unless it has persistent disk)
- Reassociate VM with corresponding instance
</pre>
<strong>Out of Sync VM</strong>
<pre class="terminal">
$ bosh -e prod -d example-deployment cck
VM `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' is out of sync: expected `cf-d7293430724a2c421061: ccdb/0', got `cf-d7293430724a2c421061: nats/0':
- Ignore problem
- Delete VM (unless it has persistent disk)
</pre>
####<a id="ssh"></a>BOSH SSH
Use `bosh ssh` to open secure shells into the VMs in your deployment.
To use `bosh ssh`:
1. Run `ssh-keygen -t rsa` to provide BOSH with the correct public key.
1. Accept the defaults.
1. Identify a VM to SSH into. Run `bosh -e MY-ENV -d MY-DEPLOYMENT vms` to list the VMs in the given deployment. Replace `MY-ENV` with your environment alias and `MY-DEPLOYMENT` with the deployment name.
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT ssh VM-NAME/GUID`. For example:
<pre class="terminal">$ bosh -e dev -d cf-deployment ssh diego-cell/abcd0123-a012-b345-c678-9def01234567</pre>
###<a id="viewing-logs"></a>View BOSH logs
You can access BOSH logs by two methods:
* Using the `bosh ssh` command to access the log location
* Using the `bosh logs` command to output the logs to standard output or to a
file
####<a id="view-ssh"></a>Using the BOSH SSH command
Use `bosh ssh` to open secure shells into the VMs in your deployment, then
access the logs on the VM.
To use `bosh ssh`:
1. Run `ssh-keygen -t rsa` to provide BOSH with the correct public key.
1. Accept the defaults.
1. Identify a VM to SSH into. Run `bosh -e MY-ENV -d MY-DEPLOYMENT vms` to list the VMs in the given deployment. Replace `MY-ENV` with your environment alias and `MY-DEPLOYMENT` with the deployment name.
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT ssh VM-NAME/GUID`. For example:
<pre class="terminal">$ bosh -e dev -d cf-deployment ssh diego-cell/abcd0123-a012-b345-c678-9def01234567</pre>
1. Review the `/var/vcap/bosh/log/current` log file.
####<a id="view-logs"></a>Using the BOSH logs command
Use `bosh logs` to output BOSH logs to standard output or to a file.
To use `bosh logs`:
1. Identify a VM. Run `bosh -e MY-ENV -d MY-DEPLOYMENT vms` to list the VMs in the given deployment. Replace `MY-ENV` with your environment alias and `MY-DEPLOYMENT` with the deployment name.
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT logs VM-NAME/GUID` to view the logs from the identified VM. For example:
<pre class="terminal">$ bosh -e dev -d cf-deployment logs diego-cell/abcd0123-a012-b345-c678-9def01234567</pre>
###<a id="non-responsive"></a>Log in to a non-responsive BOSH VM
A VM under heavy system load can stop responding to some commands but still
function in a limited way.
If the VM does not respond to `bosh ssh`, use the following steps to open a
secure shell in a more direct manner:
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT vms` and note the IP address of the non-responsive VM.
<pre class="terminal">$ bosh -e prod -d cf-deployment vms</pre>
1. Run `ssh -t vcap@IP-ADDRESS 'sh'` where IP-ADDRESS is the IP address of the
non-responsive VM.
<pre class="terminal">$ ssh -t vcap<span>@</span>192.0.2.53 'sh'</pre>
###<a id="terminate"></a>Terminate a BOSH SSH session
Use `~.`, entered at the beginning of a new line, to terminate a `bosh ssh` or
`ssh` session.
To terminate a nested `bosh ssh` or `ssh` session, such as one initiated from a
jumpbox or other server, use `~~.`, entered at the beginning of a new line.
The outermost secure shell session consumes the first `~` and passes the
remaining `~.` command to the inner `ssh` session.
###<a id="failing-job"></a>Debug a failing job
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT vms` to determine which job VMs in your deployment are failing.
Note the job name and index of the failing VM.
1. Run `bosh -e MY-ENV -d MY-DEPLOYMENT ssh VM-NAME/GUID` to open a secure shell into the failing VM.
1. Run `sudo su -` to switch to the root user.
1. Run `monit summary` to determine which processes are not running.
1. Review the log files found in `/var/vcap/sys/log/` to determine the root
cause of the process failures.
Some of these logs are formatted using steno with timestamps instead of human-formatted dates.
You can use `steno-prettify` to make the logs more human-readable.
1. Use `monit restart all` or `monit restart PROCESS` to restart the processes.
Execute the following commands to set up `steno-prettify` on the Cloud Controller VM:
<pre class="terminal">
export CC_JOB_DIR=/var/vcap/jobs/cloud_controller_ng
source $CC_JOB_DIR/bin/ruby_version.sh
CC_PACKAGE_DIR=/var/vcap/packages/cloud_controller_ng
export BUNDLE_GEMFILE=$CC_PACKAGE_DIR/cloud_controller_ng/Gemfile
export HOME=/home/vcap # rake needs it to be set to run tasks
if [ -f $BUNDLE_GEMFILE ]; then
alias steno-prettify="bundle exec steno-prettify"
echo "ready to use steno-prettify alias, try steno-prettify on one of the following files:"
find /var/vcap/sys/log/ -name "*.log" | egrep -v "err|out|ctl" | xargs ls -al
else
echo "could not find Gemfile in ${CC_PACKAGE_DIR}/cloud_controller_ng:" `ls -al ${BUNDLE_GEMFILE}`
fi
</pre>
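When a VM runs many processes, the `monit summary` output from the steps above can be narrowed to the entries that are not running. The following is a small sketch, assuming the `Process 'NAME'   STATUS` line layout that `monit summary` prints. The function name `failing_processes` is hypothetical.

```shell
# Hypothetical helper: reads `monit summary` output on stdin and prints the
# Process lines whose status is anything other than "running".
# Assumes the "Process 'NAME'   STATUS" layout used by monit summary.
failing_processes() {
  awk '$1 == "Process" && $3 != "running"'
}
```

For example, run `monit summary | failing_processes` as root on the failing VM. Empty output means that every monitored process reports `running`.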
###<a id="debugging-cc-diagnostics"></a>View Cloud Controller diagnostics
If your BOSH deployment experiences a resource spike, an infinite loop, or another undesirable performance issue, see the Cloud Controller (CC) diagnostics JSON file for debugging.
The file contains data that are unavailable in the output from [Cloud Foundry logging](./managing-cf/logging.html), including stack traces for running threads.
Follow the instructions below to SSH into the Cloud Controller VM and populate the diagnostics file with updated information.
1. Provide BOSH with the correct public key.
<pre class="terminal">
$ ssh-keygen -t rsa
</pre>
1. Accept the defaults.
1. Begin an SSH session to the deployment.
<pre class="terminal">$ bosh -e prod -d cf-deployment ssh diego-cell/abcd0123-a012-b345-c678-9def01234567</pre>
1. Enter the number corresponding to the <code>cloud_controller</code> from the list of VMs.
1. Retrieve the PID of the CC process.
<pre class="terminal">$ cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid</pre>
1. Send the user-defined `USR1` signal to the CC process to populate the JSON log file for reference. This command does not terminate the CC process.
<pre class="terminal">
$ kill -USR1 CLOUD-CONTROLLER-PID
</pre>
1. Navigate to the JSON file in the diagnostics directory to see the record of the `USR1` signal passed to your CC VM.
The value of the `cc.directories.diagnostics` property in `/var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml` contains the location of the diagnostics directory.
The default location of the diagnostics directory is `/var/vcap/data/cloud_controller_ng/diagnostics`.
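The PID lookup and the signal can be combined into one step. The following is a minimal sketch to run as root on the Cloud Controller VM. The function name `signal_cc_diagnostics` is hypothetical, and the default PID file path is the one used in the steps above.

```shell
# Hypothetical helper: sends USR1 to the process named in a PID file.
# The default path is the Cloud Controller PID file used in the steps above.
signal_cc_diagnostics() {
  pid_file="${1:-/var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid}"
  if [ ! -r "$pid_file" ]; then
    echo "cannot read $pid_file" >&2
    return 1
  fi
  pid=$(cat "$pid_file")
  # USR1 asks the process to write diagnostics; it does not stop the process.
  kill -USR1 "$pid" && echo "sent USR1 to $pid"
}
```

After the signal is sent, `ls -t /var/vcap/data/cloud_controller_ng/diagnostics | head -n 1` shows the newest diagnostics file, assuming the default diagnostics directory.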
###<a id="debugging-cloudcontroller"></a>Use the interactive Cloud Controller shell
The Cloud Controller job embeds a [pry shell](http://pryrepl.org/). You can use this shell to interact with
cc\_ng classes, for example to script operations, or to access the cc\_db using model classes.
<pre class="terminal">
$ cd /var/vcap/jobs/cloud_controller_ng
$ bin/console
[...]
</pre>
###<a id="debugging-full-dbs"></a>Troubleshoot Cloud Foundry databases
The `postgres` BOSH job hosts the different databases used by Cloud Foundry, such as `diego`, `ccng`, and `uaadb`. If the `postgres` job reaches 100% persistent disk usage, it can impact performance. Perform the following steps to diagnose your `postgres` job:
1. Retrieve the credentials for the `postgres` job from your BOSH manifest.
1. Connect to the database using the PostgreSQL client of your choice. Alternatively, you can connect to the database using a PostgreSQL web interface such as [phppgadmin-cf](https://github.com/cloudfoundry-community/phppgadmin-cf/) or open an SSH tunnel to the database using `cf ssh`.
1. Use the following SQL query to identify the largest schema:
```
SELECT schema_name,
pg_size_pretty(sum(table_size)::bigint),
(sum(table_size) / pg_database_size(current_database())) * 100
FROM (
SELECT pg_catalog.pg_namespace.nspname as schema_name,
pg_relation_size(pg_catalog.pg_class.oid) as table_size
FROM pg_catalog.pg_class
JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid
) t
GROUP BY schema_name
ORDER BY schema_name
```
1. Using the output from the above query, identify the largest table.
###<a id="force-recreate"></a>Force a VM recreate
If BOSH or an operator identifies a VM as corrupted, BOSH can recreate the VM.
This recreation function can fail if a drain script on the VM is broken or times
out.
To resolve this issue:
1. SSH into the VM and use the `sv stop agent` command to stop the BOSH Agent.
<pre class="terminal">
$ bosh -e dev -d cf-deployment ssh diego-cell/abcd0123-a012-b345-c678-9def01234567
$ sv stop agent
</pre>
1. Let the BOSH Health Monitor automatically restart the VM.
To follow the status of this process:
* Run `bosh -e MY-ENV tasks` to determine the task ID of the "scan and fix" task.
Add `-d MY-DEPLOYMENT` to view tasks for a specific deployment.
<pre class="terminal">$ bosh -e dev -d cf-deployment tasks</pre>
* Run `bosh -e MY-ENV task TASK-ID` with the task ID of the "scan and fix" task.
<pre class="terminal">$ bosh -e dev task 83</pre>
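Finding the task ID can be scripted when the task list is long. The following is a rough sketch. It assumes that piped `bosh tasks` output is tab-separated with the task ID in the first column. Verify this against your CLI version. The name `scan_and_fix_task_ids` is hypothetical and is not part of BOSH.

```shell
# Hypothetical helper, not part of BOSH: reads `bosh tasks` output on stdin
# and prints the ID of each task whose row mentions "scan and fix".
# Assumes tab-separated rows with the task ID in column 1.
scan_and_fix_task_ids() {
  awk -F '\t' '/scan and fix/ {
    id = $1
    gsub(/[^0-9]/, "", id)   # keep only the numeric task ID
    if (id != "") print id
  }'
}
```

For example, run `bosh -e MY-ENV -d MY-DEPLOYMENT tasks | scan_and_fix_task_ids`, then pass the printed ID to `bosh -e MY-ENV task TASK-ID`.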