System is responding slowly

Users can experience slowness for many reasons, and we can only troubleshoot effectively once we understand the root cause. To resolve a problem quickly, it is extremely important that users provide details. To some users, slow means a page takes more than 2 seconds to load, while to others it means 20 seconds for the same page. Instead of qualitative words (big, small, slow, fast, good, not good), use quantitative terms such as seconds, minutes, hours, and date / time when communicating with the engineers who are going to help you.

Be specific as well: indicate which page (function / screen / module) is slow, how many seconds it takes to load, and how many seconds it takes to generate a certain report. When reporting a slow report, state exactly which date range, which branches / company, and which category filters or settings were used to generate it. All of this information is essential for us to identify the root cause quickly and resolve the problem effectively.
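One way to produce the kind of quantitative figure described above is to time a page from the command line. The following is only a minimal sketch: it assumes curl is installed, and the function names and the example URL are hypothetical, not part of Wavelet EMP.

```shell
#!/usr/bin/env bash
# Sketch: time a page load so a report can say "N seconds" instead of "slow".
# Assumptions: curl is installed; function names and the URL are hypothetical.

# Print the total time, in seconds, that curl needs to fetch a URL.
measure_load_seconds() {
    curl -s -o /dev/null -w '%{time_total}' "$1"
}

# Build one report line from a page name, a measured time and a timestamp.
format_slowness_report() {
    local page="$1" seconds="$2" when="$3"
    printf '%s took %s seconds to load at %s\n' "$page" "$seconds" "$when"
}

# Example (needs network access):
#   secs=$(measure_load_seconds "https://emp.example.com/login")
#   format_slowness_report "Login page" "$secs" "$(date '+%Y-%m-%d %H:%M')"
```

Note that curl only measures the time to download the response, not the time the browser needs to render the page, so treat the number as a lower bound.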

Possible Root Cause of the System Slowness, Detection, and Solutions

No

Possible Root Cause

Troubleshooting Guide

Solutions

1

Client Device

The fastest way to check whether the slowness is caused by the client device is to access the same website / functions from several other devices. If access from the other devices is normal, the slowness is almost certainly caused by the original device itself.

Whether it is a mobile or a desktop device, see below for more steps to identify the root cause:

Android Devices
- To monitor the performance of your Android device, you can download an app that does this for you. Plenty of apps are available; some recommendations are listed here: https://www.androidpit.com/best-apps-for-monitoring-system-performance-on-your-android-device
iOS
- Many system monitoring apps are available for iOS as well.

Mac
- Out of the box, you can use the Activity Monitor that comes with macOS
  - https://support.apple.com/en-us/HT201464
Windows
- There are many versions of Windows, each with plenty of resources available. Some useful ones are listed below:
  - http://www.digitalcitizen.life/how-use-resource-monitor-windows-7
  - http://www.howtogeek.com/school/using-windows-admin-tools-like-a-pro/lesson6/all/

Apart from system performance issues, our technical support team regularly sees client devices that perform badly because they are infected by a virus. When we detect high CPU usage and high network bandwidth usage, we usually also recommend that users install anti-virus software to detect potential security threats on the device.

Another possibility is the web browser. For web browser usage and settings, check the following areas:

DNS settings
Too many tabs
Cache
Proxy Settings

Depending on the root cause found while troubleshooting the client device, users could upgrade the RAM or processor, or run an anti-virus scan accordingly.

By the time you read this, the guides above may be outdated, but Google is your best friend: plenty of resources are available online to help you optimize your devices.

2

Client Local Area Network

When devices are not connected directly to the internet, traffic may pass through routers, switches, proxy servers and other network hops before reaching Wavelet EMP on the cloud. There are a few ways to detect whether the problem is caused by the Local Area Network:

Non-technical approach
- Get help from another user at another location to access the website
- Use a different device that connects directly to the internet. For example, if your computer has problems accessing the website, test with your phone (without connecting to Wi-Fi; use a direct mobile connection such as 3G)
Using Ping
- Ping your router's IP address (not Google or any other external website / server)
Monitoring the network performance on the networking devices
- The linked tutorial covers D-Link routers, but other routers should be similar; refer to the respective user manual for other routers.
Other devices within the same local area network
- Sometimes there is no problem with your device or router, but another device within the local area network is infected by a virus and is consuming a significant amount of bandwidth. It is also possible that another user somewhere in the LAN (Local Area Network) has installed an application such as a torrent client that downloads and shares large files, or is watching YouTube or other live streaming content.
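The router ping check above can be turned into a number that is easy to report. This is a rough sketch assuming a Linux client with the iproute2 "ip" tool and a ping that prints a "min/avg/max" summary line; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: measure the average round-trip time to the local router.
# Assumptions: Linux client, iproute2 "ip" command, ping with a min/avg/max summary.

# Extract the average round-trip time (ms) from ping output read on stdin.
ping_avg_ms() {
    awk -F' = ' '/min\/avg\/max/ { split($2, t, "/"); print t[2] }'
}

# Print the default gateway (the router) of this machine.
default_gateway() {
    ip route show default | awk '/^default/ { print $3; exit }'
}

# Example (run on the affected machine):
#   ping -c 5 "$(default_gateway)" | ping_avg_ms
```

A router on the same LAN typically answers within a few milliseconds; a consistently high average or lost packets points to the local network rather than to the EMP server.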

3

Client Internet Connectivity

To make troubleshooting on the client side easier, ask the customer to use the right tools, such as TeamViewer. The top reasons on the client side are listed below:

The Obvious: The URL has a typo.
- Ask the user to double-check the URL. This may seem obvious, but it is critical that the user enters the correct URL. It is better to ask the user to send you the URL so that you can examine it and try it yourself. There are three things to check in the URL.

  1. Is the protocol correct (for example http vs https)? Your web server admin could very well have blocked port 80 altogether.
  2. Are the domain name and the resource being accessed spelled correctly?
  3. Is the port number (if any) correct? The browser defaults to port 80 for http and port 443 for https.

The user’s PC does not have Network connection
- Ask the user to access other sites to make sure the PC is on the network. If no sites can be reached, chances are the PC itself has a problem.
The user’s PC is not able to resolve names (DNS)
- It is possible that the user’s PC is unable to resolve any names due to DNS issues. Have the user run “nslookup <domain name>” in the command prompt and confirm that the domain name resolves to an IP address.
The client router's problem
- Advise the client to find an internal IT expert to fix the router until they are able to connect to the internet. Kindly inform the customer that we do not support internal problems such as a router that has been reset, a LAN cable that doesn't work, or Wi-Fi problems. We only support issues related to the EMP server and port forwarding.
- If the server is accessible internally but not externally, ask the customer to log in to the router's admin page and check the port forwarding setup. Get the router brand and model and research the configuration on Google. If you can get their internal IT to fix the port forwarding, that is better.
The client's router is congested
- Root cause: a large number of network broadcasts. This can indicate a misconfigured or faulty NIC, host, switch, software, driver, etc., and can also indicate a malware infection somewhere in the network.
- High usage by the client's internal users, such as downloading and buffering (streaming), consumes too much bandwidth.
- We can advise the customer to restart their modem and router: remove the power cord from the modem and/or router, wait at least 10 seconds, and then plug it back in.
The client's connection between EMP application server and database server
- Root cause: the connection between the application server and the database server is unstable. You can test this by connecting to the application server over ssh and opening a connection from there to the database server. Keep the connection active. If the connection drops within 10-15 minutes, there is a problem with the network.
- We can advise the customer to install a separate, dedicated switch with additional cables between the application and database servers.
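The URL and DNS checks above can be partly scripted. The sketch below is illustrative only: the function names and example host are hypothetical, the URL is assumed to include its scheme, and nslookup is assumed to return a non-zero exit status on failure (as the BIND nslookup on Linux does).

```shell
#!/usr/bin/env bash
# Sketch: split a URL into scheme, host and port, then check DNS for the host.
# Assumptions: the URL includes "scheme://"; IPv6 literals are not handled.

# Print "scheme host port", applying the browser defaults
# (80 for http, 443 for https) when the URL has no explicit port.
parse_url() {
    local url="$1" scheme rest hostport host port
    scheme="${url%%://*}"
    rest="${url#*://}"
    hostport="${rest%%/*}"
    host="${hostport%%:*}"
    if [ "$host" != "$hostport" ]; then
        port="${hostport##*:}"
    elif [ "$scheme" = "https" ]; then
        port=443
    else
        port=80
    fi
    printf '%s %s %s\n' "$scheme" "$host" "$port"
}

# Report whether a host name resolves, mirroring the manual nslookup step.
check_dns() {
    if nslookup "$1" >/dev/null 2>&1; then
        echo "DNS ok: $1"
    else
        echo "DNS FAILED: $1"
    fi
}

# Example:
#   parse_url "http://emp.example.com:8080/emp/login"
#   check_dns "emp.example.com"
```

Comparing the parsed port against the port the server actually listens on is a quick way to catch a wrong protocol or port before looking at the network.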

Troubleshoot Internet connection problems
http://windows.microsoft.com/en-us/windows-vista/troubleshoot-internet-connection-problems
Guide to set DNS
https://developers.google.com/speed/public-dns/docs/using?hl=en

4

Public Internet

If you are connecting over the public internet, you might have difficulty accessing the URL, which can be caused by:

Enhanced Protected Mode
Browser History
Browser Add-On
Proxy and DNS Settings
Browser needs to be reset
Check whether a third-party service, program, or anti-virus is conflicting with browser
Temporarily disabling the firewall
Updating older drivers and editing registry key TabProcGrowth
Checking Windows Updates for drivers
Restore or refresh your PC

Refer to this wiki for solutions

https://support.microsoft.com/en-us/kb/956196

5

Data Center Network

Identify the customer.
- Always refer to complete customer server list https://wavelet.atlassian.net/wiki/pages/viewpage.action?spaceKey=WM&title=Server+List+-+LIVE
Get the server URL from the customer; sometimes it is an IP address. Do not confuse the server IP address with a local network IP address, which usually looks like 192.168.X.Y.
Identify the location of the server.
- If the server is on the customer's own host or in a data center, contact the person in charge at the data center and check:
  - Internet connectivity of the data center or customer host (points 1 to 4)
  - Customer server condition.
  - Is the ssh port accessible? Check port forwarding access
  - Whether the backup database can be restored manually to Wavelet Cloud or AWS
- If the customer's server is located at AWS, log in to aws.amazon.com using the (staging-wavelet) account, go to EC2 Instances and check the instance status
  - Amazon Production Server Account
    https://staging-wavelet.signin.aws.amazon.com/console
  - Refer to
- If the customer's server is located at Azure, log in to
  - Azure Portal https://portal.azure.com/
  - Refer to /wiki/spaces/WU/pages/11567303
- If the customer's servers are located at the Wavelet Cloud office, refer to
  - /wiki/spaces/WM/pages/124715034

6

Server Network

A server network, or simply a network, is a collection of computers and other hardware components interconnected by communication channels that allow sharing of resources and information.
Basic networking knowledge is required. You don't need to be a Cisco Certified Engineer, but you do need to understand IP addresses, subnet masks, gateways, DNS servers, port forwarding, the internal and external IP addresses of the modem or router, and how DYNDNS works.
The majority of customers are located at AWS. To check the server network:
- Click the instance -> Monitoring. Observe the network in and network out graphs
- If network out is too high, use netstat to find which IP is causing it. Killing the offending process and using chmod to remove its execute permission will help
- Power up a new instance with new, safer instance security and migrate the customer database to it. This is to be done after working hours
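For the netstat step above, a per-IP summary is often easier to read than the raw output. This sketch counts connections per remote IP, which is only a proxy for outbound traffic (it does not measure bytes); the function name is hypothetical and IPv4 addresses are assumed.

```shell
#!/usr/bin/env bash
# Sketch: count connections per remote IP from "netstat -ntu" output on stdin.
# Assumptions: IPv4 "address:port" in column 5; connection count is used as a
# rough indicator only, it is not a measure of bandwidth.

connections_per_ip() {
    awk '$1 ~ /^(tcp|udp)/ { sub(/:[0-9]+$/, "", $5); print $5 }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example (on the server):
#   netstat -ntu | connections_per_ip | head -n 10
```

An unfamiliar IP with an unusually high connection count is a reasonable starting point for further investigation.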
Troubleshoot the server network
- Check IP Config
  - ifconfig (for linux customer)
  - ipconfig (for windows customer)
- Set Static IP via terminal
- Refer to :
  - https://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
- Solve Session Time Out error in Firefox
- Refer to :
  - http://support.filesanywhere.com/hc/en-us/articles/214950806-Resolving-Session-Time-Out-Errors-
- Linux System Security
- Refer to :
  - https://linas.org/linux/secure.html

7

Server Operating Systems

Server Operating Systems : Ubuntu, CentOS, Amazon Linux AMI

Useful Linux commands

- http://www.tecmint.com/useful-linux-commands-for-system-administrators/

- The difference between Ubuntu and CentOS (package manager)
    - Ubuntu : apt-get
    - CentOS : yum

Try to convince the customer to migrate from their own host server to our Amazon AWS cloud if their server operating system has problems.
- You can use these analogies
  - 1) Instead of keeping money in a safe deposit box or under the pillow / bed, people put their money in the bank.
  - 2) Instead of building our own power generator, we use Tenaga Nasional electricity. If you only count the fuel consumed to generate the electricity, it may be cheaper to operate your own power plant, but once you consider all the costs involved, it is definitely more expensive.
  - In addition, you may have more than just Wavelet EMP. If you have other corporate servers, such as a website, intranet or file server, you can host them all in the same cloud instance (if the operating system is Linux)
The advantage of using the AWS cloud is that the customer doesn't need to worry about:
- 1) Personnel cost to maintain the internet networking / hardware.
- 2) UPS power
- 3) Electricity
- 4) Backup servers
- 5) Scaling
- 6) Disaster recovery: everything is virtual, which means that at the click of a button we can have off-site disaster recovery… in Japan / USA or other places.

8

Server Application Server

To troubleshoot a slow or unresponsive server, always refer to the basic guide at /wiki/spaces/WU/pages/11568251

Steps to check:

Server Log
- As root, vim /usr/java/jboss/server/default/log/server.log
- Scroll to the last line (shift+G) of the server log
- Check the exception message and copy it, then find the first occurrence of that exception (disregard regular exceptions, for example the one caused by PDF printing “.. unable to forward…”)
- If the message looks something like ‘.. I/O exception…’ or ‘.. unable to connect to the database …’, check the database right away (refer to point 9, Server Database Server)
- If the message looks something like ‘.. out of memory…’, check the Java memory allocation
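Finding the first occurrence of an exception, as described above, can also be done with grep instead of scrolling in vim. This is a small sketch with hypothetical function names; it assumes a grep that supports the -m option (GNU and BSD grep do).

```shell
#!/usr/bin/env bash
# Sketch: locate an exception message in a log file.
# Assumption: grep supports -m (stop after N matches), as GNU and BSD grep do.

# Print "line_number:text" for the first line that contains the message.
first_occurrence() {
    local pattern="$1" logfile="$2"
    grep -n -m 1 -F -- "$pattern" "$logfile"
}

# Print how many lines contain the message.
count_occurrences() {
    grep -c -F -- "$1" "$2"
}

# Example (as root on the application server):
#   first_occurrence "OutOfMemoryError" /usr/java/jboss/server/default/log/server.log
```

The line number from first_occurrence can then be opened directly in vim with "vim +<line number> <file>".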
CPU, Memory usage
- To check the usage, type :
  top
- You can display the full command line of the running processes by typing c
- You can sort the processes by highest memory usage first by typing M (shift+m)
- If the memory is mainly used by PostgreSQL, refer to point 9
- If the CPU usage is high (>= 90%), run this command to list the 5 processes that consume the most CPU:
  ps aux | sort -nrk 3,3 | head -n 5
- Kill the process that has high CPU usage
  kill -9 <process id>
- Viruses are usually found inside the /tmp folder. If you find one there, remove it directly by running:
  rm -rf /tmp/*
- Put the cleanup command below in crontab -e for a user other than root, because attackers usually target root. For example, use emp, empbackup or opswork. Make sure the user you choose is a sudoer who can run sudo without a password.
- This crontab entry deletes the files inside the /tmp folder every minute
  */1 * * * * sudo rm -rf /tmp/*
- Make sure no password is needed when you type the command:
  sudo su -
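Before killing anything in the CPU check above, it can help to list only the processes that exceed the threshold so they can be reviewed first. A minimal sketch, assuming the standard "ps aux" column layout; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list processes at or above a CPU threshold from "ps aux" output.
# Assumption: standard "ps aux" columns (PID in column 2, %CPU in column 3,
# command in column 11).

# Read "ps aux" output on stdin and print "PID %CPU COMMAND" for busy processes.
high_cpu_processes() {
    local threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 && $3 + 0 >= t + 0 { print $2, $3, $11 }'
}

# Example:
#   ps aux | high_cpu_processes 90
```

Review the command name of each listed process before running kill -9 on it, so that JBoss or PostgreSQL is not killed by mistake.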
Operating System log
- Slowness could be caused by a hardware failure, which will be recorded in the O/S log. You can access the O/S log at /var/log/messages
JBoss status
- This is a no-brainer: just make sure JBoss is actually running,
- And that ONLY 1 instance of JBoss is running (if you are using a docker container, refer to /wiki/spaces/WU/pages/13042094 and /wiki/spaces/WU/pages/38928514)
- run command :
  ps aux | grep java
JBoss memory allocation
- If RAM usage is very high, check the memory allocation for JBoss : /usr/java/jboss-4.0.1RC2/server/default/deploy/jbossweb-tomcat50.sar/server.xml
- See whether -Xms is too low or -Xmx is too high compared to the RAM size.
In most cases the customer is impatient when the server is down, so immediately offer to restart their server when idle-in-transaction sessions occur. After restarting the server, check the server log and troubleshoot what happened.
Checking idle transactions :
select *
from pg_stat_activity
where (state = 'idle in transaction');

select count(*), state, datname
from pg_stat_activity
where state = 'idle in transaction'
group by datname, state
order by count(*) desc;
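To focus on sessions that have been idle in transaction for a while, the query can be narrowed by age. The sketch below only builds the SQL text; the function name, the 10-minute default and the postgres user in the example are assumptions, and the state_change column requires PostgreSQL 9.2 or later.

```shell
#!/usr/bin/env bash
# Sketch: build a query for sessions idle in transaction longer than N minutes.
# Assumptions: PostgreSQL 9.2+ (pg_stat_activity has state and state_change);
# the default of 10 minutes is arbitrary.

idle_tx_query() {
    local minutes="${1:-10}"
    printf '%s\n' \
        "SELECT pid, datname, usename, state_change, query" \
        "FROM pg_stat_activity" \
        "WHERE state = 'idle in transaction'" \
        "  AND state_change < now() - interval '${minutes} minutes'" \
        "ORDER BY state_change;"
}

# Example (user name is an assumption):
#   idle_tx_query 15 | psql -U postgres wsemp
```

Sessions at the top of the result have been idle the longest and are the first candidates to investigate.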
- If jboss-stop doesn't stop anything
- use (replace 2304 with the process id shown by ps) :
  ps aux|grep java
  kill -9 2304

  /etc/init.d/postgresql-9.2.4 restart
  jboss-start
- Check previous date server log :
  cd /usr/java/jboss/server/default/log/
  ls -lhrt
  - Check the specific date of the server log
Missing run.sh and shutdown.sh from server
- Refer to /wiki/spaces/WU/pages/37945457

9

Server Database Server

Check to make sure you are able to connect to the database
- If you can’t, if it is a
  - App / DB setup
  - Check if network is up (refer to above points 1 to 6)
  - Check if database server is up
    - If you are not able to connect to the wsemp database, most probably postgres has not been started yet.
    - As root, run /etc/init.d/postgresql-9.2.4 start and then try the psql command
  - Check if PostgreSQL is running
  - Check whether the database server has rebooted by using the last command. If a reboot occurred outside the cron script time (6am), ALERT: FAULTY SERVER FAN!
  - Combo setup
To look at running queries
  SELECT pg_stat_get_backend_pid(s.backendid) AS procpid,
  pg_stat_get_backend_activity(s.backendid) AS current_query
  FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
  Analyzing Idle In Transaction,
Postgresql Log
- Located at :
  /var/lib/pgsql/data/serverlog
- Errors are usually very apparent, they will tell you it's an error
- You can locate the related table / query that might be causing the problem
High Availability Load Balancing and Replication for PostgreSQL
Performance Tuning
When a customer complains that their reports run very slowly, but no slowness occurs after you repopulate the database on your localhost, do the following:
- Check whether the VACUUM is running.
  - To check, as the empbackup user, run :
    vim backup.log
    Make sure the latest backup log shows that the VACUUM is done.
- Inform the customer that you will repopulate the database again at night. In some cases, the slowness is gone after we repopulate their database. This can't be done during working hours unless the customer asks you to
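The VACUUM check above can be done without opening the log in an editor. This is a sketch under a stated assumption: the format of backup.log is not documented here, so the function simply looks for the word VACUUM near the end of the file. The function name and the 50-line window are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether VACUUM is mentioned near the end of a backup log.
# Assumption: the backup script writes the word "VACUUM" to the log when it
# runs; the real log format may differ.

# Succeeds (exit status 0) if "VACUUM" appears in the last N lines of the log.
vacuum_recently_logged() {
    local logfile="$1" lines="${2:-50}"
    tail -n "$lines" "$logfile" | grep -q 'VACUUM'
}

# Example (as the empbackup user):
#   if vacuum_recently_logged backup.log; then echo "VACUUM found"; else echo "VACUUM missing"; fi
```

A match only shows that VACUUM was logged; read the surrounding lines to confirm that it actually completed.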
RDS connection
Check on the staging account (RDS)
https://staging-wavelet.signin.aws.amazon.com/console
Click on "Show monitoring"
Select "DB Connection"
Make sure CPU, Memory and Storage do not exceed the red line
If they do not exceed the red line, check the Logs

If you find any process that looks unusual, for example :

May 24, 2017 at 7:59:49 PM UTC+8

2017-05-24 11:00:15 UTC:172.31.26.7(38486):sbs@sbs:[59139]:ERROR: update or delete on table "bl_fi_mst_entity_hdr" violates foreign key constraint "app_master_entity_line_app_master_entity_header" on table "bl_fi_mst_entity_line"
2017-05-24 11:00:15 UTC:172.31.26.7(38486):sbs@sbs:[59139]:DETAIL: Key (guid)=(6DC2A9FE-364A-4C71-87F3-2C773CDF49E2) is still referenced from table "bl_fi_mst_entity_line".
2017-05-24 11:00:15 UTC:172.31.26.7(38486):sbs@sbs:[59139]:STATEMENT: DELETE
FROM bl_fi_mst_entity_hdr
WHERE name IN (SELECT ety_hdr.name AS ety_hdr_name_indicator
FROM bl_fi_mst_entity_hdr AS ety_hdr
GROUP BY ety_hdr.name
HAVING COUNT(1) > 1)
AND guid NOT IN (SELECT MIN(ety_hdr.guid) AS min_unique_guid
FROM bl_fi_mst_entity_hdr AS ety_hdr
GROUP BY ety_hdr.name
HAVING COUNT(1) > 1)

May 24, 2017 at 8:59:34 AM UTC+8
2017-05-24 12:00:16 UTC:172.31.26.7(38486):sbs@sbs:[59139]:ERROR: update or delete on table "bl_fi_mst_entity_hdr" violates foreign key constraint "app_master_entity_line_app_master_entity_header" on table "bl_fi_mst_entity_line"
2017-05-24 12:00:16 UTC:172.31.26.7(38486):sbs@sbs:[59139]:DETAIL: Key (guid)=(6DC2A9FE-364A-4C71-87F3-2C773CDF49E2) is still referenced from table "bl_fi_mst_entity_line".
2017-05-24 12:00:16 UTC:172.31.26.7(38486):sbs@sbs:[59139]:STATEMENT: DELETE
FROM bl_fi_mst_entity_hdr
WHERE name IN (SELECT ety_hdr.name AS ety_hdr_name_indicator
FROM bl_fi_mst_entity_hdr AS ety_hdr
GROUP BY ety_hdr.name
HAVING COUNT(1) > 1)
AND guid NOT IN (SELECT MIN(ety_hdr.guid) AS min_unique_guid
FROM bl_fi_mst_entity_hdr AS ety_hdr
GROUP BY ety_hdr.name
HAVING COUNT(1) > 1)

The errors above occurred because the query kept trying to delete the data with guid 6DC2A9FE-364A-4C71-87F3-2C773CDF49E2 over a period of one hour, and each attempt violated the foreign key constraint.
It might come from the ETL or DMS server.
How to handle :
Refer to :
- /wiki/spaces/WM/pages/130565063
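Repeated errors like the ones in the log excerpt above are easier to spot when identical messages are counted. The sketch below assumes the log line prefix shown in that excerpt (everything up to ":ERROR:"); the function name and the example file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count identical ERROR messages in a PostgreSQL log read on stdin.
# Assumption: each error line contains ":ERROR:" after the timestamp/client
# prefix, as in the RDS log excerpt above.

repeated_errors() {
    sed -n 's/^.*:ERROR: *//p' | sort | uniq -c | sort -rn
}

# Example (with a log file downloaded from the RDS console):
#   repeated_errors < postgresql.log | head -n 10
```

A message with a high count that recurs at regular intervals usually points to a scheduled job rather than to a user action.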
RDS
- Check the Queue Depth chart. If the maximum reaches 40, some servers might hit the "Unable to connect to database" problem
- Go to the RDS postgresql log file and look for PSQL exceptions and long-running queries. You need to get help from the development team to analyse the queries and work out how to prevent them
- Observe RDS db connections and running queries /wiki/spaces/WM/pages/130117823

A common error is that postgres cannot start because postmaster.pid contains invalid data

To solve: http://intranet.wavelet.asia/projects/tech/wiki/When_postgres_cannot_start_as_postmasterpid_has_invalid_data

10

Server Hard Disk

Run this command
df -h
If the usage is more than 90%, free up some space by locating large files with this command
find / -type f -size +20M -exec ls -lh {} \; | awk '{ print $NF ": " $5 }'
A 100% full HDD might be caused by /var/log/cups/ . Refer to /wiki/spaces/WU/pages/20579213
A 100% full hard disk might also be caused by the server log.
Steps to handle it if it is caused by server.log :
locate server.log
cd /usr/java/jboss-4.0.1RC2/server/default/log/
ls -lhrt
- if you see a large server.log
  - Gzip the server.log file
  - Transfer the gzipped server.log file to s3
  - Then remove the server.log file that was transferred
  - Hard disk space should be freed after that
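The 90% check above can be automated so that only the full filesystems are listed. A minimal sketch, assuming the POSIX "df -P" output format and mount points without spaces; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list filesystems at or above a usage threshold.
# Assumptions: POSIX "df -P" layout (capacity in column 5, mount point in
# column 6); mount points contain no spaces.

# Read "df -P" output on stdin and print "mount_point capacity" for full disks.
full_filesystems() {
    local threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t + 0) print $6, $5 }'
}

# Example:
#   df -P | full_filesystems 90
```

An empty result means no filesystem has reached the threshold, so disk space is unlikely to be the cause of the slowness.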

Note: Don't delete anything inside /var/lib/pgsql/data; the database is stored inside that folder. It is normal for that folder to be large

11

Wavelet EMP Itself

Slowness may occur only in a specific EMP module, caused by inefficient code or SQL queries

Ask the customer which module is slow and test from your own PC or laptop whether the module is really slow. Run the showlog command and observe whether any exception is logged. If you are not confident with the code, get a programmer to sit beside you and help troubleshoot.
- Variety of exceptions:
  - org.postgresql.util.PSQLException. Solution: copy the query logged just before the error and report it to the programming team so they can fix the SQL query.
  - NullPointerException. Solution: repopulate the database on your localhost. Run the debugger in Eclipse and find which table or query causes the null pointer exception. In most cases, the error is solved once you find which table has null data and update it to a value that depends on the "Type". For example, if the Type is integer, update it to 0; if it is string, update it to '' (empty string)
A common case is that modules are missing from EMP. Please refer to the wiki below to fix the problem.
- GUID exception error, UnknownHostException: Menu toolbar missing after reboot

12

Usage of EMP

Sometimes users are not aware that the way they generate reports inside EMP is one of the causes of server slowness; they tend to generate reports with long date ranges and heavy data at peak operating hours
Clicking the "generate report" button multiple times also causes slowness. Every time the user clicks the button, the function calls the database and passes back all the values shown in the report. If the user mistakenly clicks 5 times, the function does the same work 5 times, which might result in out-of-memory problems and idle transactions.
Solutions:
- Advise the customer to generate heavy reports outside peak hours, for example after 8PM and before 9AM
- Be careful when choosing the date range. Reports that include serial numbers tend to show a lot of data and entries. Advise the customer to choose a date range of at most 1 week
- If the customer denies having generated a heavy report when the slowness occurred, go to server.log and show the proof to the person in charge.
  - Include the username, date, report name, date range and a screenshot. You may use point 9 to check