Part 2
Part 2. Basic Query Structure
References: - ClickHouse 'WHERE' documentation - ClickHouse 'Functions for working with strings' documentation
You may want to run queries which include logic to return results only matching certain values. these are specified using a WHERE clause.
For this exercise, we will use = , AND, NOT, OR and LIKE.
NOTE: '=' is used to specify an exact match and is used for all data types. You MUST put string values, IP addresses, number ids inside single quotes (ie source_hostname = ‘SenseOn’). For columns that return INT32 or INT64, the quotes can be omitted. When manipulating numbers, the usual BIDMAS rule of order applies, use brackets to ensure the right operators apply in the correct order.
When chaining multiple conditions, a logic operator AND, OR must be specified at the start of each new matching condition. You can embed these in brackets eg time_observed > 1058926028 AND (column = value 2 OR column = value 3) means it must equal 2 or 3 and the time must be greater than the number given. Without brackets, it will return value = 2 when the time condition is also met, but will return all matches where its value = 3 regardless of the time.
IMPORTANT: For the sake of speed and for all queries in questions to work , include the line
_time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000
Straight after the WHERE clause. This and other time operations will be explained further down the test.
Structuring WHERE statements
The structure for WHERE statements is:
SELECT
WHERE
<column name> <operator> <matching conditions>,
<logic operator (AND.OR)> <column name> <operator> <matching condition>
ORDER BY
NOTE: to use the LIKE operator, enter the wildcard value ‘%' before and/or after the text you want to match. This allows you to easily specify certain text within an entry you want to return. WHERE name LIKE ‘%.exe’ - will return all entries in the name column that end with .exe , regardless of the text in front of that. WHERE name LIKE ‘%edge%’ will return all entries that have ‘edge’ somewhere in the text, regardless of what is in front of or after it.
2a. Common WHERE operators
Using the = operator
= is used to return an exact match of a condition.
Write a query which returns the time observed, hostname , eventid and data from the endpoint_windows_event_log table where eventid = ‘4624’, with a limit of 20
(REMEMBER: use the _time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000 line.
Note: If no results are returned, remove the eventid check
Answer
SELECT _time_observed, _hostname, eventid, data
FROM endpoint_windows_event_log
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000
AND
eventid = '4624'
LIMIT 20
Using the LIKE operator
LIKE can be used where a partial match is needed. Using a % on either (or both) sides of the matching condition can be used to find this within strings. For example, name LIKE 'bat%' will return batman and batmobile but not acrobat.
Write a query that returns a list of the _hostname, name,command and parent name from the endpoint_process table, for 20 entries containing ‘dll’ somewhere in the name, ordered alphabetically by the hostname in the last 24 hours.
Hint - operator
The LIKE operator used on the `name` column with '%dll%' would be useful here.Answer
SELECT _hostname, _platform, name, command, parent_name
FROM endpoint_process
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000
AND
name LIKE '%dll%'
ORDER BY _hostname asc
LIMIT 20
Next, modify the query so it returns a distinct list of hostname, process names and parent_names in the past 12 hours, ordered alphabetically by hostname, for records containing dll in the process name
Answer
SELECT distinct _hostname, _platform, name, command, parent_name -- add distinct
FROM endpoint_process
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), 12)) * 1000 --change to 12h
AND
name LIKE '%dll%'
ORDER BY _hostname asc
LIMIT 20
Using the NOT operator
NOT LIKE can be used to exclude results. We will next use this alongside other conditions to form a more complex query.
Now, we'll look through the endpoint_program table to find specific Microsoft-published programs, but not with .NET or C++ in the package name.
Adjust the query to look at the ‘endpoint_program’ table , returning a distinct list of the hostnames, and name where the publisher name contains the string ‘Micro’ but not ‘.NET’ and not ‘C++’ in the name in the past 24 hours. Limit this to the first 100 results. Do not select the 'publisher' for display - but still use this column in the where clause for filtering.
Answer
SELECT distinct _hostname, name
FROM endpoint_program
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000
AND
publisher LIKE '%Micro%'
AND
name NOT LIKE '%.NET%'
AND
name NOT LIKE '%C++%'
ORDER BY _hostname asc
LIMIT 100
2b. Case sensitivity
SQL databases are case sensitive, however we can make it return all results (regardless of case) by using LOWER or UPPER , followed in brackets by the column name EG lower(name) = ‘text in lower case'.
Write a query that returns the hostname, name, and parent_name for records that contain ‘excel’ in the name in the past 3 days from the endpoint_program table. Notice that there aren’t any (or few)
Answer
SELECT _hostname, name, parent_name
FROM endpoint_process
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), (24*3))) * 1000 -- changed to 3 days
AND
name LIKE '%excel%'
LIMIT 20
Now, use lower on the name column to ignore case sensitivity.
Hint
You'll need to use `lower (name)`Answer
SELECT _hostname, name, parent_name
FROM endpoint_process
WHERE
_time_observed >= toUnixTimestamp(subtractHours(now(), (24*3))) * 1000
AND
lower(name) LIKE '%excel%' --added lower
LIMIT 20
You should notice we now get results for EXCEL as well as excel now, too.
2c. Mathematical operations in the SELECT clause
Mathematical operations can be performed in the WHERE clause, which is the most common place they will be used. However, they can be used in certain situations within the SELECT clause when dealing with data of a numerical type.
Write a query that looks back 24 hours in the network flow table , with the time , hostname, destination host, application protocols and data downloaded and uploaded. Order by the most data received, converting the bytes into megabytes (use the \ function to divide). Give all the columns friendly names, and set a limit of 50 results
Answer
```sql SELECT _time_observed AS "Time Observed", _hostname AS "Hostname", dest_hostname AS "Destination Hostname", application_protos AS "Application Protocols", source_bytes_sent/1000000 AS "Source MB sent", dest_bytes_sent/1000000 AS "Destination MB sent" FROM network_flow WHERE _time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000 ORDER BY dest_bytes_sent desc LIMIT 50 ```Write a query which shows a list of distinct USB devices, also with the host, platform, hardware model, partitions, type, and the capacity (converted to MB) - order by this column, largest disk first. You'll need to find the table to use in Table Reference.
Hint - which table to use
The table that would help us here is `endpoint_disk_info`Hint - which column to use for distinct
The column `_hardware_id` would be a good one to use for distinct values.Answer
```sql SELECT DISTINCT _hardware_id, _hostname, _platform, hardware_model partitions, type, disk_size/1000000 AS "Disk Size (MB)" FROM endpoint_disk_info WHERE _time_observed >= toUnixTimestamp(subtractHours(now(), 24)) * 1000 ORDER BY disk_size desc ```Great! Now you can move onto Part 3 of the Hunt Lab training
Part 3 - coming soon!