Splunk SAI and Metrics

I recently had a play with Splunk SAI (Splunk App for Infrastructure) and wanted to monitor a number of Linux systems.

My options for collecting metric data from the systems were to use collectd and send the data to HEC, or to use Luke Harris's TA-Metrics add-on https://splunkbase.splunk.com/app/4856/ and send the data to the indexers.

I opted for the metrics add-on as it seemed much easier. SAI has simplified deploying collectd plus the agent, but to me this TA was easier still, and it supports SAI.

I created a linux_metrics index in Splunk:

[linux_metrics]
homePath = $SPLUNK_DB/linux_metrics/db
coldPath = $SPLUNK_DB/linux_metrics/colddb
thawedPath = $SPLUNK_DB/linux_metrics/thaweddb
datatype = metric
frozenTimePeriodInSecs = 2419200

The inputs.conf in the default folder is already set up for you; it polls every 5 minutes for various metrics. I then created a local folder in the TA-linux-metrics folder and copied in process_mon.conf with the configuration below.

allowlist = CROND,run*,systemd*,chronyd,rsyslogd,auditd,journal,su,splunk,gnome-session,NetworkManager,dnsmasq-dhcp,dnsmasq,nm-dispatcher,snmpd,network,crond,accounts-daemon,gdm

I got the above process list using the Nix TA https://splunkbase.splunk.com/app/833/ – I exported the processes into a CSV file using a simple SPL search as below.

index=linux process=* sourcetype=top
| dedup process
| table process
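To write the process list straight out to a CSV file on the search head, the same search can be extended with outputcsv (the filename here is just an example):

```
index=linux process=* sourcetype=top
| dedup process
| table process
| outputcsv linux_processes.csv
```

The file lands in $SPLUNK_HOME/var/run/splunk/csv on the search head, ready to trim down into the allowlist.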

After the config, I used the deployment server to deploy the TA-Metrics to all the Linux systems, and I could then see the data in the Analytics Workspace in Splunk.

I installed SAI onto the search head – https://splunkbase.splunk.com/app/3975/

I installed the SAI infrastructure add-on onto the indexers – https://splunkbase.splunk.com/app/4217/

I configured the sai_metrics_indexes macro to point to the linux_metrics index, and I could then see the metrics data and the entities in the SAI app.
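For reference, the macro override boils down to a one-stanza change in a local macros.conf – a sketch, assuming a default SAI install (the app folder name and the macro's default definition may differ by version):

```
# macros.conf in the SAI app's local directory (exact path varies by version)
[sai_metrics_indexes]
definition = index="linux_metrics"
```

A Splunk restart or a bump of the app isn't needed for macro changes; the next search picks it up.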

A big shout-out to Luke's TA – it makes collecting metrics so much easier, and SAI is great for monitoring OS systems.

Windows Services Monitor In Splunk


This is a quick way of monitoring your Windows services.

After ingesting Windows data via the Splunk Add-on for Microsoft Windows (https://splunkbase.splunk.com/app/742/, version 4.8.4), I wanted to see which services were set to Auto but not running.

I wanted to ensure any important services were up and actually running, so by running the search below I could capture these services from some test hosts.

SPL: index=windows sourcetype=WinHostMon  Name=* StartMode=Auto State=Stopped | stats values(DisplayName) by host

From the search I could see the SNMP and Firewall services were stopped but should be running.
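One caveat worth noting: the search above matches any historical event, so a service that was stopped and later restarted can still show up. A sketch of a variant that keys off the most recent reported state per service instead:

```
index=windows sourcetype=WinHostMon StartMode=Auto
| stats latest(State) as State, latest(DisplayName) as DisplayName by host, Name
| where State="Stopped"
```

Constrain the time range to a little more than the polling interval (e.g. the last 15 minutes) so you only judge services on fresh data.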

The below is part of the config from the Windows Add-on inputs.conf, which collects the data. Set these to run every 300 seconds (5 minutes). Once configured, deploy it to some Windows test nodes running the Universal Forwarder and do some search tests.


Ensure you have deployed the add-on's props and transforms to the search heads / indexers for the parsing, otherwise you won't see the field names.

###### Host monitoring ######
[WinHostMon://Process]
interval = 300
disabled = 0
type = Process
index = windows

[WinHostMon://Service]
interval = 300
disabled = 0
type = Service
index = windows



Splunk Index Cluster Config


This is how to set up a Splunk indexer cluster with search heads and replicate an index.

(The below is a very simple setup, good for lab work. Since passing my Core Implementation exam, I would advise you to apply best practice when setting up a Splunk cluster: typically you need three indexers and three search heads as a minimum, and you can use the base configs rather than the GUI method. If you don't have the resources for a cluster, then use the distributed method.)

My Lab Environment:

  • 5 x RHEL 6.5 servers
  • Splunk 7.1 Enterprise
  • 1 x master (this controls the peers)
  • 2 x indexer servers (this is where the index files / log data are stored)
  • 2 x search heads (these are the servers used for searching the data)

After building the RHEL servers and installing Splunk 7.1, you start by configuring the master server.

Step 1 On the master server, log in as admin and go to Settings > Indexer clustering > click Enable indexer clustering and set to Master Node.


Step 2 Set as below:

  • Replication Factor = 2 (this is because there are 2 nodes acting as peers)
  • Search Factor = 2
  • Security Key = (Choose a password)
  • Cluster Name = splunk_lab


Press enable and Splunk will restart.

Step 3 Log in to the Splunk peer node 1 server as admin and go to Settings > Indexer clustering > click Enable indexer clustering and set to Peer Node.


Step 4 Then set the below

  • Master URI = (Master FQDN)
  • Peer replication port 8080
  • Security Key = (Password from master config)


Step 5 Press enable peer node and press Restart now. (The service will restart)

Step 6 Do the same for the second peer node

Step 7 Log in to the master node and check the config (Settings > Indexer clustering).


Step 8 Configure the index of your choice to be replicated – I created one as below. On the master (the opscx1 server), go to the folder /opt/splunk/etc/master-apps/_cluster/local, create an indexes.conf file and add the below.
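A minimal indexes.conf sketch consistent with the dc_security index searched later in this post – repFactor = auto is the setting that tells the cluster to replicate the index's buckets across the peers:

```
[dc_security]
homePath = $SPLUNK_DB/dc_security/db
coldPath = $SPLUNK_DB/dc_security/colddb
thawedPath = $SPLUNK_DB/dc_security/thaweddb
repFactor = auto
```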


Step 9 Push the configuration from the master node

On the master node, in Splunk Web, click Settings > Indexer Clustering.
The Master Node dashboard opens.

Click Edit > Configuration Bundle Actions.

Click Validate and Check Restart > Validate and Check Restart.
A message appears that indicates whether the bundle validation and restart check succeeded or failed.


Step 10 Generate some data using my scripts and load the file, or create a monitor – see my earlier post on how to do this.


Step 11 From the master, check the data. (You would normally configure the search heads for normal use; this is just to check the data.)


Step 12 Check the indexes file on the peer nodes – it should match the one on the master:


cat /opt/splunk/etc/slave-apps/_cluster/local/indexes.conf

Step 13 Add search heads

Log in to one of the search heads and go to Settings > Indexer clustering > Enable indexer clustering.


Press next and add the below:
Master node = FQDN of your master node
Security Key = (from the master node config)


Enable search head node and restart Splunk.
Log back into the search head and you should see a screen similar to the screenshot below.


Run a search on your index and you should get the results as before:

index="dc_security" (this is the name of your index)


Do the same on the other search head server, then login to the master and you should see both search heads.




Splunk / Python Script / Syslog Demo Data


Officially now a Splunk Certified User: (Splunker!)


With that in mind, I thought I'd create a demo script to load some log data into Splunk, to show the data and some charts.

As the data is time series (time-stamped), you'll be able to see information from the log entries and a few charts.

The use case could be to show all syslog data from all pre-defined critical servers, then run searches against the data for particular log events / messages.

A typical search could be “Failed from ip_address”

I created a Python script which generates a log file called dc_security.log. The script has two options: one to generate data quickly, and one with delays so you can leave it running for a while to build up data and show it over time in Splunk. The log data is based on syslog format, and I have put various messages in the log events.

Example Data:

Dec 2 02:51:31 LINUX_SRV3 user joker has tried to login to this server and failed from ip_address
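The real script is on GitHub (linked below); as a rough illustration, a minimal Python sketch that generates lines in this format might look like the following (the host and user names are made up for the example):

```python
import random
import time

# Made-up hosts and users for the demo data
HOSTS = ["LINUX_SRV1", "LINUX_SRV2", "LINUX_SRV3"]
USERS = ["joker", "riddler", "penguin"]

def make_entry(now=None):
    """Return one syslog-style line like the sample above."""
    ts = time.strftime("%b %d %H:%M:%S", time.localtime(now))
    host = random.choice(HOSTS)
    user = random.choice(USERS)
    return ("%s %s user %s has tried to login to this server "
            "and failed from ip_address" % (ts, host, user))

def write_log(path="dc_security.log", count=100, delay=0.0):
    """Write `count` entries; set delay > 0 to spread them over time."""
    with open(path, "w") as f:
        for _ in range(count):
            f.write(make_entry() + "\n")
            if delay:
                time.sleep(delay)
```

Calling write_log() with a non-zero delay is what gives Splunk a spread of timestamps to chart over.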

After you download the script, run it – sudo python ./dc_security_v1.0.py – and select either option.

Download the script from Github



After the file has been created, copy dc_security.log to a machine from which you can access the Splunk web GUI, and upload the data into an index called dcsecurity (or an index of your choice).

Splunk > Add Data > Upload Data > drag the file into the target box > next


Set the source type – as the log file looks like syslog entries, choose Operating Systems > linux_messages_syslog.

Press Next > Review > Submit > Start Searching

From the search bar, type the below query and you should get a display of all the logs from that file:

sourcetype=linux_messages_syslog

Here’s a few charts based on the data

Search 1: This searches for the message failed AND from ip_address:

sourcetype=linux_messages_syslog failed AND from ip_address



Search 2: This search query shows which server has had the most failed login attempts. You can see there are 3500 events – this could be an IoC (Indicator of Compromise), such as a brute-force attack.

It's also showing the source IP; based on this you would run your security process and actions.

sourcetype=linux_messages_syslog failed AND from ip_address | top host


Hopefully this shows you how to get some log data into Splunk, then run some quick searches, and create some charts for insights.

In a production environment you would use forwarders to collect and forward the syslog data on an ongoing basis.




SIEM (Security Information and Event Management) is another area of monitoring that has come into the foreground; with all the hacks, it should be a top priority for all organisations.

The aim here is to consolidate logs from many different sources in real time – typically structured/unstructured data such as firewall, SNMP, DNS, switch, router, LDAP, IDS, Apache/IIS, database and application logs. Once this data is received and ingested into the tool's engine, it is normalised or indexed in such a way that you can run searches, create time-series charts, correlate events, retain the data for compliance, and alert – all in real time. These are just some of the features, and the tools are very fast at it. The reason they are so fast is that they don't use traditional database schemas for accessing the data; the datasets tend to live in files which are compressed and indexed, so the tools can search the data a bit like Googling.

Data is typically collected via agents, which come in one form or another; regex can be used to parse the data, or the raw data can be ingested as-is. As a timestamp is typically recorded on the logs, it's often used as part of the time-series charts.

I started to look at two SIEM solutions, as they seem to be the ones many people talk about these days. There are many other logging solutions, and they are not just for SIEM – you can also get business-related data from logs, such as how many users bought a particular product.

I wanted to get a feel for them and find out how easy or complex they were to use and to present some data. This is not a full-on review, but I always try to put on my admin hat from many years ago to see how a tool works under the hood and whether I would take to it.

Basic Assessment Criteria:

  • Ease of install and config
  • Ease of getting data into the engine (Apache / security logs is what I looked into)
  • Ease of presenting the data
  • Documentation quality

Both on RHEL6.5

Splunk 7

Install and config on my RHEL 6.5 server was easy. I configured a forwarder to point at the logs I wanted and to send them to the Splunk server, which was doing all the collecting, indexing and parsing; Splunk has a feature which will detect the type of data source from the logs. Once the data was in I could run my searches; the sourcetype was linux_secure.

This simple query shows the data from the /var/log/secure log file:

index=ops_security_idx* "Failed password"


This simple query shows the data in a chart. (I know it's only from one server, but if you had a thousand servers you could scan the data in a short time and find which ones have the most login failures; this could show which servers someone is trying to get into – simple but very effective.)

index=ops_security_idx* "Failed password" | top host (this uses my index called ops_security, searches for the string "Failed password", then pipes the results to the top function, which uses the host field)



Elastic Stack

Install and config on my RHEL 6.5 server was fairly easy. I had to download a number of RPM files (Elasticsearch / Logstash / Kibana / Filebeat) and set up a few dependencies. Once installed, I had to work through a number of configuration files to get the components working, including Filebeat (the agent) to get some data into the Elasticsearch engine.

Once the data was in I could run my searches, the data source was linux_secure


Both SIEM tools are good, but in today's fast-paced world of monitoring, if you want something that fits into a DevOps mode of operation and delivers value quickly, then Splunk would be my first choice; if you have time for setup and configuration, then Elastic would be just fine. Both are good tools, with good documentation and forum support, and both have free versions with limitations. I found Splunk easier to install and configure in my lab, but for the enterprise you would need to consider many other variables. There are many add-ons, apps and components to enhance either solution to your needs, and a lot of good knowledge in the community.

At enterprise scale both tools would need to scale to your use cases, so a good performance and sizing exercise would need to be carried out, as the volume of data ingested will start to grow very quickly; if the infrastructure is not fit for purpose it can lead to issues in the long run. The infrastructure that supports these tools needs to be future-proofed and easy to scale out should you need to.

Both have cloud (SaaS) offerings which are easy to set up for a trial period.

After my tinkering, I realised that by applying the same project principles as I did for systems and application monitoring projects, you can successfully implement SIEM-based tools by following a few key steps.

  1. Analyse, with all the organisation's stakeholders, what log data they would like to use for log analysis, and define the use cases.
  2. Create a charter covering the business, technical, functional and non-functional requirements, and define the logging strategy.
  3. Select the top SIEM players on the market – Gartner has good articles on this – then run a POC.
  4. Based on the POC, create a design (LLD/HLD); define the infrastructure and data sources (web, security logs etc). Don't log everything – only data you're interested in.
  5. Train and develop skills for admins and users.
  6. Deploy as per the design in a phased manner. Phase 1 should not include everything – only the critical features and functions required – but add more over time and grow the solution; this way it will become a mature solution.
  7. Govern and maintain the solution – speed is everything for a logging solution, so ensure the infrastructure is performing as it should, and retire old configurations and apps that are not in use. Don't leave it to sort itself out: if you no longer monitor various components, ensure that data is not being collected, and keep the solution up-to-date.