Beyond Outlier Detection: LookOut for Pictorial Explanation
This is the main file to be run. It takes in the user arguments and facilitates the calls to the different functions that consume them.

The user arguments are explained as follows:
| Short | Long | Description |
|---|---|---|
| `-f` | `--datafile` | The file with data to fit the model on |
| `-t` | `--trainfile` | The file with data to train the model on |
| `-l` | `--logfile` | The logfile ; default: `log.txt` |
| `-df` | `--datafolder` | The folder containing the datafile and trainfile ; default: `Data/` |
| `-lf` | `--logfolder` | The folder containing the logfiles ; default: `Logs/` |
| `-pf` | `--plotfolder` | The folder into which the plots are output ; default: `Plots/` |
| `-d` | `--delimiter` | The csv datafile delimiter ; default: `,` |
| `-b` | `--budget` | Number of plots to display ; default: 3 |
| `-n` | `--number` | Number of outliers to choose ; default: 10 |
| `-p` | `--pval` | Outlier score scaling factor ; default: 1.0 |
| `-s` | `--show` | Store all generated plots in the plotfolder ; default: false |
| `-bs` | `--baselines` | Run the baseline algorithms ; default: false |
| `-mrg` | `--merge` | Pick the global set of outliers from a merged ranklist ; default: false |
| `-if` | `--iforests` | Pick the global set of outliers using iForests ; default: false |
| `-dict` | `--dictated` | Use a dictated global set of outliers (see feature_file.py) ; default: false |

This file takes the bipartite graph between outliers and plots as input and runs the LookOut algorithm to obtain the b (budget) best plots.
There are also two baselines that can be run to compare against LookOut, namely Greedy TopK and Random Selection.
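For intuition, here is a minimal sketch of the selection step, assuming `scores` is the outlier-by-plot weight matrix of the bipartite graph; the function name and the NumPy representation are illustrative, not the repo's actual interface:

```python
import numpy as np

def lookout_greedy(scores, budget):
    """Greedily pick `budget` plots that together best expose the outliers.

    scores: (n_outliers, n_plots) array, where scores[i, p] is the edge
    weight between outlier i and plot p in the bipartite graph. The
    objective f(S) = sum_i max_{p in S} scores[i, p] is monotone
    submodular, so greedy selection carries a (1 - 1/e) guarantee.
    """
    n_outliers, n_plots = scores.shape
    covered = np.zeros(n_outliers)   # best score each outlier gets from chosen plots
    chosen = []
    for _ in range(min(budget, n_plots)):
        # marginal gain of adding each plot to the current selection
        gains = np.maximum(scores, covered[:, None]).sum(axis=0) - covered.sum()
        gains[chosen] = -np.inf      # never re-pick a plot
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(covered, scores[:, best])
    return chosen
```

Because the coverage objective is submodular, each pass only needs the marginal gain of every remaining plot, which keeps the selection cheap even for large plot sets.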
This file creates an iForests model. It trains the model on the training data and then scores the test data with the trained model. The scores can be further improved by also training a second model on the test data itself and generating new test scores; the two sets of scores can then be interpolated for best results.
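A minimal sketch of this train-then-blend idea, using scikit-learn's `IsolationForest` as a stand-in for the repo's own iForests implementation; the blending weight `alpha` and the function name are assumptions:

```python
from sklearn.ensemble import IsolationForest

def blended_scores(train_X, test_X, alpha=0.5, seed=0):
    """Score test points with a model fit on the train data, then blend
    with scores from a second model fit on the test data itself."""
    model_train = IsolationForest(random_state=seed).fit(train_X)
    model_test = IsolationForest(random_state=seed).fit(test_X)
    # score_samples is higher for inliers, so negate: larger = more anomalous
    s_train = -model_train.score_samples(test_X)
    s_test = -model_test.score_samples(test_X)
    return alpha * s_train + (1 - alpha) * s_test  # linear interpolation
```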
The objective of this file is to process the outlier scores from each pairwise scatter plot and generate an output data matrix that can be used to populate the bipartite graph. We generate two types of output matrices, scaled_matrix and normal_matrix. The scaled_matrix is identical to the normal_matrix, except that all scores are scaled by a factor pval using a scaling function defined in helper.py.
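The exact scaling function lives in helper.py; one plausible form, shown here purely as a labeled assumption, is max-normalization followed by a power controlled by `pval`:

```python
import numpy as np

def scale_scores(normal_matrix, pval=1.0):
    """Hypothetical scaling: max-normalize each plot's scores to [0, 1],
    then raise them to the power pval (pval > 1 sharpens the ranking,
    pval < 1 flattens it)."""
    col_max = normal_matrix.max(axis=0, keepdims=True)
    return (normal_matrix / np.where(col_max == 0, 1, col_max)) ** pval
```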
This file returns the global outlier objects that serve as the points of interest for the user. These global outliers can be calculated via three methods:

- **merge** : The ranklists of all the points, built from their individual 2-dimensional iForest scores, are merged into a single ranklist based on a merging algorithm (one possible rule is sketched after this list).
- **iforests** : The set of n outliers is chosen by running the iForests algorithm in the complete multidimensional feature space; the top n scored points are chosen.
- **dictated** : The user defines the set of outlier ids in feature_file.py. In this case the number of outliers chosen equals the length of the outlier id list.
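Since the actual merging algorithm is defined in this file, the following is only an illustration of one plausible rule: rank every point by the best position it reaches in any per-plot ranklist.

```python
def merge_ranklists(ranklists, n):
    """Merge per-plot ranklists by giving each point the best (lowest)
    rank it achieves in any single ranklist, then taking the top n.

    ranklists: lists of point ids, each sorted most-anomalous first.
    """
    best_rank = {}
    for ranklist in ranklists:
        for rank, pid in enumerate(ranklist):
            best_rank[pid] = min(rank, best_rank.get(pid, float("inf")))
    return sorted(best_rank, key=best_rank.get)[:n]
```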
This file, along with test.py, controls the flow of the program. It particularly looks at creating the environment in which the LookOut algorithm runs.

This file contains various helper functions that are used by several of the algorithm files. It serves the dual purpose of removing incidental logic from the main files and increasing code reuse; broadly, the helper functions are grouped by purpose.
This file contains functions that help create the different types of plots that might be useful to the user. It consists of four main plotting functions.
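As an illustration of the kind of plot these helpers produce, here is a minimal sketch of a pairwise scatter with the chosen outliers highlighted; the function name and arguments are assumptions rather than the file's actual API:

```python
import matplotlib.pyplot as plt

def focus_plot(df, x_field, y_field, outlier_ids, out_path):
    """Pairwise scatter of two feature columns, highlighting outliers."""
    mask = df.index.isin(outlier_ids)
    plt.figure()
    plt.scatter(df.loc[~mask, x_field], df.loc[~mask, y_field],
                s=10, c="steelblue", label="normal")
    plt.scatter(df.loc[mask, x_field], df.loc[mask, y_field],
                s=40, c="crimson", marker="x", label="outlier")
    plt.xlabel(x_field)
    plt.ylabel(y_field)
    plt.legend()
    plt.savefig(out_path)
    plt.close()
```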
This file declares four classes that are used by the LookOut algorithm to calculate the best visualization plots.
This file declares two classes that deal with data representation.
This file contains parameters that specify the display variables such as the terminal color prompts, and the styling of standard logging functions.
This file includes all the data- and feature-specific variables:

- **identity_field** : declares the identity object column.
- **identity_is_time** : specifies that the identity object is time based; used to make the time-series graphs.
- **entry_limit** : defines the lower limit on the number of entries of an object from the identity_field. An object with fewer entries than the limit will be ignored.
- **time_series_data** : defines whether or not the data is temporal in nature, with time-based data entries.
- **timestamp_field** : declares the column name that contains the timestamp data.
- **aggregate_fields** : a list of the aggregate field columns.
- **object_fields** : a list of the object field columns.
- **norm_field** : declares an aggregate column to use as the base; all other aggregate fields will be normalized against this base column (see the sketch after this list).
- **outlier_list** : used to specifically observe the characteristics of certain points of focus; a list of the object ids of those objects.
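As a hedged illustration of how norm_field might be applied (the actual normalization lives in the feature-extraction code), every other aggregate column is divided by the base column:

```python
import pandas as pd

def normalize_aggregates(df: pd.DataFrame, aggregate_fields, norm_field):
    """Divide every other aggregate column by the base column so values
    are comparable across identities with different activity levels."""
    for field in aggregate_fields:
        if field != norm_field:
            df[field] = df[field] / df[norm_field]
    return df
```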
This file is used to create a datafile (.csv) with the desired entries. The user can provide their preferences via the following options:

| Short | Long | Description |
|---|---|---|
| `-t` | `--team` | The team id ; default: 15 (LimeStone) |
| `-p` | `--product` | The product id ; default: 2 (Futures) |
| `-v` | `--venue` | The venue id ; default: 23 (CME) |
| `-s` | `--sid` | The symbol id ; default: 0 |
| `-y` | `--year` | The year of the historical file ; default: 2018 |
| `-m` | `--month` | The month of the historical file ; default: 5 |
| `-d` | `--day` | The day of the historical file ; default: 29 |
| `-b` | `--bucket` | The bucket size (data sampling rate in seconds) ; default: 30 |
| `-pr` | `--period` | The periodicity of the data ; default: 0 (1 day) |

It makes a call to the elastic search engine with all the above parameters and generates a corresponding file (csv), which is placed in the Data folder.
This file is used to create a datafile (.csv) with the desired entries. The user can provide their preferences via the following options:

| Short | Long | Description |
|---|---|---|
| `-f` | `--datafile` | The file from which to extract data ; default: "" |
| `-t` | `--targetfile` | The file to which the data is exported ; default: target.csv |
| `-m` | `--mode` | The type of extraction {full, partial, random} ; default: full |
| `-p` | `--portion` | The fraction of data to extract ; default: 1.0 |
| `-i` | `--include` | The columns to include ; default: all |
| `-e` | `--exclude` | The columns to exclude ; default: none |

It reads a csv file and generates a corresponding target file (csv), which is placed in the Data folder. It should be used in conjunction with create_files.py to further selectively modify the data.
This file is used to extract features from the read datafile. It makes use of the Python pandas library and transforms the data to create feature data objects.

There are four main processing steps in this file:

1. **Filtering** : identities that do not meet certain criteria are dropped; the `entry_limit` variable declared and defined in feature_file.py is one such filter (`entry_limit = 0` disables it).
2. **Identity features** : `IDs` and `COUNT`.
3. **Time-series features** : `LIFETIME`, `IAT_VAR_MEAN`, `MEAN_IAT` and `MEDIAN_IAT`. In the case of aggregated time series data we won't create time-series features.
4. **Aggregate and object features** : `FIELD` (summed) and `stddev_FIELD` for aggregate fields, and `FIELD` (unique count) for object fields. Note: we currently work with aggregated data, so the `stddev_FIELD` features can be obtained from the train file.
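A hedged sketch of steps 1 and 4, assuming the datafile has been loaded into a pandas DataFrame; the function name and signature are illustrative, not the file's actual interface:

```python
import pandas as pd

def build_features(df, identity_field, aggregate_fields, object_fields,
                   entry_limit=0):
    """Filtering plus aggregate/object feature computation (time-series
    features such as LIFETIME and the IAT family are omitted here)."""
    # Filtering: drop identities with fewer than entry_limit entries
    counts = df.groupby(identity_field).size()
    keep = counts[counts >= entry_limit].index
    df = df[df[identity_field].isin(keep)]

    grouped = df.groupby(identity_field)
    features = pd.DataFrame(index=keep)
    features["COUNT"] = grouped.size()
    for field in aggregate_fields:
        features[field] = grouped[field].sum()              # FIELD (summed)
        features["stddev_" + field] = grouped[field].std()  # stddev_FIELD
    for field in object_fields:
        features[field] = grouped[field].nunique()          # FIELD (unique count)
    return features
```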
There are two stages to run the code: Data Preparation and Run LookOut.

**Data Preparation**

Call create_files.py to get a readable csv file for a particular team, venue, product and date. Read here for the argument specification.

Note: you will have to set the file path `HIST_DATA_DIR` to a suitable location to access the raw files.

```
python create_files.py -t <team_id> -y <year> -m <month> -d <day> [args list]
```

Call extract.py to further filter out unwanted columns. Read here for the argument specification.

Note: this step can be skipped if no columns have to be deleted.

```
python extract.py -f <datafile> -t <targetfile> [args list]
```

e.g. to extract certain columns from the data:

```
python extract.py -f <datafile> -t <targetfile> -i ['orders', 'cancels', 'trades', 'buyShares', 'sellShares']
```

Modify feature_file.py to the appropriate specification. The current default values should work fine.
```python
identity_field = 'ts_epoch'
identity_is_time = True
entry_limit = 0          # The minimum entries that must exist per identity item (set to 0 to disable)
time_series_data = False # Calculates lifetimes and IAT data (should be False if only one entry per identity)
timestamp_field = 'TIMESTAMP' # Used only if time_series_data is set to True
aggregate_fields = ['orders', 'cancels', 'trades', 'buyShares', 'sellShares', 'buyTradeShares',
                    'sellTradeShares', 'buyNotional', 'sellNotional', 'buyTradeNotional',
                    'sellTradeNotional', 'alters', 'selfTradePrevention']
object_fields = []
norm_field = 'orders'
outlier_list = []
```
**Run LookOut**

Run the LookOut algorithm on the files generated in the Data Preparation stage, with the configurations specified in feature_file.py. Read here for the argument specification.

```
python test.py -f <filename> -t <trainfile> -b <budget> -n <number> [-mrg | -if | -dict] (-if recommended) [args list]
```

The -f (filename) and -t (trainfile) arguments must be specified by the user.

The -b (budget) and -n (number of outliers) arguments default to 3 and 10 respectively; they don't have to be specified, but it is good practice to declare them.

It is recommended to use -if (iForests) for the best results when calculating global outliers. Look here for more information.

The outputs of the algorithm are written to the Plots folder. There will be three types of files here:
Let us look at some LookOut-n-b-i(-train).png files with n = 6 and b = 3:

| ![]() | ![]() | ![]() |
|---|---|---|
| LookOut-6-3-0.png | LookOut-6-3-1.png | LookOut-6-3-2.png |

| ![]() | ![]() | ![]() |
|---|---|---|
| LookOut-6-3-0-train.png | LookOut-6-3-1-train.png | LookOut-6-3-2-train.png |