My Exploratory Data Analysis for Microsoft Malware Prediction

Competition Information
This is the second Microsoft Malware Hosted competition on Kaggle. For more detials, please see here.
The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.

Goals
The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft’s endpoint protection solution, Windows Defender. Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.

About This Kernel
In this kernal, I would like to recored my personal EDA for Microsoft Malware competition.

Imported Packages

1
2
3
4
5
6
7
8
9
10
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import time
import sys
import datetime

plt.style.use('seaborn')
pd.set_option('display.max_columns', 500)

Data Types

This part was taken from the helpful kernel.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float16',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int8',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float32',
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float16',
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float16',
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float32',
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float32',
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float32',
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
1
2
%time train_data = pd.read_csv("data/train.csv", dtype=dtypes)
%time test_data = pd.read_csv("data/test.csv", dtype=dtypes)
Wall time: 2min 7s
Wall time: 1min 51s

Exploratory Data Analysis

1
train_data.head(20)

‘HasDetections’ Distribustion

1
train_data['HasDetections'].value_counts()
0    4462591
1    4458892
Name: HasDetections, dtype: int64

Missing Vaules

Missing Values in Training Data

1
2
3
4
total = train_data.isnull().sum().sort_values(ascending = False)
percent = (train_data.isnull().sum()/train_data.isnull().count()*100).sort_values(ascending = False)
missing_train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head(50)
Total Percent
PuaMode 8919174 99.974119
Census_ProcessorClass 8884852 99.589407
DefaultBrowsersIdentifier 8488045 95.141637
Census_IsFlightingInternal 7408759 83.044030
Census_InternalBatteryType 6338429 71.046809
Census_ThresholdOptIn 5667325 63.524472
Census_IsWIMBootEnabled 5659703 63.439038
SmartScreen 3177011 35.610795
OrganizationIdentifier 2751518 30.841487
SMode 537759 6.027686
CityIdentifier 325409 3.647477
Wdft_IsGamer 303451 3.401352
Wdft_RegionIdentifier 303451 3.401352
Census_InternalBatteryNumberOfCharges 268755 3.012448
Census_FirmwareManufacturerIdentifier 183257 2.054109
Census_IsFlightsDisabled 160523 1.799286
Census_FirmwareVersionIdentifier 160133 1.794915
Census_OEMModelIdentifier 102233 1.145919
Census_OEMNameIdentifier 95478 1.070203
Firewall 91350 1.023933
Census_TotalPhysicalRAM 80533 0.902686
Census_IsAlwaysOnAlwaysConnectedCapable 71343 0.799676
Census_OSInstallLanguageIdentifier 60084 0.673475
IeVerIdentifier 58894 0.660137
Census_PrimaryDiskTotalCapacity 53016 0.594251
Census_SystemVolumeTotalCapacity 53002 0.594094
Census_InternalPrimaryDiagonalDisplaySizeInInches 47134 0.528320
Census_InternalPrimaryDisplayResolutionHorizontal 46986 0.526661
Census_InternalPrimaryDisplayResolutionVertical 46986 0.526661
Census_ProcessorModelIdentifier 41343 0.463410
Census_ProcessorManufacturerIdentifier 41313 0.463073
Census_ProcessorCoreCount 41306 0.462995
AVProductsEnabled 36221 0.405998
AVProductsInstalled 36221 0.405998
AVProductStatesIdentifier 36221 0.405998
IsProtected 36044 0.404014
RtpStateBitfield 32318 0.362249
Census_IsVirtualDevice 15953 0.178816
Census_PrimaryDiskTypeName 12844 0.143967
UacLuaenable 10838 0.121482
Census_ChassisTypeName 623 0.006983
GeoNameIdentifier 213 0.002387
Census_PowerPlatformRoleName 55 0.000616
OsBuildLab 21 0.000235
LocaleEnglishNameIdentifier 0 0.000000
AvSigVersion 0 0.000000
OsPlatformSubRelease 0 0.000000
Processor 0 0.000000
OsVer 0 0.000000
AppVersion 0 0.000000

Missing Values in Testing Data

1
2
3
4
total = test_data.isnull().sum().sort_values(ascending = False)
percent = (test_data.isnull().sum()/test_data.isnull().count()*100).sort_values(ascending = False)
missing_test_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_test_data.head(50)
Total Percent
PuaMode 7851065 99.972139
Census_ProcessorClass 7835022 99.767854
DefaultBrowsersIdentifier 7546134 96.089277
Census_IsFlightingInternal 6673962 84.983408
Census_InternalBatteryType 5979844 76.144803
SMode 5831272 74.252950
Census_ThresholdOptIn 5529515 70.410504
Census_IsWIMBootEnabled 5522707 70.323814
SmartScreen 3498402 44.547170
OrganizationIdentifier 2482129 31.606380
Wdft_IsGamer 302857 3.856453
Wdft_RegionIdentifier 302857 3.856453
Census_InternalBatteryNumberOfCharges 239233 3.046292
Census_FirmwareManufacturerIdentifier 198634 2.529321
CityIdentifier 191962 2.444363
Census_FirmwareVersionIdentifier 177624 2.261789
Census_IsFlightsDisabled 125801 1.601897
Census_OEMModelIdentifier 95935 1.221596
Census_TotalPhysicalRAM 95051 1.210339
Census_IsAlwaysOnAlwaysConnectedCapable 90152 1.147957
Census_OEMNameIdentifier 89546 1.140241
Census_PrimaryDiskTotalCapacity 74701 0.951211
Census_SystemVolumeTotalCapacity 74690 0.951071
Census_ProcessorModelIdentifier 61314 0.780747
Census_ProcessorManufacturerIdentifier 61281 0.780326
Census_ProcessorCoreCount 61277 0.780275
Census_OSInstallLanguageIdentifier 58743 0.748009
Firewall 58472 0.744558
IeVerIdentifier 49796 0.634081
Census_InternalPrimaryDiagonalDisplaySizeInInches 42610 0.542578
Census_InternalPrimaryDisplayResolutionVertical 42511 0.541317
Census_InternalPrimaryDisplayResolutionHorizontal 42511 0.541317
RtpStateBitfield 32222 0.410301
AVProductStatesIdentifier 23767 0.302639
AVProductsEnabled 23767 0.302639
AVProductsInstalled 23767 0.302639
IsProtected 23649 0.301136
Census_IsVirtualDevice 17291 0.220176
UacLuaenable 7865 0.100150
Census_PrimaryDiskTypeName 6807 0.086677
Census_ChassisTypeName 341 0.004342
GeoNameIdentifier 147 0.001872
Census_PowerPlatformRoleName 26 0.000331
OsBuildLab 17 0.000216
Census_OSEdition 5 0.000064
Census_GenuineStateName 1 0.000013
Platform 0 0.000000
CountryIdentifier 0 0.000000
Census_IsTouchEnabled 0 0.000000
LocaleEnglishNameIdentifier 0 0.000000

Unique Column Values

In this section, I would like to know the values of each column. Specifically, we would like to know the unique values and if there is any value that only exists on the training set.

1
2
training_data_columns = list(train_data.columns)
testing_data_columns = list(test_data.columns)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
unique_val = pd.DataFrame(index = [], columns = ['TotalUniqueCount','SharedUniqueCount','TrainUniqueCount','OnlyInTrain','TestUniqueCount', 'OnlyInTest'])
for column in testing_data_columns[1:]:
unique_val_train = train_data[column].unique()
unique_val_test = test_data[column].unique()

only_in_training = list(set(unique_val_train) - set(unique_val_test))
only_in_testing = list(set(unique_val_test) - set(unique_val_train))

total_unique_val_count = len(set(unique_val_train) | set(unique_val_test))
shared_unique_val_count = len(set(unique_val_train) & set(unique_val_test))

train_unique_val_count = len(unique_val_train)
train_only_unique_val_count = len(only_in_training)

test_unique_val_count = len(unique_val_test)
test_only_unique_val_count = len(only_in_testing)

unique_val.loc[column] = [total_unique_val_count, shared_unique_val_count,
train_unique_val_count, train_only_unique_val_count,
test_unique_val_count, test_only_unique_val_count]

# print(f'{column}:')
# print(f'----------------')
# print(f'Total unique count: {total_unique_count}')
# print(f'Train unique count: {train_unique_val_count}')
# print(f'Test unique count: {test_unique_val_count}')
# print(f'Only in train: {only_in_training[0:5]}{"..." if len(only_in_training) > 5 else ""}')
# print(f'Only in test: {only_in_testing[0:5]}{"..." if len(only_in_testing) > 5 else ""}')
# print('=================================================================')
# print()

unique_val
TotalUniqueCount SharedUniqueCount TrainUniqueCount OnlyInTrain TestUniqueCount OnlyInTest
ProductName 6 6 6 0 6 0
EngineVersion 74 66 70 4 70 4
AppVersion 124 106 110 4 120 14
AvSigVersion 9623 8265 8531 266 9357 1092
IsBeta 2 2 2 0 2 0
RtpStateBitfield 10 7 8 1 9 2
IsSxsPassiveMode 2 2 2 0 2 0
DefaultBrowsersIdentifier 2179 1101 1731 630 1549 448
AVProductStatesIdentifier 39833 12631 28971 16340 23493 10862
AVProductsInstalled 10 6 9 3 7 1
AVProductsEnabled 8 6 7 1 7 1
HasTpm 2 2 2 0 2 0
CountryIdentifier 222 222 222 0 222 0
CityIdentifier 130525 82660 107367 24707 105818 23158
OrganizationIdentifier 53 48 50 2 51 3
GeoNameIdentifier 297 286 293 7 290 4
LocaleEnglishNameIdentifier 254 251 252 1 253 2
Platform 4 4 4 0 4 0
Processor 3 3 3 0 3 0
OsVer 69 33 58 25 44 11
OsBuild 101 53 76 23 78 25
OsSuite 14 13 14 1 13 0
OsPlatformSubRelease 9 9 9 0 9 0
OsBuildLab 776 562 664 102 674 112
SkuEdition 8 8 8 0 8 0
IsProtected 4 2 3 1 3 1
AutoSampleOptIn 2 2 2 0 2 0
PuaMode 3 3 3 0 3 0
SMode 4 2 3 1 3 1
IeVerIdentifier 352 247 304 57 295 48
... ... ... ... ... ... ...
Census_InternalBatteryType 93 50 79 29 64 14
Census_InternalBatteryNumberOfCharges 52838 24610 41088 16478 36360 11750
Census_OSVersion 579 365 469 104 475 110
Census_OSArchitecture 3 3 3 0 3 0
Census_OSBranch 40 21 32 11 29 8
Census_OSBuildNumber 216 105 165 60 156 51
Census_OSBuildRevision 325 254 285 31 294 40
Census_OSEdition 40 30 33 3 37 7
Census_OSSkuName 32 29 30 1 31 2
Census_OSInstallTypeName 9 9 9 0 9 0
Census_OSInstallLanguageIdentifier 41 39 40 1 40 1
Census_OSUILocaleIdentifier 151 135 147 12 139 4
Census_OSWUAutoUpdateOptionsName 6 6 6 0 6 0
Census_IsPortableOperatingSystem 2 2 2 0 2 0
Census_GenuineStateName 6 5 5 0 6 1
Census_ActivationChannel 6 6 6 0 6 0
Census_IsFlightingInternal 4 2 3 1 3 1
Census_IsFlightsDisabled 4 2 3 1 3 1
Census_FlightRing 11 10 10 0 11 1
Census_ThresholdOptIn 4 2 3 1 3 1
Census_FirmwareManufacturerIdentifier 861 575 713 138 723 148
Census_FirmwareVersionIdentifier 58959 41348 50495 9147 49812 8464
Census_IsSecureBootEnabled 2 2 2 0 2 0
Census_IsWIMBootEnabled 4 2 3 1 3 1
Census_IsVirtualDevice 4 2 3 1 3 1
Census_IsTouchEnabled 2 2 2 0 2 0
Census_IsPenCapable 2 2 2 0 2 0
Census_IsAlwaysOnAlwaysConnectedCapable 4 2 3 1 3 1
Wdft_IsGamer 4 2 3 1 3 1
Wdft_RegionIdentifier 17 15 16 1 16 1

81 rows × 6 columns