Competition Information
This is the second Microsoft Malware Hosted competition on Kaggle. For more detials, please see here.
The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.
Goals
The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft’s endpoint protection solution, Windows Defender. Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.
About This Kernel
In this kernal, I would like to recored my personal EDA for Microsoft Malware competition.
Imported Packages
1 | import numpy as np |
Data Types
This part was taken from the helpful kernel.
1 | dtypes = { |
1 | %time train_data = pd.read_csv("data/train.csv", dtype=dtypes) |
Wall time: 2min 7s
Wall time: 1min 51s
Exploratory Data Analysis
1 | train_data.head(20) |
‘HasDetections’ Distribustion
1 | train_data['HasDetections'].value_counts() |
0 4462591
1 4458892
Name: HasDetections, dtype: int64
Missing Vaules
Missing Values in Training Data
1 | total = train_data.isnull().sum().sort_values(ascending = False) |
Total | Percent | |
---|---|---|
PuaMode | 8919174 | 99.974119 |
Census_ProcessorClass | 8884852 | 99.589407 |
DefaultBrowsersIdentifier | 8488045 | 95.141637 |
Census_IsFlightingInternal | 7408759 | 83.044030 |
Census_InternalBatteryType | 6338429 | 71.046809 |
Census_ThresholdOptIn | 5667325 | 63.524472 |
Census_IsWIMBootEnabled | 5659703 | 63.439038 |
SmartScreen | 3177011 | 35.610795 |
OrganizationIdentifier | 2751518 | 30.841487 |
SMode | 537759 | 6.027686 |
CityIdentifier | 325409 | 3.647477 |
Wdft_IsGamer | 303451 | 3.401352 |
Wdft_RegionIdentifier | 303451 | 3.401352 |
Census_InternalBatteryNumberOfCharges | 268755 | 3.012448 |
Census_FirmwareManufacturerIdentifier | 183257 | 2.054109 |
Census_IsFlightsDisabled | 160523 | 1.799286 |
Census_FirmwareVersionIdentifier | 160133 | 1.794915 |
Census_OEMModelIdentifier | 102233 | 1.145919 |
Census_OEMNameIdentifier | 95478 | 1.070203 |
Firewall | 91350 | 1.023933 |
Census_TotalPhysicalRAM | 80533 | 0.902686 |
Census_IsAlwaysOnAlwaysConnectedCapable | 71343 | 0.799676 |
Census_OSInstallLanguageIdentifier | 60084 | 0.673475 |
IeVerIdentifier | 58894 | 0.660137 |
Census_PrimaryDiskTotalCapacity | 53016 | 0.594251 |
Census_SystemVolumeTotalCapacity | 53002 | 0.594094 |
Census_InternalPrimaryDiagonalDisplaySizeInInches | 47134 | 0.528320 |
Census_InternalPrimaryDisplayResolutionHorizontal | 46986 | 0.526661 |
Census_InternalPrimaryDisplayResolutionVertical | 46986 | 0.526661 |
Census_ProcessorModelIdentifier | 41343 | 0.463410 |
Census_ProcessorManufacturerIdentifier | 41313 | 0.463073 |
Census_ProcessorCoreCount | 41306 | 0.462995 |
AVProductsEnabled | 36221 | 0.405998 |
AVProductsInstalled | 36221 | 0.405998 |
AVProductStatesIdentifier | 36221 | 0.405998 |
IsProtected | 36044 | 0.404014 |
RtpStateBitfield | 32318 | 0.362249 |
Census_IsVirtualDevice | 15953 | 0.178816 |
Census_PrimaryDiskTypeName | 12844 | 0.143967 |
UacLuaenable | 10838 | 0.121482 |
Census_ChassisTypeName | 623 | 0.006983 |
GeoNameIdentifier | 213 | 0.002387 |
Census_PowerPlatformRoleName | 55 | 0.000616 |
OsBuildLab | 21 | 0.000235 |
LocaleEnglishNameIdentifier | 0 | 0.000000 |
AvSigVersion | 0 | 0.000000 |
OsPlatformSubRelease | 0 | 0.000000 |
Processor | 0 | 0.000000 |
OsVer | 0 | 0.000000 |
AppVersion | 0 | 0.000000 |
Missing Values in Testing Data
1 | total = test_data.isnull().sum().sort_values(ascending = False) |
Total | Percent | |
---|---|---|
PuaMode | 7851065 | 99.972139 |
Census_ProcessorClass | 7835022 | 99.767854 |
DefaultBrowsersIdentifier | 7546134 | 96.089277 |
Census_IsFlightingInternal | 6673962 | 84.983408 |
Census_InternalBatteryType | 5979844 | 76.144803 |
SMode | 5831272 | 74.252950 |
Census_ThresholdOptIn | 5529515 | 70.410504 |
Census_IsWIMBootEnabled | 5522707 | 70.323814 |
SmartScreen | 3498402 | 44.547170 |
OrganizationIdentifier | 2482129 | 31.606380 |
Wdft_IsGamer | 302857 | 3.856453 |
Wdft_RegionIdentifier | 302857 | 3.856453 |
Census_InternalBatteryNumberOfCharges | 239233 | 3.046292 |
Census_FirmwareManufacturerIdentifier | 198634 | 2.529321 |
CityIdentifier | 191962 | 2.444363 |
Census_FirmwareVersionIdentifier | 177624 | 2.261789 |
Census_IsFlightsDisabled | 125801 | 1.601897 |
Census_OEMModelIdentifier | 95935 | 1.221596 |
Census_TotalPhysicalRAM | 95051 | 1.210339 |
Census_IsAlwaysOnAlwaysConnectedCapable | 90152 | 1.147957 |
Census_OEMNameIdentifier | 89546 | 1.140241 |
Census_PrimaryDiskTotalCapacity | 74701 | 0.951211 |
Census_SystemVolumeTotalCapacity | 74690 | 0.951071 |
Census_ProcessorModelIdentifier | 61314 | 0.780747 |
Census_ProcessorManufacturerIdentifier | 61281 | 0.780326 |
Census_ProcessorCoreCount | 61277 | 0.780275 |
Census_OSInstallLanguageIdentifier | 58743 | 0.748009 |
Firewall | 58472 | 0.744558 |
IeVerIdentifier | 49796 | 0.634081 |
Census_InternalPrimaryDiagonalDisplaySizeInInches | 42610 | 0.542578 |
Census_InternalPrimaryDisplayResolutionVertical | 42511 | 0.541317 |
Census_InternalPrimaryDisplayResolutionHorizontal | 42511 | 0.541317 |
RtpStateBitfield | 32222 | 0.410301 |
AVProductStatesIdentifier | 23767 | 0.302639 |
AVProductsEnabled | 23767 | 0.302639 |
AVProductsInstalled | 23767 | 0.302639 |
IsProtected | 23649 | 0.301136 |
Census_IsVirtualDevice | 17291 | 0.220176 |
UacLuaenable | 7865 | 0.100150 |
Census_PrimaryDiskTypeName | 6807 | 0.086677 |
Census_ChassisTypeName | 341 | 0.004342 |
GeoNameIdentifier | 147 | 0.001872 |
Census_PowerPlatformRoleName | 26 | 0.000331 |
OsBuildLab | 17 | 0.000216 |
Census_OSEdition | 5 | 0.000064 |
Census_GenuineStateName | 1 | 0.000013 |
Platform | 0 | 0.000000 |
CountryIdentifier | 0 | 0.000000 |
Census_IsTouchEnabled | 0 | 0.000000 |
LocaleEnglishNameIdentifier | 0 | 0.000000 |
Unique Column Values
In this section, I would like to know the values of each column. Specifically, we would like to know the unique values and if there is any value that only exists on the training set.
1 | training_data_columns = list(train_data.columns) |
1 | unique_val = pd.DataFrame(index = [], columns = ['TotalUniqueCount','SharedUniqueCount','TrainUniqueCount','OnlyInTrain','TestUniqueCount', 'OnlyInTest']) |
TotalUniqueCount | SharedUniqueCount | TrainUniqueCount | OnlyInTrain | TestUniqueCount | OnlyInTest | |
---|---|---|---|---|---|---|
ProductName | 6 | 6 | 6 | 0 | 6 | 0 |
EngineVersion | 74 | 66 | 70 | 4 | 70 | 4 |
AppVersion | 124 | 106 | 110 | 4 | 120 | 14 |
AvSigVersion | 9623 | 8265 | 8531 | 266 | 9357 | 1092 |
IsBeta | 2 | 2 | 2 | 0 | 2 | 0 |
RtpStateBitfield | 10 | 7 | 8 | 1 | 9 | 2 |
IsSxsPassiveMode | 2 | 2 | 2 | 0 | 2 | 0 |
DefaultBrowsersIdentifier | 2179 | 1101 | 1731 | 630 | 1549 | 448 |
AVProductStatesIdentifier | 39833 | 12631 | 28971 | 16340 | 23493 | 10862 |
AVProductsInstalled | 10 | 6 | 9 | 3 | 7 | 1 |
AVProductsEnabled | 8 | 6 | 7 | 1 | 7 | 1 |
HasTpm | 2 | 2 | 2 | 0 | 2 | 0 |
CountryIdentifier | 222 | 222 | 222 | 0 | 222 | 0 |
CityIdentifier | 130525 | 82660 | 107367 | 24707 | 105818 | 23158 |
OrganizationIdentifier | 53 | 48 | 50 | 2 | 51 | 3 |
GeoNameIdentifier | 297 | 286 | 293 | 7 | 290 | 4 |
LocaleEnglishNameIdentifier | 254 | 251 | 252 | 1 | 253 | 2 |
Platform | 4 | 4 | 4 | 0 | 4 | 0 |
Processor | 3 | 3 | 3 | 0 | 3 | 0 |
OsVer | 69 | 33 | 58 | 25 | 44 | 11 |
OsBuild | 101 | 53 | 76 | 23 | 78 | 25 |
OsSuite | 14 | 13 | 14 | 1 | 13 | 0 |
OsPlatformSubRelease | 9 | 9 | 9 | 0 | 9 | 0 |
OsBuildLab | 776 | 562 | 664 | 102 | 674 | 112 |
SkuEdition | 8 | 8 | 8 | 0 | 8 | 0 |
IsProtected | 4 | 2 | 3 | 1 | 3 | 1 |
AutoSampleOptIn | 2 | 2 | 2 | 0 | 2 | 0 |
PuaMode | 3 | 3 | 3 | 0 | 3 | 0 |
SMode | 4 | 2 | 3 | 1 | 3 | 1 |
IeVerIdentifier | 352 | 247 | 304 | 57 | 295 | 48 |
... | ... | ... | ... | ... | ... | ... |
Census_InternalBatteryType | 93 | 50 | 79 | 29 | 64 | 14 |
Census_InternalBatteryNumberOfCharges | 52838 | 24610 | 41088 | 16478 | 36360 | 11750 |
Census_OSVersion | 579 | 365 | 469 | 104 | 475 | 110 |
Census_OSArchitecture | 3 | 3 | 3 | 0 | 3 | 0 |
Census_OSBranch | 40 | 21 | 32 | 11 | 29 | 8 |
Census_OSBuildNumber | 216 | 105 | 165 | 60 | 156 | 51 |
Census_OSBuildRevision | 325 | 254 | 285 | 31 | 294 | 40 |
Census_OSEdition | 40 | 30 | 33 | 3 | 37 | 7 |
Census_OSSkuName | 32 | 29 | 30 | 1 | 31 | 2 |
Census_OSInstallTypeName | 9 | 9 | 9 | 0 | 9 | 0 |
Census_OSInstallLanguageIdentifier | 41 | 39 | 40 | 1 | 40 | 1 |
Census_OSUILocaleIdentifier | 151 | 135 | 147 | 12 | 139 | 4 |
Census_OSWUAutoUpdateOptionsName | 6 | 6 | 6 | 0 | 6 | 0 |
Census_IsPortableOperatingSystem | 2 | 2 | 2 | 0 | 2 | 0 |
Census_GenuineStateName | 6 | 5 | 5 | 0 | 6 | 1 |
Census_ActivationChannel | 6 | 6 | 6 | 0 | 6 | 0 |
Census_IsFlightingInternal | 4 | 2 | 3 | 1 | 3 | 1 |
Census_IsFlightsDisabled | 4 | 2 | 3 | 1 | 3 | 1 |
Census_FlightRing | 11 | 10 | 10 | 0 | 11 | 1 |
Census_ThresholdOptIn | 4 | 2 | 3 | 1 | 3 | 1 |
Census_FirmwareManufacturerIdentifier | 861 | 575 | 713 | 138 | 723 | 148 |
Census_FirmwareVersionIdentifier | 58959 | 41348 | 50495 | 9147 | 49812 | 8464 |
Census_IsSecureBootEnabled | 2 | 2 | 2 | 0 | 2 | 0 |
Census_IsWIMBootEnabled | 4 | 2 | 3 | 1 | 3 | 1 |
Census_IsVirtualDevice | 4 | 2 | 3 | 1 | 3 | 1 |
Census_IsTouchEnabled | 2 | 2 | 2 | 0 | 2 | 0 |
Census_IsPenCapable | 2 | 2 | 2 | 0 | 2 | 0 |
Census_IsAlwaysOnAlwaysConnectedCapable | 4 | 2 | 3 | 1 | 3 | 1 |
Wdft_IsGamer | 4 | 2 | 3 | 1 | 3 | 1 |
Wdft_RegionIdentifier | 17 | 15 | 16 | 1 | 16 | 1 |
81 rows × 6 columns