Working with the seaborn Python library: visualizing foot traffic based on big data of MAC addresses

Last week I was given a small, fun task by P-Sense, a wifi tracking company in Hong Kong: creating some data visualizations (bar or line graphs) of hourly foot traffic on one floor of a building, based on a huge amount of MAC address data collected from the APs in the building. After consulting a friend about the data visualization part, I decided to give the ‘seaborn’ Python library a try.

The data set looked something like this in .csv (the data below has been made up)

My general steps for composing the script were:

  1. Split the time code to get the hour. For example, since splitting the string ‘01:23:45’ on the colon delimiter gives ‘01’ as its first element, any row whose time field starts with ‘01’ is data from 1 am.
  2. Add MAC addresses to an array as the script runs through the rows.
  3. Once the hour part of the time changes, count the number of elements in the MAC address array, and map this number to the hour that has just ended.
  4. Empty the MAC address array.
  5. Repeat.
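
import numpy as np
import seaborn as sns
import pandas as pd
import netaddr
from matplotlib import pyplot as plt
import csv
import re

times = []           # hour labels, one per recorded hour
uniqueMACcount = []  # number of unique MAC addresses counted in each hour
# Hour currently being counted. Starting at '0' makes the first row trigger a
# change of hour, which also records a leading hour-0 entry with a count of 0.
currentcount = '0'
uniqueMAC = []       # unique MAC addresses seen so far in the current hour


# convert the times: '01:23:45' -> '01'
def split_by_hour(h):
    h = re.split('[:]', h)
    return h[0]


with open("seaborn.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    data = [r for r in reader]


for element in data:
    # element[1] is the time field, element[2] is the MAC address
    if currentcount != split_by_hour(element[1]):
        # the hour has changed: record the hour that just ended, then reset
        times.append(currentcount)
        uniqueMACcount.append(len(uniqueMAC))
        currentcount = split_by_hour(element[1])
        uniqueMAC = []

    # checked after the reset so the first row of each hour is counted too
    if element[2] not in uniqueMAC:
        try:
            # keep only universal (0b10 bit clear) and unicast (0b01 bit clear) addresses
            first_octet = netaddr.EUI(element[2]).words[0]
            if (first_octet & 0b10) == 0 and (first_octet & 0b01) == 0:
                uniqueMAC.append(element[2])
        except netaddr.core.AddrFormatError:
            pass  # skip invalid MAC addresses

# the final hour never triggers a change of hour, so record it here
times.append(currentcount)
uniqueMACcount.append(len(uniqueMAC))

times = list(map(int, times))

df = pd.DataFrame(dict(hour=times, count=uniqueMACcount))
print(df.head(15))
# factorplot was renamed catplot, and size renamed height, in seaborn 0.9
sns_plot = sns.catplot(x="hour", y="count", data=df, kind="bar", height=6, aspect=2, legend_out=False)
sns_plot.savefig("output.png")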

Before jumping into composing the entire script, I have a habit of testing the assumptions behind the general idea to see if there are potential problems I might come across later (I will be writing in another post about some tips and systematic approaches to testing these assumptions). One fault I noticed in the general steps is that the count for an hour is only recorded once a change of hour is detected. The final hour in the data is never followed by a change of hour, so its MAC addresses are collected but never counted and mapped to that hour.

With more time, I could probably have thought of a more elegant way to approach this problem, but given the time constraint I went for a quick solution: appending the length of the MAC address array one more time, outside the loop.
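One possible alternative, sketched here purely for illustration and not what the script in this post does, is to group MAC addresses into one set per hour. With that structure there is no "change of hour" event to detect, so the final hour needs no special handling. The function name `count_unique_per_hour` and the sample rows are made up for this example.

```python
from collections import defaultdict


def count_unique_per_hour(rows):
    """Map each hour string to the number of distinct MAC addresses seen in it.

    `rows` is an iterable of (time, mac) pairs such as ("01:23:45", "00:11:...").
    """
    macs_by_hour = defaultdict(set)
    for time_str, mac in rows:
        hour = time_str.split(":")[0]
        macs_by_hour[hour].add(mac)  # a set silently drops duplicates
    return {hour: len(macs) for hour, macs in macs_by_hour.items()}


# Made-up rows, for illustration only
sample_rows = [
    ("18:05:10", "00:11:22:33:44:55"),
    ("18:40:02", "00:11:22:33:44:55"),  # same device again in the same hour
    ("18:59:59", "00:11:22:33:44:66"),
    ("19:00:01", "00:11:22:33:44:55"),
]
hourly_counts = count_unique_per_hour(sample_rows)
print(hourly_counts)
```

A side effect of this design is that the rows no longer have to be sorted by time, since each row is filed under its own hour regardless of where it appears in the file.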

Another major roadblock was that it was my first time working with the seaborn library, and I wasn’t sure how to manually supply filtered data (most basic tutorials online assumed the data was already clean within the csv file). What data structure could I use? What parameters did I need to supply? I found out that in this case I would have to collect an array of hours and an array of unique MAC address counts, so that those two arrays could be mapped to the X and Y axes. Hence my global variables were the empty arrays ‘times’ to collect hours, ‘uniqueMACcount’ to collect the total number of unique MAC addresses within each hour, and ‘uniqueMAC’ to store the strings of all unique MAC addresses in the current hour, plus a string variable ‘currentcount’ initialized to ‘0’, so that a change of hour can be detected by comparing what’s read from the file with what’s in the variable.

Looking at the script, you may be wondering what the try-except and the netaddr library are doing there in the for loop. After I ran an older version of the script, the seaborn chart showed more than 10,000 MAC addresses in the peak hour, which was odd and impossible given the size of the floor. It turned out that the collected data contained universal and local, unicast and multicast MAC addresses, so the addresses needed to be filtered down to universal, unicast ones in order to count unique devices communicating with the routers. In layman’s terms, the logic of the loop is: “For each row that belongs to the hour currently being counted, if the MAC address of that row has not been seen yet within that hour, check whether the least significant and second least significant bits of its first octet (the first two hex characters) are both 0. The least significant bit marks a multicast address and the second least significant bit marks a locally administered one, so if both are 0, it is the unique universal + unicast MAC address we are looking for.” The exception block was added because there were strange, invalid MAC addresses in the data, which were not within the scope of investigation for this task.
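To make the bit check concrete, here is a small standalone sketch of the same test using only the standard library instead of netaddr. The helper name `is_universal_unicast` and the sample addresses are made up for this example, and the sketch assumes MAC addresses written as 12 hex digits with optional ‘:’, ‘-’ or ‘.’ separators. Anything else is treated as invalid and rejected, which plays the role of the exception block in the script.

```python
import re

_HEX12 = re.compile(r"[0-9A-Fa-f]{12}")


def is_universal_unicast(mac):
    """Return True if `mac` is a valid, universally administered, unicast address."""
    digits = re.sub(r"[:\-.]", "", mac.strip())  # drop common separators
    if not _HEX12.fullmatch(digits):
        return False  # malformed address: skip it
    first_octet = int(digits[:2], 16)
    is_multicast = bool(first_octet & 0b01)  # least significant bit
    is_local = bool(first_octet & 0b10)      # second least significant bit
    return not is_multicast and not is_local


# Made-up addresses, for illustration only
for mac in ["00:1A:2B:3C:4D:5E", "02:00:00:00:00:01",
            "01:00:5E:00:00:FB", "not-a-mac"]:
    print(mac, is_universal_unicast(mac))
```

Only the first address passes: the second has the locally administered bit set, the third has the multicast bit set, and the last one is not a MAC address at all.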

Outside the for loop, the final hour (in this case, the 19th hour) and its count of unique MAC addresses are appended to the corresponding arrays. The line ‘times = list(map(int, times))’ just converts the string array to an integer array, because seaborn can only sort the X-axis in ascending/descending order with number types. The last 4 lines of the script basically turn the arrays into a pandas DataFrame, since seaborn takes pandas DataFrames as parameters, then draw the chart and save the figure.
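The following sketch shows why the conversion to integers matters and how two plain lists can end up as the X and Y columns of a chart. The helper names `to_plot_columns` and `plot_hourly_counts` and the sample numbers are made up for this example, and the plotting call assumes seaborn 0.9 or later, where the function is called `catplot`. The plotting libraries are imported inside the plotting function only so that the first part can be tried without them installed.

```python
def to_plot_columns(times, counts):
    """Pair hour strings with counts and return integer-sorted plot columns."""
    pairs = sorted(zip((int(t) for t in times), counts))
    return {"hour": [h for h, _ in pairs], "count": [c for _, c in pairs]}


def plot_hourly_counts(columns, path="output.png"):
    """Draw a bar chart of count per hour and save it to `path`."""
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame(columns)  # one column per axis
    grid = sns.catplot(data=df, x="hour", y="count", kind="bar",
                       height=6, aspect=2)
    grid.savefig(path)
    return df


# Strings compare character by character, so '10' sorts before '9'
print(sorted(["9", "10", "19"]))

# Made-up hours and counts, for illustration only
columns = to_plot_columns(["9", "10", "19"], [12, 30, 7])
print(columns)
```

Calling `plot_hourly_counts(columns)` would then write the chart to ‘output.png’, with the hours in numeric order along the X-axis.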

stine


r&d blog on architecture, software engineering and inspirations