Vlookups are a very powerful excel formula and they are used a lot to blend & match data in excel.
In this example I am going to show you how you can do a Vlookup in python in a few lines of code.
Why?
Excel vlookups over complex strings and large datasets are very compute intensive and can leave you waiting for a long time for a calculation to complete.
Here are a few examples within SEO tasks where you may run into issues doing lookups in excel.
- Checking what domains within a backlink profile are in common with a competitor
- Looking up information at a URL level
Vlookup in python using pandas
We are going to use pandas in order to complete our vlookup. Pandas is a very useful library to know for any data manipulation tasks, it has simple notation, tons of built in functionality and is scalable across very large datasets.
I have created a dummy dataset with a very simple example of two sheets where a vlookup needs to occur. Before continuing please download the repo from my Github.
Loading excel into a pandas dataframe
Pandas has a built in functionality to read an excel file and create a dataframe. This is essentially an excel sheet that can be used in python
# Read an excel with two sheets into two dataframes
def load_data(name):
df, df1 = pd.read_excel(name, sheet_name="1"), pd.read_excel(name, sheet_name="2")
return df, df1
Pandas has a function called merge which we will be using to merge our data together. We’d like our merge data to appear to the right of our master table which in my example is Sheet1 so therefore we will use the left join functionality. More info on this below.
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
Type of merge to be performed.
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
These are other types of join available which may come in useful another time.
def vlookup(df, df1):
# Merge category data by URL
df2 = df.merge(df1, on='URL', how='left')
df2.to_excel('new.xlsx', sheet_name="3")
return True
The above is the function that completes the vlookup. Pandas is merging everything found in matching URL keys between Sheet1 & Sheet2.
How simple is that!
Vlookup to check if value exists in another column
Something I find myself doing a lot is checking whether or not a value exists within another column. I also created a function which instead returns true of false.
# Check if a value is present within another column
def present_in(df, df1):
# This function gets applied to each value within
def is_in_column(data, values):
if data in values:
return True
else:
return False
df['Present_in_other'] = df['URL'].apply(is_in_column, values=df1['URL'].tolist())
This using pandas apply functionality to check for each value if that value exists within corresponding checked column.
Vlookup Speed Test
In order to demonstrate the difference in speed of doing a vlookup conventionally in excel vs using a script on larger data sources I have built a function to build a large dummy dataset in order to perform a test.
def create_dummy_data(): words_1 = [[''.join(random.choices(string.ascii_uppercase, k=5)),"sheet1"] for i in range(0,500000)] words_2 = [[''.join(random.choices(string.ascii_uppercase, k=5)),"sheet2"] for i in range(0,500000)] # Append lists to URL column in df df = pd.DataFrame(words_1, columns=['URL', 'Item1'] ,index=None) df1 = pd.DataFrame(words_2, columns=['URL','Item2'], index=None) # Save to excel so you can benchmark on your machine writer = pd.ExcelWriter('vloop_test_data.xlsx', engine='xlsxwriter') df.to_excel(writer, sheet_name='1') df1.to_excel(writer, sheet_name='2') writer.save()
Python: 29 seconds
Excel: 45minutes
Here are the core stats of my PC;
- 2.2 GHz Intel Core i7
- 16GB 1600MHZ Ram
Nothing super fancy but excel will calculate across all 8 core (4 physical, 4 virtual) and probably reflects the speed of a typical work laptop pretty well.
As you can see it’s almost 100 times quicker to run this Vlookup across half a million rows than doing so in excel and you don’t have to worry about copying to values once you’ve completed the calculation either.
Leave a Reply