BigQuery is a fast, cheap and generally excellent platform for querying and visualising data. It has a simple connector into Data Studio, which is great for reporting as it's very shareable.
BigQuery has its limits, though, and one of them is getting data into it. Uploads made directly through the web interface are capped at 10 MB. You can import from Cloud Storage instead, but you might hit data format issues, e.g. commas inside fields being parsed incorrectly by BigQuery.
In that case we need to use the API and do a streaming insert.
I am a big advocate of pandas for handling large CSVs, and thankfully there is a library, pandas-gbq, that handles this and dumps DataFrames into BigQuery: https://github.com/pydata/pandas-gbq
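To illustrate the comma issue, here is a minimal sketch (with made-up data) showing that pandas respects standard CSV quoting, so a comma inside a quoted field stays in one column instead of splitting the row:

import io
import pandas as pd

# Hypothetical CSV with a quoted comma inside the name field
raw = io.StringIO('id,name\n1,"Smith, John"\n')
df = pd.read_csv(raw)
print(df.loc[0, 'name'])  # -> Smith, John (one field, not two)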
Step 1) Read the CSV
import pandas as pd

df = pd.read_csv('path_to_csv', encoding='utf-8')
Replace 'path_to_csv' with the relative or absolute path to your file.
Set the project id of the project you are going to import into; you can get this from the URL of the BigQuery dashboard, after this string: ?project=
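For example, if your dashboard URL ends in ?project=my-gcp-project (a made-up id), you would set:

projectid = 'my-gcp-project'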
Step 2) Call the to_gbq function
Finally, set the dataset and the table name in the to_gbq function:
to_gbq(df, 'datasetname.tablename', projectid, chunksize=10000, if_exists='append')
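The chunksize argument controls how many rows are streamed per API call (10,000 here), and if_exists='append' adds the rows to the table if it already exists; the other options are 'fail' and 'replace'.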
There you have it: a simple way to import data into BigQuery.
Check out the repo: https://github.com/pydata/pandas-gbq
Step 3) Bringing it all together
import pandas as pd
from pandas.io.gbq import read_gbq, to_gbq
import sqlite3

projectid = 'PROJECT_ID'         # your GCP project id
dataset_table = 'DATASET.TABLE'  # destination dataset and table in BigQuery

# Option 1: load the whole CSV into memory and insert into BigQuery
df = pd.read_csv('filename.csv', encoding='utf-8')
to_gbq(df, dataset_table, projectid, chunksize=10000, if_exists='append')

# Option 2: pull rows from a local SQLite table in 500k-row chunks
# and stream each chunk to BigQuery 10k rows at a time
con = sqlite3.connect('database.db')  # SQLite file holding TABLE_NAME
for chunk in pd.read_sql_query("SELECT * FROM TABLE_NAME", con=con, chunksize=500000):
    to_gbq(chunk, dataset_table, projectid, chunksize=10000, if_exists='append')
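As a quick sanity check, you can use read_gbq (imported above) to confirm the rows arrived; this sketch assumes the same placeholder dataset and table:

# Count the rows that landed in BigQuery
result = read_gbq('SELECT COUNT(*) AS row_count FROM DATASET.TABLE', project_id=projectid)
print(result)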