Big query is a really good, cheap, quick (insert superlative) platform to query and visualise data. It has a simple connector into DataStudio which is great to use in reporting as it’s very sharable.
Big Query has its limits and one of them is getting data into it. To upload directly within the interface there is a 10mb limit. You can import from Cloud Storage but you might find issues with data format. E.g. commas in fields being read incorrectly by Big Query.
In this case we need to use the API and do a streaming insert.
I am a big advocate of Pandas for handling large CSV’s and thankfully there is a library that handles this and dumps dataframes into Big Query. https://github.com/pydata/pandas-gbq
Step 1) Read the CSV
df = pd.read_csv('path_to_csv', encoding='utf-8')
Set the projectid of your project you are going to import into, you can get this from the URL of the big query dashboard after this string: ?project=
Step 2) Call the to_gbp function
Finally set the dataset and the table name in the to_gbq function
to_gbq(df, 'datasetname.tablename', projectid ,chunksize=10000, if_exists='append')
There you have a it a simple way to import data into Big Query
Checkout the repo here!
Step 3) Bringing it all together
import pandas as pd
from pandas.io.gbq import read_gbq, to_gbq
projectid = PROJECT_ID
dataset_table = DATASET.TABLE
# Load all to memory and insert into Big Query
df = pd.read_csv('filename.csv', encoding='utf-8')
to_gbq(df, dataset_table, projectid ,chunksize=10000, if_exists='append')
# Read CSV in chunks (500k rows) and upload 10k at a time
for chunk in pd.read_sql_query("SELECT * FROM TABLE_NAME", con=con, chunksize=500000):
df = chunk
to_gbq(df, dataset_table, projectid, chunksize=10000, if_exists='append')