In this notebook, we will explore the impact of different ways of creating machine learning datasets.
Repeatability is important in machine learning. If you do the same thing now and 5 minutes from now and get different answers, then it makes experimentation difficult. In other words, you will find it difficult to gauge whether a change you made has resulted in an improvement or not.
pip install --upgrade google-cloud-bigquery[bqstorage,pandas]
Requirement already satisfied: google-cloud-bigquery[bqstorage,pandas] in /opt/conda/lib/python3.10/site-packages (3.18.0)
Collecting google-cloud-bigquery[bqstorage,pandas]
Downloading google_cloud_bigquery-3.19.0-py2.py3-none-any.whl.metadata (8.9 kB)
Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1 in /opt/conda/lib/python3.10/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-cloud-bigquery[bqstorage,pandas]) (1.34.1)
Requirement already satisfied: google-auth<3.0.0dev,>=2.14.1 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.28.1)
Requirement already satisfied: google-cloud-core<3.0.0dev,>=1.6.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.4.1)
Requirement already satisfied: google-resumable-media<3.0dev,>=0.6.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.7.0)
Requirement already satisfied: packaging>=20.0.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (23.2)
Requirement already satisfied: python-dateutil<3.0dev,>=2.7.2 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.9.0)
Requirement already satisfied: requests<3.0.0dev,>=2.21.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.31.0)
Requirement already satisfied: pandas>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.0.3)
Requirement already satisfied: pyarrow>=3.0.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (15.0.0)
Requirement already satisfied: db-dtypes<2.0.0dev,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.2.0)
Requirement already satisfied: google-cloud-bigquery-storage<3.0.0dev,>=2.6.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.24.0)
Requirement already satisfied: grpcio<2.0dev,>=1.47.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.62.0)
Requirement already satisfied: numpy>=1.16.6 in /opt/conda/lib/python3.10/site-packages (from db-dtypes<2.0.0dev,>=0.3.0->google-cloud-bigquery[bqstorage,pandas]) (1.25.2)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.56.2 in /opt/conda/lib/python3.10/site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-cloud-bigquery[bqstorage,pandas]) (1.62.0)
Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<4.0.0dev,>=3.19.5 in /opt/conda/lib/python3.10/site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-cloud-bigquery[bqstorage,pandas]) (3.20.3)
Requirement already satisfied: grpcio-status<2.0dev,>=1.33.2 in /opt/conda/lib/python3.10/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-cloud-bigquery[bqstorage,pandas]) (1.48.2)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.0dev,>=2.14.1->google-cloud-bigquery[bqstorage,pandas]) (5.3.3)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.0dev,>=2.14.1->google-cloud-bigquery[bqstorage,pandas]) (0.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.10/site-packages (from google-auth<3.0.0dev,>=2.14.1->google-cloud-bigquery[bqstorage,pandas]) (4.9)
Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.0 in /opt/conda/lib/python3.10/site-packages (from google-cloud-bigquery-storage<3.0.0dev,>=2.6.0->google-cloud-bigquery[bqstorage,pandas]) (1.23.0)
Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /opt/conda/lib/python3.10/site-packages (from google-resumable-media<3.0dev,>=0.6.0->google-cloud-bigquery[bqstorage,pandas]) (1.5.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas>=1.1.0->google-cloud-bigquery[bqstorage,pandas]) (2024.1)
Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.10/site-packages (from pandas>=1.1.0->google-cloud-bigquery[bqstorage,pandas]) (2024.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil<3.0dev,>=2.7.2->google-cloud-bigquery[bqstorage,pandas]) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0dev,>=2.21.0->google-cloud-bigquery[bqstorage,pandas]) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0dev,>=2.21.0->google-cloud-bigquery[bqstorage,pandas]) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0dev,>=2.21.0->google-cloud-bigquery[bqstorage,pandas]) (1.26.18)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests<3.0.0dev,>=2.21.0->google-cloud-bigquery[bqstorage,pandas]) (2024.2.2)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /opt/conda/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3.0.0dev,>=2.14.1->google-cloud-bigquery[bqstorage,pandas]) (0.5.1)
Downloading google_cloud_bigquery-3.19.0-py2.py3-none-any.whl (232 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0mm
[?25hInstalling collected packages: google-cloud-bigquery
Attempting uninstall: google-cloud-bigquery
Found existing installation: google-cloud-bigquery 3.18.0
Uninstalling google-cloud-bigquery-3.18.0:
Successfully uninstalled google-cloud-bigquery-3.18.0
Successfully installed google-cloud-bigquery-3.19.0
Note: you may need to restart the kernel to use updated packages.
from google.cloud import bigquery
The dataset that we will use is a BigQuery dataset of airline arrival data.
We want to predict the arrival delay of an airline based on the departure delay. The model that we will use is a zero-bias linear model: $$ delay_{arrival} = \alpha * delay_{departure} $$
To train the model is to estimate a good value for $\alpha$.
One approach to estimate alpha is to use this formula: $$ \alpha = \frac{\sum delay_{departure} delay_{arrival} }{ \sum delay_{departure}^2 } $$ Because we'd like to capture the idea that this relationship is different for flights from New York to Los Angeles vs. flights from Austin to Indianapolis (shorter flight, less busy airports), we'd compute a different $alpha$ for each airport-pair. For simplicity, we'll do this model only for flights between Denver and Los Angeles.
compute_alpha = """
#standardSQL
SELECT
SAFE_DIVIDE(SUM(arrdelay * depdelay), SUM(depdelay * depdelay)) AS alpha
FROM
(
SELECT RAND() AS splitfield,
arrdelay,
depdelay
FROM
`cloud-training-demos.airline_ontime_data.flights`
WHERE
Origin = 'DEN' AND Dest = 'LAX'
)
WHERE
splitfield < 0.8
"""
results = bigquery.Client().query(compute_alpha).to_dataframe()
alpha = results['alpha'][0]
print(alpha)
0.9798858028919109
compute_rmse = """
#standardSQL
SELECT
dataset,
SQRT(AVG((arrdelay - ALPHA * depdelay)*(arrdelay - ALPHA * depdelay))) AS rmse,
COUNT(arrdelay) AS num_flights
FROM (
SELECT
IF (RAND() < 0.8, 'train', 'eval') AS dataset,
arrdelay,
depdelay
FROM
`cloud-training-demos.airline_ontime_data.flights`
WHERE
Origin = 'DEN'
AND Dest = 'LAX' )
GROUP BY
dataset
"""
bigquery.Client().query(compute_rmse.replace('ALPHA', str(alpha))).to_dataframe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
dataset | rmse | num_flights | |
---|---|---|---|
0 | train | 13.376918 | 159994 |
1 | eval | 13.713605 | 39865 |
Hint:
Although the calculations are now correct, the experiment is still not repeatable.
Try running it several times; do you get the same answer?
train_and_eval_rand = """
#standardSQL
WITH
alldata AS (
SELECT
IF (RAND() < 0.8,
'train',
'eval') AS dataset,
arrdelay,
depdelay
FROM
`cloud-training-demos.airline_ontime_data.flights`
WHERE
Origin = 'DEN'
AND Dest = 'LAX' ),
training AS (
SELECT
SAFE_DIVIDE( SUM(arrdelay * depdelay) , SUM(depdelay * depdelay)) AS alpha
FROM
alldata
WHERE
dataset = 'train' )
SELECT
MAX(alpha) AS alpha,
dataset,
SQRT(AVG((arrdelay - alpha * depdelay)*(arrdelay - alpha * depdelay))) AS rmse,
COUNT(arrdelay) AS num_flights
FROM
alldata,
training
GROUP BY
dataset
"""
bigquery.Client().query(train_and_eval_rand).to_dataframe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
alpha | dataset | rmse | num_flights | |
---|---|---|---|---|
0 | 0.979285 | eval | 13.446975 | 40143 |
1 | 0.979285 | train | 13.444396 | 159716 |
Let’s split by date and train.
compute_alpha = """
#standardSQL
SELECT
SAFE_DIVIDE(SUM(arrdelay * depdelay), SUM(depdelay * depdelay)) AS alpha
FROM
`cloud-training-demos.airline_ontime_data.flights`
WHERE
Origin = 'DEN' AND Dest = 'LAX'
AND ABS(MOD(FARM_FINGERPRINT(CAST(CURRENT_DATE() AS STRING)), 10)) < 8
"""
results = bigquery.Client().query(compute_alpha).to_dataframe()
alpha = results['alpha'][0]
print(alpha)
0.9795488325223913
We can now use the alpha to compute RMSE. Because the alpha value is repeatable, we don’t need to worry that the alpha in the compute_rmse will be different from the alpha computed in the compute_alpha.
compute_rmse = """
#standardSQL
SELECT
IF(ABS(MOD(FARM_FINGERPRINT(CAST(CURRENT_DATE() AS STRING)), 10)) < 8, 'train', 'eval') AS dataset,
SQRT(AVG((arrdelay - ALPHA * depdelay)*(arrdelay - ALPHA * depdelay))) AS rmse,
COUNT(arrdelay) AS num_flights
FROM
`cloud-training-demos.airline_ontime_data.flights`
WHERE
Origin = 'DEN'
AND Dest = 'LAX'
GROUP BY
dataset
"""
print(bigquery.Client().query(compute_rmse.replace('ALPHA', str(alpha))).to_dataframe().head())
dataset rmse num_flights
0 train 13.444838 199859
Note also that the RMSE on the evaluation dataset more from the RMSE on the training dataset when we do the split correctly. This should be expected; in the RAND() case, there was leakage between training and evaluation datasets, because there is high correlation between flights on the same day.
This is one of the biggest dangers with doing machine learning splits the wrong way -- you will develop a false sense of confidence in how good your model is!
Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.