Description
Subtask of #355
We plan to build a machine learning model based on users' GPS trace data. This issue records some experiments and proofs of concept for understanding the problem space.
I have done several experiments to get familiar with ML; here I record three of them which I feel are highly related:
- New York City Taxi Trip Duration from Kaggle
- Flight Delay Estimation (gcloud)
- Flight Delay Estimation (open source stack)
Data and characteristics determine the upper limit of machine learning, and models and algorithms just approach this upper limit.
New York City Taxi Trip Duration from Kaggle
- Keyword: XGBoost, PCA, data visualization, osrm
- Popular notebooks: NYC Taxi EDA - Update: The fast & the curious, Strength of visualization-python visuals tutorial, From EDA to the Top (LB 0.367)
- My experiment: From EDA to the Top (LB 0.367), Strength of visualization-python visuals tutorial
- Summary:
- Kaggle's solutions are good for inspiration and hands-on experimentation but are far from production. There are certain patterns in Kaggle competitions: most Kaggle winners use XGBoost, or artificial neural networks for unstructured data.
- But it helps me to think like an applied machine learning engineer.
- Kaggle provides a convenient environment for ML: the python notebook provided by the website helps generate live statistics, and we could also download the docker image and deploy it on another cloud (Kaggle Python docker image, hub, instruction)
- Background: discussion in OSRM's community
Data source
https://www.kaggle.com/c/nyc-taxi-trip-duration/data
| id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|
| id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
- No GPS traces
- The scenario is fixed to NYC, meaning both the training data and the test data are in NYC
- 1,458,644 trip records in train.csv and 625,134 trip records in test.csv (see the loading sketch below)
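A minimal loading sketch with pandas, assuming train.csv and test.csv have been downloaded from the Kaggle page above:

```python
import pandas as pd

# columns match the sample table above; test.csv lacks dropoff_datetime and trip_duration
train = pd.read_csv('train.csv', parse_dates=['pickup_datetime', 'dropoff_datetime'])
test = pd.read_csv('test.csv', parse_dates=['pickup_datetime'])
print(train.shape)  # (1458644, 11)
print(test.shape)   # (625134, 9)
```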
OSRM features
id | total_distance (m) | total_travel_time (s) | number_of_steps |
---|---|---|---|
id2875421 | 2009.1 | 164.9 | 5 |
id2377394 | 2513.2 | 332.0 | 6 |
id3504673 | 1779.4 | 235.8 | 4 |
- The OSRM route is calculated from the origin/destination points, and generates the distance, duration, and number of steps that represent the route (see the sketch below)
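For illustration, a minimal sketch of fetching these features from an OSRM HTTP server; the public demo server is used here, while the Kaggle dataset above was precomputed offline:

```python
import requests

# origin/destination of trip id2875421 above; OSRM expects lon,lat order
url = ('https://router.project-osrm.org/route/v1/driving/'
       '-73.982155,40.767937;-73.964630,40.765602')
route = requests.get(url, params={'steps': 'true'}).json()['routes'][0]
print(route['distance'])               # total_distance in meters
print(route['duration'])               # total_travel_time in seconds
print(len(route['legs'][0]['steps']))  # number_of_steps
```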
Weather feature
I think the weather features were crawled from an open data website; you can find the related data for this Kaggle competition here. For more information, go to here -> 6.1 Weather reports
Feature extraction
- PCA to transform longitude and latitude, which helps decision tree splits (see the sketch below)
- Distance
- Normalize
- Datetime
- Speed
- Clustering orig and dest
- Temporal and geospatial aggregation
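A minimal sketch of the PCA coordinate transform, following the approach in the notebooks above (column names follow the data table; `train` is the dataframe loaded earlier):

```python
import numpy as np
from sklearn.decomposition import PCA

# fit PCA on all pickup/dropoff coordinates; the rotated axes roughly align
# with Manhattan's street grid, which gives decision trees cleaner splits
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values))
pca = PCA().fit(coords)
train[['pickup_pca0', 'pickup_pca1']] = pca.transform(
    train[['pickup_latitude', 'pickup_longitude']])
train[['dropoff_pca0', 'dropoff_pca1']] = pca.transform(
    train[['dropoff_latitude', 'dropoff_longitude']])
```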
Training
XGBoost
```python
import xgboost as xgb

# dtrain is an xgb.DMatrix of the engineered features;
# watchlist = [(dtrain, 'train'), (dvalid, 'valid')] drives early stopping
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster': 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}
model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)
```
Parameter Tuning
Most of the parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), it has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. See XGBoost Parameters.
Try with different parameters
```python
# build a grid of parameter combinations to sample from at random
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5, 2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA,
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA,
                                         'nthread': -1, 'booster': 'gbtree', 'eval_metric': 'rmse',
                                         'silent': 1, 'objective': 'reg:linear'})
```
Exhaustively searching this grid (5 × 3 × 3 × 5 × 5 × 5 = 5,625 combinations) takes an extremely large amount of resources and time, so only a random sample of it is tried (see the sketch below).
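A minimal sketch of the random sampling step, reusing `dtrain` and `watchlist` from above; the budget of 10 draws is illustrative:

```python
import numpy as np

# train on randomly chosen parameter sets and keep the best validation score
best_score, best_pars = float('inf'), None
for _ in range(10):  # illustrative budget out of the 5,625 grid entries
    pars = xgb_pars[np.random.randint(len(xgb_pars))]
    model = xgb.train(pars, dtrain, 60, watchlist,
                      early_stopping_rounds=50, verbose_eval=False)
    if model.best_score < best_score:
        best_score, best_pars = model.best_score, pars
print('best rmse:', best_score, best_pars)
```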
Cross Validation
http://blog.mrtz.org/2015/03/09/competition.html
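The link above explains why repeatedly scoring against a fixed holdout set can overfit the leaderboard; k-fold cross validation on the training set is the safer way to compare parameter sets. A minimal sketch with XGBoost's built-in CV (my addition, not from the kernel):

```python
# 5-fold CV reports per-round train/test RMSE; compare parameter
# sets by the best mean test RMSE instead of a single holdout score
cv_results = xgb.cv(xgb_pars[0], dtrain, num_boost_round=60, nfold=5,
                    metrics='rmse', early_stopping_rounds=50, seed=42)
print(cv_results['test-rmse-mean'].min())
```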
Flight Delay Estimation (gcloud)
- Keyword: SparkML, Logistic Regression, Tensorflow, Wide-and-Deep, Cloud Dataproc
- My experiment: notes
- Summary:
- `Cloud Dataproc` is easy to develop on and easy to scale. It launches pre-built container images which contain tensorflow, python3, etc.
- Google's `pub/sub` system can simulate live streaming with batch data. `Dataflow`, `Cloud Bigtable`, and `Data Studio` help a lot with building a streaming system, which will be discussed more in Streaming experiment for ETA service #357
- During testing, we use batch data (like one month's flight data) as input to the machine learning pipeline
- In a live streaming system, `apache beam` is used to aggregate data from `pub/sub` -> record results as `csv` -> load data into `cloud bigtable` -> trigger training with a checkpoint; more info in Streaming experiment for ETA service #357 and in the pipeline sketch below
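For illustration, a minimal sketch of the first two stages with the Beam Python SDK; the topic name and output path are hypothetical, and a production job would also need triggers for windowed file writes:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# read flight events from pub/sub, window them, and record windowed CSV shards
opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/flights')  # hypothetical
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))  # 60-second windows
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/flights/out',  # hypothetical path
                                      file_name_suffix='.csv'))
```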
Input Data
```
|summary| FL_DATE|UNIQUE_CARRIER| AIRLINE_ID|CARRIER| FL_NUM| ORIGIN_AIRPORT_ID|ORIGIN_AIRPORT_SEQ_ID|ORIGIN_CITY_MARKET_ID|ORIGIN| DEST_AIRPORT_ID|DEST_AIRPORT_SEQ_ID|DEST_CITY_MARKET_ID|DEST| CRS_DEP_TIME| DEP_TIME| DEP_DELAY| TAXI_OUT| WHEELS_OFF| WHEELS_ON| TAXI_IN| CRS_ARR_TIME| ARR_TIME| ARR_DELAY| CANCELLED|CANCELLATION_CODE| DIVERTED| DISTANCE| DEP_AIRPORT_LAT| DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET| ARR_AIRPORT_LAT| ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|EVENT|NOTIFY_TIME|
```
- Explanation of attributes: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- Always keep the separation of data storage and computation in mind
- Massive amount of data, and each record has a fixed origin and destination
```
y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
// the machine learning algorithm predicts the probability that the flight is on time
```
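Equivalently, as a labeling function over the schema above (my sketch; the column name comes from the BTS attribute list):

```python
# label is 1 when the flight is on time (arrival delay under 15 minutes)
def label(row):
    return int(float(row['ARR_DELAY']) < 15)
```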
Logistic Regression via Spark
After recording all the data into `csv`, we can load the data into a `dataframe` or `rdd` (difference), then generate a `dataframe` containing the results of feature engineering, then call train:
```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# udf maps each row to a LabeledPoint(label, features)
examples = traindata.rdd.map(udf)
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
```
prediction
```python
from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(0.7)   # decision threshold, matching the evaluation below
lrmodel.predict(features)   # vector of independent features for one flight
```
evaluation
```python
def eval(labelpred):
    # labelpred: RDD of (label, predicted probability) pairs; 0.7 is the decision threshold
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    cancel_denom = cancel.count()
    nocancel_denom = nocancel.count()
    # guard against division by zero when a bucket is empty
    if cancel_denom == 0:
        cancel_denom = 1
    if nocancel_denom == 0:
        nocancel_denom = 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel) / cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel) / nocancel_denom}
```
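A usage sketch (my addition; `testdata` is assumed to be an RDD of LabeledPoint). Clearing the threshold makes `predict` return raw probabilities, which is what `eval` expects:

```python
lrmodel.clearThreshold()  # predict() now returns probabilities instead of 0/1
labelpred = testdata.map(lambda p: (p.label, lrmodel.predict(p.features)))
print(eval(labelpred))
```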
Tensorflow via Cloud Dataproc
Flight Delay Estimation (open source stack)
- Keyword: SparkML, Scikit-Learn, MongoDB, Kafka
- My experiment: note
- Summary:
- During development, I built all dependencies and the python connectors into docker images
- For the development stage, you need to configure each docker image and make sure the dependencies work together
- To scale, you need docker orchestration tools such as K8S
- If the environment is set up well, developing locally is similar to developing on a public cloud like gcloud or aws, but harder to manage