Description
Subtask of #355
We plan to build a machine learning model based on users' GPS trace data. This issue records some experiments and proofs of concept for understanding the problem space.
I have done several experiments to get familiar with ML; here I record three of them which I feel are highly related:
- New York City Taxi Trip Duration from Kaggle
- Flight Delay Estimation (gcloud)
- Flight Delay Estimation (open source stack)
Data and characteristics determine the upper limit of machine learning, and models and algorithms just approach this upper limit.
New York City Taxi Trip Duration from Kaggle
- Keyword: XGBoost, PCA, data visualization, osrm
- Popular notebooks: NYC Taxi EDA - Update: The fast & the curious, Strength of visualization-python visuals tutorial, From EDA to the Top (LB 0.367)
- My experiment: From EDA to the Top (LB 0.367), Strength of visualization-python visuals tutorial
- Summary:
- Kaggle's solutions are good for inspiration and hands-on experimentation but are far from production. There are certain patterns in Kaggle competitions: most Kaggle winners use XGBoost, or artificial neural networks for unstructured data.
- But it helps me to think like an applied machine learning engineer.
- Kaggle provides a convenient environment for ML: the python notebook provided by the website helps generate live statistics, and we could also download the docker image and deploy it on another cloud (Kaggle Python docker image, hub, instruction)
- Background: discussion in OSRM's community
Data source
https://www.kaggle.com/c/nyc-taxi-trip-duration/data
| id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|
| id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
- No GPS traces
- The scenario is fixed to NYC, meaning both the training data and the test data are in NYC
- 1,458,644 trip records in train.csv and 625,134 trip records in test.csv (see the loading sketch below)
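A minimal loading sketch with pandas, assuming train.csv and test.csv have been downloaded from the Kaggle page above:

```python
import pandas as pd

# columns match the sample table above; test.csv lacks dropoff_datetime and trip_duration
train = pd.read_csv('train.csv', parse_dates=['pickup_datetime', 'dropoff_datetime'])
test = pd.read_csv('test.csv', parse_dates=['pickup_datetime'])
print(train.shape)  # (1458644, 11)
print(test.shape)   # (625134, 9)
```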
OSRM features
id | total_distance (m) | total_travel_time (s) | number_of_steps |
---|---|---|---|
id2875421 | 2009.1 | 164.9 | 5 |
id2377394 | 2513.2 | 332.0 | 6 |
id3504673 | 1779.4 | 235.8 | 4 |
- The OSRM route is calculated from the origin/destination points, and generates the distance, duration, and number of steps that represent the route (see the sketch below)
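For illustration, a minimal sketch of fetching these features from an OSRM HTTP server; the public demo server is used here, while the Kaggle dataset above was precomputed offline:

```python
import requests

# origin/destination of trip id2875421 above; OSRM expects lon,lat order
url = ('https://router.project-osrm.org/route/v1/driving/'
       '-73.982155,40.767937;-73.964630,40.765602')
route = requests.get(url, params={'steps': 'true'}).json()['routes'][0]
print(route['distance'])               # total_distance in meters
print(route['duration'])               # total_travel_time in seconds
print(len(route['legs'][0]['steps']))  # number_of_steps
```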
Weather feature
I think the weather features were crawled from an open data website; you can find the related data for this Kaggle competition here. For more information, go to here -> 6.1 Weather reports
Feature extraction
- PCA to transform longitude and latitude, which helps decision tree splits (see the sketch below)
- Distance
- Normalize
- Datetime
- Speed
- Clustering orig and dest
- Temporal and geospatial aggregation
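A minimal sketch of the PCA coordinate transform, following the approach in the notebooks above (column names follow the data table; `train` is the dataframe loaded earlier):

```python
import numpy as np
from sklearn.decomposition import PCA

# fit PCA on all pickup/dropoff coordinates; the rotated axes roughly align
# with Manhattan's street grid, which gives decision trees cleaner splits
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values))
pca = PCA().fit(coords)
train[['pickup_pca0', 'pickup_pca1']] = pca.transform(
    train[['pickup_latitude', 'pickup_longitude']])
train[['dropoff_pca0', 'dropoff_pca1']] = pca.transform(
    train[['dropoff_latitude', 'dropoff_longitude']])
```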
Training
XGBoost
```python
import xgboost as xgb

# dtrain is an xgb.DMatrix of the engineered features;
# watchlist = [(dtrain, 'train'), (dvalid, 'valid')] drives early stopping
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster': 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}
model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)
```
Parameter Tuning
Most of the parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), it has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. See XGBoost Parameters.
Try with different parameters
```python
# build a grid of parameter combinations to sample from at random
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5, 2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA,
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA,
                                         'nthread': -1, 'booster': 'gbtree', 'eval_metric': 'rmse',
                                         'silent': 1, 'objective': 'reg:linear'})
```
Exhaustively searching this grid (5 × 3 × 3 × 5 × 5 × 5 = 5,625 combinations) takes an extremely large amount of resources and time, so only a random sample of it is tried (see the sketch below).
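A minimal sketch of the random sampling step, reusing `dtrain` and `watchlist` from above; the budget of 10 draws is illustrative:

```python
import numpy as np

# train on randomly chosen parameter sets and keep the best validation score
best_score, best_pars = float('inf'), None
for _ in range(10):  # illustrative budget out of the 5,625 grid entries
    pars = xgb_pars[np.random.randint(len(xgb_pars))]
    model = xgb.train(pars, dtrain, 60, watchlist,
                      early_stopping_rounds=50, verbose_eval=False)
    if model.best_score < best_score:
        best_score, best_pars = model.best_score, pars
print('best rmse:', best_score, best_pars)
```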
Cross Validation
http://blog.mrtz.org/2015/03/09/competition.html
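The link above explains why repeatedly scoring against a fixed holdout set can overfit the leaderboard; k-fold cross validation on the training set is the safer way to compare parameter sets. A minimal sketch with XGBoost's built-in CV (my addition, not from the kernel):

```python
# 5-fold CV reports per-round train/test RMSE; compare parameter
# sets by the best mean test RMSE instead of a single holdout score
cv_results = xgb.cv(xgb_pars[0], dtrain, num_boost_round=60, nfold=5,
                    metrics='rmse', early_stopping_rounds=50, seed=42)
print(cv_results['test-rmse-mean'].min())
```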
Flight Delay Estimation (gcloud)
- Keyword: SparkML, Logistic Regression, Tensorflow, Wide-and-Deep, Cloud Dataproc
- My experiment: notes
- Summary:
- `Cloud Dataproc` is easy to develop on and easy to scale. It launches pre-built container images which contain tensorflow, python3, etc.
- Google's `pub/sub` system can simulate live streaming with batch data. `Dataflow`, `Cloud Bigtable`, and `Data Studio` help a lot with building a streaming system, which will be discussed more in Streaming experiment for ETA service #357
- During testing, we use batch data (like one month's flight data) as input to the machine learning pipeline
- In a live streaming system, `apache beam` is used to aggregate data from `pub/sub` -> record results as `csv` -> load data into `cloud bigtable` -> trigger training with a checkpoint; more info in Streaming experiment for ETA service #357 and in the pipeline sketch below
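For illustration, a minimal sketch of the first two stages with the Beam Python SDK; the topic name and output path are hypothetical, and a production job would also need triggers for windowed file writes:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# read flight events from pub/sub, window them, and record windowed CSV shards
opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/flights')  # hypothetical
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))  # 60-second windows
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/flights/out',  # hypothetical path
                                      file_name_suffix='.csv'))
```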
Input Data
```
|summary| FL_DATE|UNIQUE_CARRIER| AIRLINE_ID|CARRIER| FL_NUM| ORIGIN_AIRPORT_ID|ORIGIN_AIRPORT_SEQ_ID|ORIGIN_CITY_MARKET_ID|ORIGIN| DEST_AIRPORT_ID|DEST_AIRPORT_SEQ_ID|DEST_CITY_MARKET_ID|DEST| CRS_DEP_TIME| DEP_TIME| DEP_DELAY| TAXI_OUT| WHEELS_OFF| WHEELS_ON| TAXI_IN| CRS_ARR_TIME| ARR_TIME| ARR_DELAY| CANCELLED|CANCELLATION_CODE| DIVERTED| DISTANCE| DEP_AIRPORT_LAT| DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET| ARR_AIRPORT_LAT| ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|EVENT|NOTIFY_TIME|
```
- Explanation of attributes: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- Always keep the separation of data storage and computation in mind
- Massive amount of data, and each record has a fixed origin and destination
```
y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
// the machine learning algorithm predicts the probability that the flight is on time
```
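Equivalently, as a labeling function over the schema above (my sketch; the column name comes from the BTS attribute list):

```python
# label is 1 when the flight is on time (arrival delay under 15 minutes)
def label(row):
    return int(float(row['ARR_DELAY']) < 15)
```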
Logistic Regression via Spark
After recording all the data into `csv`, we can load the data into a `dataframe` or `rdd` (difference), then generate a `dataframe` containing the results of feature engineering, then call train:
```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# udf maps each row to a LabeledPoint(label, features)
examples = traindata.rdd.map(udf)
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
```
prediction
```python
from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(0.7)   # decision threshold, matching the evaluation below
lrmodel.predict(features)   # vector of independent features for one flight
```
evaluation
```python
def eval(labelpred):
    # labelpred: RDD of (label, predicted probability) pairs; 0.7 is the decision threshold
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    cancel_denom = cancel.count()
    nocancel_denom = nocancel.count()
    # guard against division by zero when a bucket is empty
    if cancel_denom == 0:
        cancel_denom = 1
    if nocancel_denom == 0:
        nocancel_denom = 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel) / cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel) / nocancel_denom}
```
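A usage sketch (my addition; `testdata` is assumed to be an RDD of LabeledPoint). Clearing the threshold makes `predict` return raw probabilities, which is what `eval` expects:

```python
lrmodel.clearThreshold()  # predict() now returns probabilities instead of 0/1
labelpred = testdata.map(lambda p: (p.label, lrmodel.predict(p.features)))
print(eval(labelpred))
```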
Tensorflow via Cloud Dataproc
Flight Delay Estimation (open source stack)
- Keyword: SparkML, Scikit-Learn, MongoDB, Kafka
- My experiment: note
- Summary:
- During development, I built all dependencies and the python connectors into docker images
- For the development stage, you need to configure each docker image and make sure the dependencies work together
- To scale, you need docker orchestration tools such as K8S
- If the environment is set up well, developing locally is similar to developing on a public cloud like gcloud or aws, but harder to manage