
Machine Learning Experiment for ETA service  #356

@CodeBear801

Description


Subtask of #355

We plan to build a machine learning model based on users' GPS trace data. Here I record some experiments and proofs of concept done to understand the problem space.

I have done several experiments to get familiar with ML; here I record three of them that I feel are most relevant:

  • New York City Taxi Trip Duration from Kaggle
  • Flight Delay Estimation(gcloud)
  • Flight Delay Estimation(open source stack)

Data and its characteristics determine the upper limit of machine learning; models and algorithms only approach this upper limit.

New York City Taxi Trip Duration from Kaggle

Data source

https://www.kaggle.com/c/nyc-taxi-trip-duration/data

  | id        | vendor_id | pickup_datetime     | dropoff_datetime    | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration
0 | id2875421 | 2         | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1               | -73.982155       | 40.767937       | -73.964630        | 40.765602        | N                  | 455
1 | id2377394 | 1         | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1               | -73.980415       | 40.738564       | -73.999481        | 40.731152        | N                  | 663
  • No GPS traces
  • The scenario is restricted to New York, meaning both training data and test data are in NY
  • 1,458,644 trip records in train.csv and 625,134 trip records in test.csv
    • If we cluster origin and destination points, each cluster pair (orig-destination location pair) is covered by many trip records

OSRM features

id        | total_distance | total_travel_time | number_of_steps
id2875421 | 2009.1         | 164.9             | 5
id2377394 | 2513.2         | 332.0             | 6
id3504673 | 1779.4         | 235.8             | 4
  • The OSRM route is calculated from the orig/dest points and yields a distance, a duration, and a number of steps to represent the route
    • When we have GPS traces, we could use spatial indexing to generate a list of spatial index cells that represent the route (see the sketch after this list)
      more info: Google S2
    • Or we could do map matching: snap the points to a list of navigable edges in the graph, then extract more features. more info
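
A rough sketch of how the per-trip OSRM features and an S2 cell covering of a GPS trace could be produced. The local OSRM endpoint, the s2sphere usage, and all names below are assumptions for illustration, not code from the original experiments:

import requests
import s2sphere

OSRM_URL = 'http://localhost:5000'   # assumed local osrm-backend instance

def osrm_route_features(orig_lon, orig_lat, dest_lon, dest_lat):
    # OSRM HTTP API: distance/duration of the fastest route plus its step count
    url = f'{OSRM_URL}/route/v1/driving/{orig_lon},{orig_lat};{dest_lon},{dest_lat}'
    resp = requests.get(url, params={'overview': 'false', 'steps': 'true'}).json()
    route = resp['routes'][0]
    return {'total_distance': route['distance'],        # metres
            'total_travel_time': route['duration'],     # seconds
            'number_of_steps': len(route['legs'][0]['steps'])}

def s2_cells_for_trace(points, level=16):
    # Represent a GPS trace as the ordered list of S2 cells (at one level) it passes through
    cells = []
    for lat, lng in points:
        cell = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(lat, lng)).parent(level)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return [c.to_token() for c in cells]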

Weather feature

I think the weather features were crawled from an open-data website; you can find related data for this Kaggle competition here. For more information, go here -> 6.1 Weather reports

Feature extraction

  • PCA to transform longitude and latitude, which helps decision-tree splits (see the sketch after this list)
  • Distance
  • Normalize Datetime
  • Speed
  • Clustering orig and dest
  • Temporal and geospatial aggregation
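
A minimal sketch of a few of these transformations, assuming a pandas DataFrame train with the columns shown above; the parameters (number of clusters, PCA setup) are illustrative, not the original kernel's values:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

# PCA on the raw coordinates: axis-aligned splits in the rotated space
# tend to work better for tree-based models
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values))
pca = PCA(n_components=2).fit(coords)
pu = pca.transform(train[['pickup_latitude', 'pickup_longitude']].values)
do = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']].values)
train['pickup_pca0'], train['pickup_pca1'] = pu[:, 0], pu[:, 1]
train['dropoff_pca0'], train['dropoff_pca1'] = do[:, 0], do[:, 1]

# Haversine distance between pickup and dropoff
def haversine(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(d))   # kilometres

train['distance_km'] = haversine(train['pickup_latitude'], train['pickup_longitude'],
                                 train['dropoff_latitude'], train['dropoff_longitude'])

# Normalized datetime features
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
train['hour'] = train['pickup_datetime'].dt.hour
train['weekday'] = train['pickup_datetime'].dt.weekday

# Cluster origins and destinations so every trip gets an orig/dest cluster pair
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(coords)
train['pickup_cluster'] = kmeans.predict(train[['pickup_latitude', 'pickup_longitude']].values)
train['dropoff_cluster'] = kmeans.predict(train[['dropoff_latitude', 'dropoff_longitude']].values)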

Training

XGBoosting

import xgboost as xgb

# dtrain is an xgb.DMatrix built from the engineered features;
# watchlist, e.g. [(dtrain, 'train'), (dvalid, 'valid')], tracks RMSE during training
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'booster' : 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}

model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)

Parameter Tune

Most of the parameters in XGBoost are about the bias-variance tradeoff. When we allow the model to get more complicated (e.g. more depth), it has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit. XGBoost Parameters

Try with different parameters

# enumerate parameter combinations; a random subset will then be sampled and trained
xgb_pars = []
for MCW in [10, 20, 50, 75, 100]:
    for ETA in [0.05, 0.1, 0.15]:
        for CS in [0.3, 0.4, 0.5]:
            for MD in [6, 8, 10, 12, 15]:
                for SS in [0.5, 0.6, 0.7, 0.8, 0.9]:
                    for LAMBDA in [0.5, 1., 1.5,  2., 3.]:
                        xgb_pars.append({'min_child_weight': MCW, 'eta': ETA, 
                                         'colsample_bytree': CS, 'max_depth': MD,
                                         'subsample': SS, 'lambda': LAMBDA, 
                                         'nthread': -1, 'booster' : 'gbtree', 'eval_metric': 'rmse',
                                         'silent': 1, 'objective': 'reg:linear'})

Trying every combination takes an extremely large amount of resources and time.
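
One way to cap the cost, as a sketch reusing the xgb_pars list plus the dtrain/watchlist objects from above, is to sample only a handful of parameter sets at random and train briefly on each:

import random
import xgboost as xgb

best_score, best_pars = float('inf'), None
for pars in random.sample(xgb_pars, 10):       # try only 10 random combinations
    model = xgb.train(pars, dtrain, 60, watchlist,
                      early_stopping_rounds=50, maximize=False, verbose_eval=False)
    if model.best_score < best_score:
        best_score, best_pars = model.best_score, pars
print(best_score, best_pars)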

Cross Validation

http://blog.mrtz.org/2015/03/09/competition.html
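
As a sketch, XGBoost's built-in k-fold cross validation can give an out-of-sample RMSE estimate for one parameter set before committing to it (xgb_pars and dtrain as defined above; the round and fold counts are illustrative):

import xgboost as xgb

cv_result = xgb.cv(xgb_pars, dtrain, num_boost_round=200, nfold=5,
                   metrics='rmse', early_stopping_rounds=20, seed=42)
print(cv_result['test-rmse-mean'].min())   # best cross-validated RMSE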

Flight Delay Estimation(gcloud)

  • Keywords: SparkML, Logistic Regression, TensorFlow, Wide-and-Deep, Cloud Dataproc
  • My experiment: notes
  • Summary:
    • Cloud Dataproc is easy to develop on and easy to scale. It launches a pre-built container image which contains TensorFlow, Python 3, etc.
    • Google's Pub/Sub system can be used to simulate live streaming from batch data
    • Dataflow, Cloud Bigtable, and Data Studio help a lot with building the streaming system, which will be discussed further in Streaming experiment for ETA service #357
    • During testing, we use batch data (like one month's flight data) as input to the machine learning pipeline
    • In the live streaming system, Apache Beam aggregates data from Pub/Sub -> records results as CSV -> loads data into Cloud Bigtable -> triggers training from a checkpoint; more info in Streaming experiment for ETA service #357

Input Data

Columns: FL_DATE, UNIQUE_CARRIER, AIRLINE_ID, CARRIER, FL_NUM, ORIGIN_AIRPORT_ID, ORIGIN_AIRPORT_SEQ_ID, ORIGIN_CITY_MARKET_ID, ORIGIN, DEST_AIRPORT_ID, DEST_AIRPORT_SEQ_ID, DEST_CITY_MARKET_ID, DEST, CRS_DEP_TIME, DEP_TIME, DEP_DELAY, TAXI_OUT, WHEELS_OFF, WHEELS_ON, TAXI_IN, CRS_ARR_TIME, ARR_TIME, ARR_DELAY, CANCELLED, CANCELLATION_CODE, DIVERTED, DISTANCE, DEP_AIRPORT_LAT, DEP_AIRPORT_LON, DEP_AIRPORT_TZOFFSET, ARR_AIRPORT_LAT, ARR_AIRPORT_LON, ARR_AIRPORT_TZOFFSET, EVENT, NOTIFY_TIME

y = 0 if arrival delay >= 15 minutes
y = 1 if arrival delay < 15 minutes
// machine learning algorithm predicts the probability that the flight is on time
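
A one-line sketch of deriving that label from the schema above with Spark SQL functions (the DataFrame name flights is an assumption):

from pyspark.sql import functions as F

# 1 = on time (arrival delay under 15 minutes), 0 = delayed
flights = flights.withColumn('ontime', (F.col('ARR_DELAY') < 15).cast('integer'))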

Logistic Regression via Spark

more info

After recording all data into CSV, we can load it into a DataFrame or RDD (difference), generate a DataFrame that contains the results of feature engineering, and then call train

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

examples = traindata.rdd.map(udf)    # udf maps each row to a LabeledPoint(label, features)
lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)

prediction

from pyspark.mllib.classification import LogisticRegressionModel

lrmodel = LogisticRegressionModel.load(sc, MODEL_FILE)
lrmodel.setThreshold(xxx)        # xxx: decision threshold on the predicted probability (e.g. 0.7)
lrmodel.predict(features)        # features: vector of independent variables for one flight

evaluation

def eval(labelpred):
    # labelpred: RDD of (label, predicted_probability) pairs
    cancel = labelpred.filter(lambda lp: lp[1] < 0.7)
    nocancel = labelpred.filter(lambda lp: lp[1] >= 0.7)
    corr_cancel = cancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()
    corr_nocancel = nocancel.filter(lambda lp: lp[0] == int(lp[1] >= 0.7)).count()

    cancel_denom = cancel.count()
    nocancel_denom = nocancel.count()
    if cancel_denom == 0:
        cancel_denom = 1
    if nocancel_denom == 0:
        nocancel_denom = 1
    return {'total_cancel': cancel.count(),
            'correct_cancel': float(corr_cancel) / cancel_denom,
            'total_noncancel': nocancel.count(),
            'correct_noncancel': float(corr_nocancel) / nocancel_denom}

TensorFlow via Cloud Dataproc

More info

  • Why Wide & Deep helps (image; source)

  • How to implement Wide & Deep (image)

  • How TensorFlow scales
    more info

  • How to scale the gcloud AI Platform
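
A minimal Wide & Deep sketch using the TF Estimator API; the feature columns, hash bucket sizes, and hidden-unit sizes are illustrative choices based on the schema above, not the original notebook's configuration:

import tensorflow as tf

# Wide side: sparse, memorization-friendly categorical and crossed features
carrier = tf.feature_column.categorical_column_with_hash_bucket('CARRIER', 100)
origin = tf.feature_column.categorical_column_with_hash_bucket('ORIGIN', 1000)
dest = tf.feature_column.categorical_column_with_hash_bucket('DEST', 1000)
route = tf.feature_column.crossed_column([origin, dest], hash_bucket_size=10000)
wide_columns = [carrier, origin, dest, route]

# Deep side: dense numeric and embedded features for generalization
deep_columns = [
    tf.feature_column.numeric_column('DEP_DELAY'),
    tf.feature_column.numeric_column('TAXI_OUT'),
    tf.feature_column.numeric_column('DISTANCE'),
    tf.feature_column.embedding_column(origin, dimension=8),
    tf.feature_column.embedding_column(dest, dimension=8),
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[64, 32])
# model.train(input_fn=...) with an input_fn that yields the columns above plus the label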

Flight Delay Estimation(open source stack)

  • Keywords: SparkML, Scikit-Learn, MongoDB, Kafka
  • My experiment: note
  • Summary
    • During development, I built all dependencies and the Python connectors into Docker images
    • For the development stage, you need to configure each Docker image and make sure the dependencies work together
    • To scale out, you need Docker orchestration tools such as Kubernetes (K8s)
    • If the environment is set up well, developing locally is similar to developing on a public cloud like gcloud or AWS, but harder to manage

Classification

more info

How to improve the prediction model and how to evaluate it

more info
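
As a sketch of how the same on-time classification could be trained and evaluated with scikit-learn (the feature matrix X and label vector y are assumed to come from the feature engineering described above):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Threshold-free quality via ROC AUC, per-class precision/recall via the report
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, clf.predict(X_test)))

# Cross-validated accuracy as a sanity check that the holdout estimate is stable
print(cross_val_score(clf, X_train, y_train, cv=5).mean())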
