You may not submit a project that was not started and completed during the datathon. Check out the provided datasets listed below as a starting point for your project, or feel free to choose a different dataset; just provide the source of the data along with your submission. Use of Kaggle datasets is discouraged.

We have 5 Tracks at the Datathon this year!

  1. Chevron Challenge
  2. Predicting the Severity of Forest Fires: Applied Machine Learning for Social Good (Sponsored by Accenture)
  3. Machine Learning Systems Track
  4. Houston/Texas Trends Track
  5. Various Other Provided Datasets

Chevron Challenge

The process of drilling new wells, especially offshore, is extremely challenging and costly. After reaching the seabed more than 3,000 feet under water, rigs in the Gulf of Mexico must drill through an additional 20,000 feet of rock. In these extreme environments, where temperatures and pressures far exceed regular drilling conditions, specialized equipment and teams are required. These operations can involve hundreds of people and large amounts of equipment, with very high daily drilling costs. Reducing the time it takes to drill by even a few hours per well can result in significant savings for the company and provide a significant competitive advantage as more and more wells are drilled. Your task is to use historic drilling data to build a model that can predict the rate of penetration (ROP) conditional on controllable drilling parameters and/or parameters that are known before the drilling process begins. This model could then be used as part of the “Drilling Roadmap” development process, where the weight on bit (WOB) and rotary speed (RPM), among other variables, are determined before drilling begins.
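
As a rough illustration of the kind of model the task describes, here is a minimal baseline sketch that fits the rate of penetration as a linear function of WOB and RPM using ordinary least squares. Everything in it is an assumption made for illustration: the function names are hypothetical, the linear form is only a starting point, and the numbers are synthetic placeholders rather than values from the Chevron data.

```python
def fit_linear(features, targets):
    """Least-squares fit of target ~ b0 + b1*x1 + ... via the normal equations."""
    design = [[1.0, *row] for row in features]  # prepend an intercept column
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def predict(coefs, row):
    """Predict the target for one (WOB, RPM) row from fitted coefficients."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Synthetic placeholder data (NOT from the challenge files): a grid of
# hypothetical WOB and RPM settings with an exactly linear made-up ROP.
features = [(wob, rpm) for wob in (10, 20, 30, 40) for rpm in (60, 100, 140, 180)]
targets = [5.0 + 2.0 * wob + 0.1 * rpm for wob, rpm in features]

coefs = fit_linear(features, targets)
print("intercept, WOB coef, RPM coef:", coefs)
print("predicted ROP at WOB=25, RPM=120:", predict(coefs, (25, 120)))
```

Real drilling data will likely call for more inputs and a more flexible model, but a simple baseline like this gives later models something to beat.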

Download the rest of the information and the data here:

Chevron Challenge Zip File

Scoring.csv File

Predicting the Severity of Forest Fires: Applied Machine Learning for Social Good (Sponsored by Accenture)


Forest fires are a serious natural disaster experienced by countries around the world. Recently, forest fires in Australia and California have engulfed huge swaths of habitat causing untold loss to local communities, wildlife, and ecosystems. Forest fires across the globe are responsible for killing thousands per year and costing billions of dollars.


Your goal is to develop a model that uses a wildfire’s initial conditions to predict its severity, measured by the area it is expected to engulf. You then need to create a plan for how your model should be used by firefighters or other governmental entities to combat this issue more intelligently and effectively. In your plan, take a step back from the numbers and discuss realistic data accessibility, whether the key factors found in your model make sense, and whether you would trust your model in this high-stakes situation.

Dataset Description

You are free to use any dataset you find online that assists you in answering the overall question. We have provided an initial dataset to help you get started.

Kaggle Forest Fire Dataset

This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. It is the third update of a publication originally generated to support the national Fire Program Analysis (FPA) system. The wildfire records were acquired from the reporting systems of federal, state, and local fire organizations. The following core data elements were required for records to be included in this data publication: discovery date, final fire size, and a point location at least as precise as Public Land Survey System (PLSS) section (1-square mile grid). The data were transformed to conform, when possible, to the data standards of the National Wildfire Coordinating Group (NWCG). Basic error-checking was performed and redundant records were identified and removed, to the degree possible. The resulting product, referred to as the Fire Program Analysis Fire-occurrence Database (FPA FOD), includes 1.88 million geo-referenced wildfire records, representing a total of 140 million acres burned during the 24-year period.
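
Because severity here is tied to the area a fire engulfs, one option is to turn the final fire size into an ordinal label before modeling. The sketch below assumes the NWCG fire size classes (A through G, with thresholds in acres). The function name is hypothetical, and the challenge does not prescribe this particular target.

```python
# Lower bounds in acres for the larger NWCG size classes, checked from the
# largest class down. These thresholds are an assumption for this sketch.
_CLASS_LOWER_BOUNDS = [
    (5000.0, "G"),
    (1000.0, "F"),
    (300.0, "E"),
    (100.0, "D"),
    (10.0, "C"),
]


def fire_size_class(acres):
    """Map a final fire size in acres to a size class letter from A to G."""
    if acres <= 0:
        raise ValueError("fire size must be positive")
    for lower, label in _CLASS_LOWER_BOUNDS:
        if acres >= lower:
            return label
    # Class A covers fires up to a quarter acre; class B covers the rest below 10.
    return "B" if acres > 0.25 else "A"


for size in (0.1, 5, 42, 150, 750, 2500, 12000):
    print(size, "acres ->", fire_size_class(size))
```

Predicting a class instead of a raw acreage turns the problem into classification, which can be easier to explain to firefighters, at the cost of losing detail within each class.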


Machine Learning Systems Track


This track is designed for students who may not have an extensive background in modeling but are experienced programmers who are looking to get involved with data science. In practice, especially in large industry projects, the modeling portion is generally only a small portion of the overall effort in serving a production-grade machine learning system. To reflect that, we’re dedicating this track specifically for building tools for machine learning, fusing datasets, or interesting applications related to data science but not modeling per se.


The outcome of this track depends heavily on which part of the data science pipeline you wish to contribute to. Here are some project suggestions to give you an idea of what might fall under this track!

  • Data Engineering/Scraping Project Suggestions
    • Combining related datasets in an interesting/useful way
    • Ex: Collecting data tied to different US States to develop an API for easily accessing and comparing information
  • ML Tools
    • WYSIWYG Neural Network Application
    • Web application that lets you create drag-and-drop neural networks
  • Data Visualization Tools
    • Automatic Data Summarization - Auto-generate summaries and visualizations of arbitrary CSV files
    • Choropleth Plotting Toolkit - Matplotlib / Altair / Geopandas have great utilities for this, but I always have to refer to documentation because I forget! It would be great if I could pass a dataframe with “state, numeric feature” to a function to get a quick plot.
    • Complex Data Visualization to React Component - Can you make it easier to turn a complex & potentially live visualization into a component for a web frontend to consume?
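
To make the Automatic Data Summarization suggestion above more concrete, here is a minimal sketch of what a first version might look like using only the Python standard library. The function name, the returned dictionary layout, and the sample CSV are all hypothetical choices for illustration, and a real tool would also need to handle dates, large files, and plotting.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return {column name: summary dict} for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            value = row.get(name)
            if value not in (None, ""):  # skip missing cells
                columns[name].append(value)

    summary = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None  # at least one cell is not numeric
        if numbers:
            summary[name] = {
                "type": "numeric",
                "count": len(numbers),
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
        else:
            summary[name] = {
                "type": "text",
                "count": len(values),
                "unique": len(set(values)),
            }
    return summary


# Made-up sample input; the last row has a missing score.
sample = "state,score\nTX,10\nCA,20\nTX,\n"
for column, stats in summarize_csv(sample).items():
    print(column, stats)
```

Treating a column as numeric only when every non-empty cell parses as a number keeps the logic simple; a fuller tool might instead report the share of cells that fail to parse.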

Houston/Texas Trends Track


This track is designed for students who are specifically interested in working with a Houston or Texas related dataset. These projects should draft a narrative about a trend observed from a local dataset.


The outcome of this track is a presentation that paints a coherent story with supporting figures from Houston / Texas related datasets. Great projects might extend this presentation to include a simple policy suggestion to the Houston City Government or Texas State Government.

Kinder UDP

The Kinder Urban Data Platform has a wealth of data collected and maintained by the Kinder Institute. Find datasets here:

Sample Datasets:

Open Data Houston

Houston’s open data portal! A collection of publicly accessible data on Houston local government activities.

Texas Open Data Portal

Texas’ open data portal! A collection of publicly available data on Texas state government activities.

Other Interesting Datasets To Consider

College/University Survey Data

Get the Dataset Here

The datasets available from this search tool offer a wide variety of data, ranging from institutional characteristics of colleges to their test scores and graduation rates. Choose the year and type of data you are looking for and press 'continue'. The column labeled 'data file' has a download link for the dataset. The rightmost column, labeled 'dictionary', lets you download a spreadsheet containing a breakdown of what the features/labels in the dataset represent.

2018 World Cup Match Data

Get the Dataset Here

The leftmost link contains match data from the 2018 World Cup, courtesy of FiveThirtyEight. The right link contains their tournament predictions. Do not copy these as your own result; find a new way to analyze the data.

Divorce Prediction Dataset

Get the Dataset Here

Each row in this dataset represents an individual's responses to a survey containing 54 questions about various aspects of their marriage. The rows are labeled based on whether or not the relationship ended in divorce. Have fun!

UK Online Retail Invoices

Get the Dataset Here

This dataset contains the data from about 1,000,000 invoices from a UK online retail store between 2009 and 2011. The link 'data folder' at the top contains a download link for the dataset.