Work in progress. I will write about each approach in more detail later.

This post summarizes the tools I have found for connecting to Hadoop and running geospatial processing on a large dataset. I am working on a ~100 GB Hive table, which is only a small subset of the original dataset.

  1. http://geospark.datasyslab.org/
  2. https://pypi.org/project/geopyspark/
  3. https://github.com/Esri/gis-tools-for-hadoop/wiki
  4. Kinetica GPU Database – Graph solver and Match solver
  5. PySpark Python libraries
  6. SpatialHadoop
  7. Alteryx – Using the Connect In-DB tool to connect to Hadoop
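
To give a rough idea of what the plain PySpark route (item 5) can look like, here is a minimal sketch, not a tested pipeline. It assumes the Hive table has two DOUBLE columns named `lon` and `lat`; those column names, the table name and the helper `filter_hive_table` are all hypothetical and only there for illustration. The point-in-polygon test is ordinary ray casting in pure Python, wrapped in a Spark UDF.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(lon: float, lat: float, polygon: List[Point]) -> bool:
    """Ray-casting test: True if (lon, lat) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_hive_table(spark, table_name: str, polygon: List[Point]):
    """Keep only the rows of a Hive table whose point is inside the polygon.

    Assumption: the table has DOUBLE columns named `lon` and `lat`
    (hypothetical names, adjust to the real schema).
    """
    # Imported lazily so point_in_polygon can be used without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    inside_udf = udf(
        lambda lon, lat: point_in_polygon(lon, lat, polygon), BooleanType()
    )
    return spark.table(table_name).where(inside_udf(col("lon"), col("lat")))


if __name__ == "__main__":
    square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    print(point_in_polygon(5.0, 5.0, square))
    print(point_in_polygon(15.0, 5.0, square))
```

Rows with NULL coordinates would need to be dropped before applying the UDF. A row-by-row Python UDF like this is likely to be slow on a table of this size, which is part of the reason for looking at the dedicated spatial libraries in the list above.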