<img src="http://nci.org.au/wp-content/themes/nci/img/img-logo-large.png" width=400>
PDAL (the Point Data Abstraction Library) is an open source library for handling massive point data sets. Its name is derived from GDAL, since it aims to fill the same role for point data.
PDAL is actually a C++ library - if you're writing applications you can link it into your code. It also has Python bindings. Today we'll explore some of PDAL's capabilities using its command line applications - which are mostly wrappers around PDAL's pipeline functions.
We'll also use a sneaky bit of LibLAS: http://www.liblas.org
...but you'll hopefully see why we'd prefer PDAL in the end.
If you deal with point data of almost any kind - airborne LiDAR, photogrammetry, terrestrial scans, sonar - PDAL can make your life a lot simpler with some basic processing tasks.
PDAL is demonstrated at this workshop because point data exist on NCI's filesystem, and the VDI is the perfect place to work on them without having to ship unnecessary data. Using the VDI as a front end for development is an excellent way to prototype point data analyses on data that exist in the NCI filesystem, or to define specific subsets to take away for analysis.
A lightning-speed overview of point data handling and manipulation:
We will do all this on the command line, viewing results in CloudCompare or QGIS. These tasks are based on the PDAL workshop here: http://www.pdal.io/workshop/index.html, and are very much 'learn by doing'. PDAL is very well documented, please keep reading for more information.
...so feel free to zoom ahead, create and share!
We will use a 2013 LiDAR survey over Merimbula, from the Geoscience Australia elevation reference collection. Here are its catalogue entries:
THREDDS
http://dapds00.nci.org.au/thredds/catalog/rr1/Elevation/Merimbula0313/catalog.html
Geonetwork
https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f9082_1236_9859_8989
Licensed under Creative Commons Attribution 4.0 (CC BY 4.0): http://dapds00.nci.org.au/thredds/fileServer/licenses/rr1_licence.pdf
The path to the data on your VDI desktop is:
/g/data1/Elevation/Merimbula0313/
...and LAS tiles are in:
/g/data1/Elevation/Merimbula0313/Tiles_2k_2k
Try:
pdal info /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las
...and compare with:
lasinfo /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las
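pdal info also takes switches for more targeted queries - per-dimension summary statistics, or just the header metadata (both documented in the PDAL manual; the exact output layout varies a little between versions):

pdal info --stats /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las
pdal info --metadata /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las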
Straight into the fire! We're diving right into PDAL's pipeline architecture, which gives it an enormous amount of flexibility and power. A pipeline is a set of operations chained together, defined in a JSON file. You'll see it in action here.
LAS tiles are pretty hard to handle - you get a lot of extra data that you may not want, and they are pretty much always boringly square. If we only need a certain region, we can get just those points using PDAL.
The image here shows a lot of data - an index of LiDAR tiles labelled by filename, an OpenStreetMap layer giving us an idea of where Merimbula town is, and a polygon roughly around the urban area. These are all in the QGIS project given above.
The image here shows that Merimbula town covers several LiDAR tiles. This is an excellent challenge - it means some tiles hold a lot of data we don't want, and we have to sift through multiple tiles to find the data we're after. We can see a rough polygon surrounding our region of interest, defined in QGIS and saved as GeoJSON.
Here's the GeoJSON polygon:
{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::28355" } },
"features": [
{ "type": "Feature", "properties": { "id": 0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 759094.480855233967304, 5913008.271593709476292 ], [ 758464.699909413931891, 5912716.199270982295275 ], [ 757743.646362751838751, 5912898.744472593069077 ], [ 757716.26458250079304, 5913304.907546310685575 ], [ 757373.992329337401316, 5913418.998297326266766 ], [ 757018.029186049010605, 5913724.761510098353028 ], [ 757556.537531022448093, 5913784.088700683787465 ], [ 757828.153738587978296, 5913997.946465536952019 ], [ 757828.153738587396219, 5914326.52782854065299 ], [ 758357.534823469701223, 5914381.291389083489776 ], [ 758877.788648267393, 5914554.709330711513758 ], [ 758850.406868015765212, 5914810.272613044828176 ], [ 759042.079329782165587, 5914837.654393311589956 ], [ 759151.606450793216936, 5914673.363711818121374 ], [ 759370.660692813224159, 5914709.872752171941102 ], [ 759361.533432727912441, 5915102.34493575617671 ], [ 760593.713544093072414, 5915138.853976195678115 ], [ 761177.858189482591115, 5915047.581375411711633 ], [ 761123.094628979102708, 5914235.255227984860539 ], [ 761260.003530243760906, 5914007.07372591085732 ], [ 761570.33037310524378, 5913952.31016543880105 ], [ 761369.530651255394332, 5913559.837981833145022 ], [ 761141.349149147979915, 5913459.438120897859335 ], [ 760484.186423085397109, 5913377.292780089192092 ], [ 759817.896436938317493, 5913632.856062367558479 ], [ 759516.696854161447845, 5913550.710721591487527 ], [ 759416.29699323582463, 5913286.020179163664579 ], [ 759094.480855233967304, 5913008.271593709476292 ] ] ] } }
]
}
...but PDAL needs WKT (for now) - using this website: http://rodic.fr/blog/online-conversion-between-geometric-formats/, we can get a WKT polygon:
POLYGON((759094.480855234 5913008.2715937095,758464.6999094139 5912716.199270982,757743.6463627518 5912898.744472593,757716.2645825008 5913304.907546311,757373.9923293374 5913418.998297326,757018.029186049 5913724.761510098,757556.5375310224 5913784.088700684,757828.153738588 5913997.946465537,757828.1537385874 5914326.527828541,758357.5348234697 5914381.2913890835,758877.7886482674 5914554.7093307115,758850.4068680158 5914810.272613045,759042.0793297822 5914837.654393312,759151.6064507932 5914673.363711818,759370.6606928132 5914709.872752172,759361.5334327279 5915102.344935756,760593.7135440931 5915138.853976196,761177.8581894826 5915047.581375412,761123.0946289791 5914235.255227985,761260.0035302438 5914007.073725911,761570.3303731052 5913952.310165439,761369.5306512554 5913559.837981833,761141.349149148 5913459.438120898,760484.1864230854 5913377.292780089,759817.8964369383 5913632.856062368,759516.6968541614 5913550.7107215915,759416.2969932358 5913286.020179164,759094.480855234 5913008.2715937095))
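If you'd rather do the conversion offline, the same result can be had in a few lines of Python - a sketch assuming the shapely package is available and that the polygon was saved from QGIS as merimbula_poly.geojson (a hypothetical file name):

import json
from shapely.geometry import shape

# Read the GeoJSON FeatureCollection and print the first feature's geometry as WKT
with open("merimbula_poly.geojson") as f:
    fc = json.load(f)
print(shape(fc["features"][0]["geometry"]).wkt)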
We need to know which tiles contain our data - by labelling the tile index layer in QGIS we can pick out the following set:
Merimbula2013-C3-AHD_7565916_55_0002_0002.las
Merimbula2013-C3-AHD_7565914_55_0002_0002.las
Merimbula2013-C3-AHD_7565912_55_0002_0002.las
Merimbula2013-C3-AHD_7585916_55_0002_0002.las
Merimbula2013-C3-AHD_7585914_55_0002_0002.las
Merimbula2013-C3-AHD_7585912_55_0002_0002.las
Merimbula2013-C3-AHD_7605916_55_0002_0002.las
Merimbula2013-C3-AHD_7605914_55_0002_0002.las
Merimbula2013-C3-AHD_7605912_55_0002_0002.las
We create a JSON file which tells PDAL what to do:
nano merimbula_clip.json
...and paste in the following:
{
"pipeline": [
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565916_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565914_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565912_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585916_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585914_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585912_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605916_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605914_55_0002_0002.las",
"../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605912_55_0002_0002.las",
{
"type": "filters.crop",
"polygon": "POLYGON((759094.480855234 5913008.2715937095,758464.6999094139 5912716.199270982,757743.6463627518 5912898.744472593,757716.2645825008 5913304.907546311,757373.9923293374 5913418.998297326,757018.029186049 5913724.761510098,757556.5375310224 5913784.088700684,757828.153738588 5913997.946465537,757828.1537385874 5914326.527828541,758357.5348234697 5914381.2913890835,758877.7886482674 5914554.7093307115,758850.4068680158 5914810.272613045,759042.0793297822 5914837.654393312,759151.6064507932 5914673.363711818,759370.6606928132 5914709.872752172,759361.5334327279 5915102.344935756,760593.7135440931 5915138.853976196,761177.8581894826 5915047.581375412,761123.0946289791 5914235.255227985,761260.0035302438 5914007.073725911,761570.3303731052 5913952.310165439,761369.5306512554 5913559.837981833,761141.349149148 5913459.438120898,760484.1864230854 5913377.292780089,759817.8964369383 5913632.856062368,759516.6968541614 5913550.7107215915,759416.2969932358 5913286.020179164,759094.480855234 5913008.2715937095))"
"outside": false
},
"./merimbulatown_clip.las"
]
}
Then we execute the task using:
pdal pipeline merimbula_clip.json
Time to execute on an 8 core, 16GB RAM VM: 1 minute 12 seconds
**Yes!** If we first generate a tile index and give PDAL a better idea of which data to use, we can do this job in **49 seconds**, including compression to LAZ, using PDAL's tindex reader. This still generates a 46 MB LAZ file (388 MB uncompressed LAS), so a round trip of a couple of minutes would be expected for a web-based request delivering a data subset to your machine.
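For the curious, building that index is a one-off job along these lines - a sketch only, since the tindex application's arguments have changed between PDAL releases (check pdal tindex --help). The resulting index can then drive PDAL's tindex reader so that only intersecting tiles are touched:

pdal tindex create ./merimbula_tindex.sqlite "Tiles_2k_2k/*.las" --ogrdriver SQLite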
PDAL also supports file name globbing in pipelines. However, a glob would pull in every tile in the directory - explicitly listing just the tiles that intersect our polygon keeps the reading (and the run time) down.
This will result in a set of points inside your polygon being written into a .LAS file at the location specified in the pipeline file. Now you have a template for doing this job with pretty much any LAS tiles!
How much data do we have?
pdal info ./merimbulatown_clip.las
...tells us we have 11,948,596 points - so our time to process is really not bad (read nine LAS tiles, clip, merge, and write an 11.9-million-point dataset).
So let's explore our new dataset. In a new terminal, type:
cloudcompare &
...and use its File/Open menu to navigate to your newly made LAS file. Take a look at it there (hint - use the projections menu to convert the Z dimension to a scalar field to colour your points by height).
Here's an example. Note CloudCompare's default point colouring scheme is 'point source' - useful, but not pretty. Use the 'scalar fields' dropdown in the lower left panel to change your colour scheme - the screenshot here uses intensity.
If your polygon is quite large, or your points very dense, or both, you can still get a massive dataset! Use pdal info to get an estimate of how dense the data are, and figure out how much area you are clipping to estimate the final file size before going ahead.
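A back-of-envelope estimate takes a few lines of Python - every number here is an assumption to adjust for your own data (the 34 bytes per point is LAS point record format 3; record lengths vary by point format, so check yours with pdal info):

# rough size estimate for a clip job
density = 3.0          # points per square metre, from pdal info or a density map
area_m2 = 4.0e6        # clip polygon area in square metres (~2 km x 2 km here)
bytes_per_point = 34   # LAS point record format 3 - check your file

points = density * area_m2
print("~%.1f million points, ~%.0f MB uncompressed" % (points / 1e6, points * bytes_per_point / 1e6))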
Colour your points by height.
If you only want a specific classification from your cloud, here's how to extract it using PDAL. We take advantage of the LAS specification, knowing that the numerical code for points of class 'building' is 6.
See Table 4.9 in this document: http://www.asprs.org/wp-content/uploads/2010/12/LAS_1-4_R6.pdf for a list of standard classification codes.
Save the following in merimbula_buildings.json:
{
"pipeline": [
"./merimbulatown_clip.las",
{
"limits": "Classification[6:6]",
"type": "filters.range"
},
"./merimbulabuildings.las"
]
}
...and execute:
pdal pipeline merimbula_buildings.json
Time to execute: 6.071 seconds
...and add a new layer in CloudCompare - here the point cloud of buildings we just made is coloured by point source, since it contrasts nicely.
Can you do this job using las2las? Why would you use PDAL instead?
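(A hint for the comparison - with LibLAS's las2las it looks roughly like this, though flag spellings differ between LibLAS versions and LAStools' tool of the same name, so check las2las --help:)

las2las -i ./merimbulatown_clip.las -o ./merimbulabuildings_las2las.las --keep-classes 6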
When writing your own data, using ASPRS classification codes to flag buildings, ground, trees etc. is a good start - it helps you integrate with a lot of existing analysis tools.
Vendor-supplied ground can be poorly documented, or may not meet your requirements for other reasons. You can use PDAL to construct your own, with more control over how ground is parameterised. Also, PDAL implements some relatively recent ground classification algorithms.
Here we use the Simple Morphological Filter (Pingel, T.J., Clarke, K.C., McBride, W.A., 2013. An improved simple morphological filter for the terrain classification of airborne LIDAR data. ISPRS J. Photogramm. Remote Sens. 77, 21–30). Save the following as merimbula_smrf.json:
{
"pipeline":[
"./merimbulatown_clip.las",
{
"type":"filters.smrf",
"extract": false,
"cell": 2.0,
"window": 42.0
},
"./merimbulatown_smrf.laz"
]
}
pdal pipeline merimbula_smrf.json
This takes some time - and the results need inspecting. Did these parameters do a better job than the vendor?
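One quick way to inspect: pull out just the points SMRF labelled as ground (class 2) and eyeball them against the vendor's ground class in CloudCompare. A minimal sketch reusing the range-filter pattern from earlier (file names here are just suggestions):

{
  "pipeline":[
    "./merimbulatown_smrf.laz",
    {
      "limits": "Classification[2:2]",
      "type": "filters.range"
    },
    "./merimbulatown_smrf_ground.las"
  ]
}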
PDAL and GDAL work hand in glove, with PDAL writing GDAL-readable rasters (here via the points2grid writer, writers.p2g) to create raster elevation models. Here we look at two types, the DSM and the DTM/DEM.
First a DSM - Digital Surface Model - meaning the top of everything in the point cloud.
This is relatively simple given what we've accomplished already. Here's the pipeline to do it - save it as merimbula_dsm.json:
{
"pipeline": [
"./merimbulatown_clip.las",
{
"filename":"./merimbula_dsm",
"output_format":"tif",
"output_type":"all",
"grid_dist_x":"2.0",
"grid_dist_y":"2.0",
"type": "writers.p2g"
}
]
}
...and we get the pattern by now:
pdal pipeline merimbula_dsm.json
Time to execute: 27.356 seconds
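A quick sanity check is to hillshade the new raster with GDAL's gdaldem. One assumption here: with output_type 'all', the points2grid writer emits one raster per statistic, so the .max.tif file name below is a guess - check what actually landed on disk:

gdaldem hillshade ./merimbula_dsm.max.tif ./merimbula_dsm_hillshade.tif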
For this one we need to ignore everything that is not 'ground' - let's do this without creating a new .las file containing only ground points. In this example we pipe the result of a filter into another process, starting to show how PDAL operations can be chained. In the first block we limit our selection to points of class 2 (the ASPRS LAS code for 'ground'), then pass the result to the raster generator in the second block. Save it as merimbula_dtm.json:
{
"pipeline": [
"./merimbulatown_clip.las",
{
"limits": "Classification[2:2]",
"type": "filters.range"
},
{
"filename":"./merimbula_dtm",
"output_format":"tif",
"output_type":"all",
"grid_dist_x":"2.0",
"grid_dist_y":"2.0",
"type": "writers.p2g"
}
]
}
...and once more:
pdal pipeline merimbula_dtm.json
Time to execute: 12.34 seconds
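With both rasters made, subtracting the DTM from the DSM gives a normalised surface model - the heights of buildings and canopy above ground. A sketch using GDAL's gdal_calc.py, again assuming the .max.tif output names:

gdal_calc.py -A ./merimbula_dsm.max.tif -B ./merimbula_dtm.max.tif --outfile=./merimbula_ndsm.tif --calc="A-B"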
Not even a pipeline! PDAL can create a set of hexagonal bins to show the actual point coverage and density. While the metadata for this survey gives an average density for the entire dataset, what does it look like in real life?
First, let's set ourselves a sane sample area, say 20 m² per cell. PDAL lets us set an edge size - so how long is each edge of a hexagon with an area of 20 m²?
In [1]:
from math import sqrt
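# a regular hexagon with side s has area (3*sqrt(3)/2) * s**2, so s = sqrt(area / (3/2 * sqrt(3)))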
def hexside(area):
return sqrt(area / (3 / 2 * sqrt(3)))
In [2]:
hexside(20)
Out[2]: 2.7745276335252114
OK, apply! This gives us a SQLite dataset of 20 m² hexagonal bins that can be loaded straight into QGIS:
pdal density -i ./merimbulatown_clip.las -o merimbuladensity_20m.sqlite --filters.hexbin.edge_size=2.7745276335252114
Looking at our density map, classified by point count per hexagon, we see that most of the coverage has fewer than 3 points per square metre (cyan), and a reasonable proportion has fewer than 1 point per square metre (blue). The overall density metric has been skewed a bit by swath overlaps, which reach up to 10 points per square metre!
Based on this information, a 2 m DTM is reasonable for a lot of the data (at least two samples per cell), but a 5 m DTM might be more realistic (at least 5 samples per cell).
How could we use PDAL to give us a consistent point density across our dataset?
*Hint: check out http://www.pdal.io/stages/filters.dartsample.html*
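As a starting point, here's a sketch of a thinning pipeline - assuming filters.dartsample takes a radius option as described on the page linked above (with a 0.7 m radius, no two retained points would be closer than 0.7 m, which caps the local density):

{
  "pipeline":[
    "./merimbulatown_clip.las",
    {
      "type": "filters.dartsample",
      "radius": 0.7
    },
    "./merimbulatown_thinned.las"
  ]
}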
Sometimes we want relative height above ground (HAG). You might be interested in the tallest tree, or the mean height of shrubs, or - in this case - finding the tallest building in a region. We now know how to pipe PDAL operations together, so let's do exactly that. Save the following as merimbula_hag.json:
{
"pipeline":[
"merimbulatown_clip.las",
{
"type":"filters.height"
},
{
"type":"filters.ferry",
"dimensions":"HeightAboveGround = Z",
},
"merimbulatown_hag.las"
]
}
pdal pipeline merimbula_hag.json
Execution time 35 seconds
...gets us height above ground! Here's a profile view of the data. The new height-above-ground points are coloured, original data in grey (translated 100 m upward).
But where is the tallest building? One approach is to make a dataset of building heights and consult pdal info
for some guidance.
nano find_tallest.json
{
"pipeline":[
"merimbulatown_hag.las",
{
"limits": "Classification[6:6]",
"type": "filters.range"
},
"merimbulabuildings_hag.las"
]
}
This is our pipeline for selecting a single class. Use pdal info to find out how tall the tallest building is, and use that value to guide the next pipeline - which adds a height filter and returns a comma delimited text file.
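For example, pdal info's --stats switch reports per-dimension summary statistics - after running the pipeline above, the maximum of Z (which is now height above ground) is the number we want:

pdal info --stats ./merimbulabuildings_hag.las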
{
"pipeline": [
"./merimbula_hag.las",
{
"limits": "Classification[6:6]",
"type": "filters.range"
},
{
"limits": "Z[23.5:25]",
"type": "filters.range"
},
"./merimbulabuildings_hag_tallest.txt"
]
}
pdal pipeline find_tallest.json
...which we can inspect in our QGIS project. Load the CSV file up and see where Merimbula's tallest 'building' is - styled as a red dot in the map below. Do you think they're really buildings?
This is a pretty clunky way of finding the highest point in a dataset. An alternative might be writing a custom filter - or using PDAL's Python bindings to ingest data into a NumPy array. For the second approach, just be mindful of memory consumption.
***Note:*** using the PDAL-python approach, you could achieve most of the outcomes here using array operations - the main limitation being that you need to ingest all the data into memory at once (PDAL streams data - keeping only chunks at a time in memory).
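A minimal sketch of that Python-bindings approach, assuming the pdal Python package is installed (its API has shifted between releases, so treat this as indicative):

import json
import pdal

# Build a trivial pipeline that just reads the clipped, height-above-ground file
pipeline = pdal.Pipeline(json.dumps({"pipeline": ["./merimbulatown_hag.las"]}))
pipeline.execute()  # runs in-process; all points end up in memory

# Points arrive as a structured NumPy array - find the tallest building directly
points = pipeline.arrays[0]
buildings = points[points["Classification"] == 6]
print("Tallest building (HAG, metres):", buildings["Z"].max())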
PDAL has a wide range of readers built in - including a configurable text reader. Because it is built with GDAL in mind, it will generally ingest GDAL formats.
Let's try reading a gridded NetCDF file as a point cloud:
{
"pipeline":[
{
"type":"readers.gdal",
"filename":"./IR_gravity_anomaly_Australia_V1.nc"
},
{
"type":"filters.ferry"
"dimensions":"Latitude=Y, Longitude=X, grav_ir_anomaly=Z",
},
{
"type":"writers.laz",
"filename":"./gravityanomaly.laz"
}
]
}
Did this work? No! Why not? Because we have not (yet) built a schema to read out the gridded gravity anomaly!
Why include a broken example? To inspire you! What could we do if this function worked?
Also waiting to see if I can get hold of some experimental geophysics point data... depending on how they're packed, this might work a lot more easily! Then the open-ended question is 'why use PDAL and not native netCDF IO?'
Points are no longer your enemy! Using a completely open source stack and the VDI, you can whip point data into shape quickly and easily. You don't need to copy the raw tiles anywhere, only your derived datasets.
Still, keep an eye on those - they can get quite large.
Just to reiterate, PDAL and CloudCompare will happily handle data from 3D photogrammetry, terrestrial scanners, mobile (non-airborne) scanners, Zebedees, sonars, radars... whatever produces data arranged as points in space.
This workbook only goes through simple pipelines - a more complex example might clip, classify, and compute height above ground in one step.
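A sketch of what that could look like, chaining only stages used earlier in this workbook - substitute the full WKT polygon and your real tile paths, and treat the whole thing as an untested template:

{
  "pipeline":[
    "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605914_55_0002_0002.las",
    {
      "type": "filters.crop",
      "polygon": "POLYGON((...))",
      "outside": false
    },
    {
      "type": "filters.smrf",
      "extract": false,
      "cell": 2.0,
      "window": 42.0
    },
    {
      "type": "filters.height"
    },
    "./merimbula_clip_classified_hag.las"
  ]
}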
PDAL also underpins processing for visualisations like web-based massive point cloud rendering - we've used PDAL to apply colour from aerial imagery to LiDAR points prior to indexing for interactive web display.
PDAL's Python bindings - mentioned earlier - mean that point data can be piped straight into NumPy arrays. The disadvantage of this approach is that all the points need to be held in memory at once, so the subsetting demonstrated here will be a critical first step.
...ask us! But we don't know everything - so we might point you to:
Because you didn't build it yet! Fork PDAL and have your merry way with it: https://github.com/PDAL/PDAL
Yes! PDAL is actively maintained and developed - with a lot of community interest in keeping its momentum up. The glitches we see all have workarounds - and their presence on the issue list means they are being attended to.