English version was created automatically using Drupal module auto_node_translate and free DeepL translator.
How to get data from any web page using PHP Simple HTML DOM Parser
Published 2020-05-23
Occasionally you need data from a website you have no back-end access to: current exchange rates, temperatures, or similar data that is not offered in any other form. The data is already displayed somewhere, so it exists in the page's HTML code. The simplehtmldom library can be used to find and retrieve it.
I came across an interesting project: the geologist Mr. Mucha presented interesting places he had visited on his website. Each place had its coordinates listed on its page, but only as part of the regular text. Each place belonged to a locality. My task was to display a map (using Seznam maps) on each locality page, showing the individual places.
So I needed to:
- Get the necessary data (covered in this guide)
- Display it as a map with a marker for each place

https://www.geologie-astronomie.cz/Geologicke-lokality/Karlovarsko-a-za…
What machine-readable data I needed
To display the points on the map, I needed their coordinates. To use different markers - red for places in the current locality, grey for the others - I also needed to know which locality each place belongs to. Clicking a point should show the place's name, an image, and a link to its detail page.
So my machine data needed to include:
- Latitude (49°59'50.519"N)
- Longitude (12°37'12.477"E)
- Locality name (Karlovarsko-a-zapadni-Cechy)
- Place name (Kynžvartský kámen)
- Image URL
- Link to detail page
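
As a rough illustration, one record with the fields above could be held in a PHP associative array and serialized with json_encode(). The key names and the two example.com URLs are placeholders chosen for this sketch, not the values used on the real site.

```php
<?php
// One place record. Key names and both URLs are hypothetical placeholders.
$place = [
    'lat'      => '49°59\'50.519"N',
    'lon'      => '12°37\'12.477"E',
    'locality' => 'Karlovarsko-a-zapadni-Cechy',
    'name'     => 'Kynžvartský kámen',
    'image'    => 'https://example.com/images/place.jpg',
    'url'      => 'https://example.com/places/place',
];

// The full data set is simply a list of such records.
$allPlaces = [$place];

// Keep Czech characters and slashes readable in the JSON output.
$flags = JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES;
echo json_encode($allPlaces, $flags), PHP_EOL;
```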

Since the map on each locality page also showed the markers of all the other places, I needed one large file (array) with the information about every place. Unfortunately I had no access to a full CMS or database, and in any case some of the data (e.g. the coordinates) was written directly into the regular page text. So I had to extract it somehow.

Getting the machine data
I started with a simple PHP script. In it, I defined an array of all localities, identified by the names used in their URLs:
$location_all = array('Plzensko', 'Karlovarsko-a-zapadni-Cechy', ...
I looped through this array and retrieved the HTML of each locality page:
$location_url = 'https://www.geologie-astronomie.cz/Geologicke-lokality/' . $location;
$html = file_get_html($location_url);
file_get_html() is a basic simplehtmldom function that loads a given resource and returns its DOM, in my case the locality page that contained the data I needed. After inspecting the page source, I focused on the elements with the CSS class photo_gal_heading.

From the <a> tag I get the link to the place's detail page, and from the <img> tag I get the image URL and the place name. Along with that, I also store the name of the locality the place belongs to.
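
The extraction described above might look roughly like the sketch below. It assumes Simple HTML DOM 1.x is saved as simple_html_dom.php next to the script, and it guesses at the markup: an <a> wrapping an <img> inside the photo_gal_heading element, with the place name in the alt attribute. The function name extract_places() and the sample HTML are made up for illustration, and str_get_html() is used instead of file_get_html() so the example runs without a network request.

```php
<?php
// Assumption: Simple HTML DOM 1.x is stored next to this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Collect the places listed on one locality page.
 * The markup layout and the use of the alt attribute are assumptions.
 */
function extract_places(string $pageHtml, string $locality): array
{
    $places = [];
    $dom = str_get_html($pageHtml);
    if (!$dom) {
        return $places; // empty or unparsable input
    }
    foreach ($dom->find('.photo_gal_heading') as $heading) {
        $link  = $heading->find('a', 0);   // first <a> inside the heading
        $image = $heading->find('img', 0); // first <img> inside the heading
        if ($link === null || $image === null) {
            continue; // skip headings without the expected tags
        }
        $places[] = [
            'locality' => $locality,
            'name'     => $image->alt,
            'image'    => $image->src,
            'url'      => $link->href,
        ];
    }
    $dom->clear(); // release the memory held by the parser
    return $places;
}

// Hypothetical sample markup standing in for a downloaded locality page.
$sample = '<div class="photo_gal_heading"><a href="/places/example-place">'
        . '<img src="/img/example.jpg" alt="Example place"></a></div>';
print_r(extract_places($sample, 'Karlovarsko-a-zapadni-Cechy'));
```

For a live page, the string returned by file_get_contents() for the locality URL could be passed in as the first argument.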
Each locality contains several such places. Now that I know the links to all of them, I need to load each place's page and this time get the coordinates from its HTML code. I found it unnecessary to build the DOM again for this; the regular PHP function file_get_contents(), which returns the HTML code as one big string, was enough. In that string I searched for substrings such as '49°' or '50°' for latitude and '12°' to '17°' for longitude.
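
One way to sketch this search is with a regular expression rather than plain substring lookups; this is a swapped-in technique for illustration, not necessarily what the original script did. The sketch assumes the page contains the literal °, ' and " characters (no HTML entities) and that the source file is saved as UTF-8. The function name find_coordinates() is hypothetical.

```php
<?php
/**
 * Find the first latitude (N) and longitude (E) written like 49°59'50.519"N.
 * Returns ['lat' => ..., 'lon' => ...] or null if either one is missing.
 */
function find_coordinates(string $pageHtml): ?array
{
    // degrees ° minutes ' seconds " hemisphere, tolerating stray spaces
    $pattern = '/(\d{1,2})°\s*(\d{1,2})\'\s*(\d{1,2}(?:\.\d+)?)\s*"\s*([NE])/u';
    if (!preg_match_all($pattern, $pageHtml, $matches, PREG_SET_ORDER)) {
        return null; // no match, or a regex error
    }
    $found = [];
    foreach ($matches as $m) {
        $key = $m[4] === 'N' ? 'lat' : 'lon';
        if (!isset($found[$key])) {
            // Rebuild the value without the stray whitespace.
            $found[$key] = $m[1] . '°' . $m[2] . "'" . $m[3] . '"' . $m[4];
        }
    }
    return isset($found['lat'], $found['lon']) ? $found : null;
}

// Sample text using the coordinates quoted earlier in this article.
$sample = '<p>GPS: 49°59\'50.519 "N, 12°37\'12.477 "E</p>';
var_dump(find_coordinates($sample));
```

In a real run, the input would be the string returned by file_get_contents() for a place's detail page.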

I then saved the entire resulting array as a JSON string.
This is what the relevant part of the code looks like:

And the resulting JSON (output json_encode()):

Optimization
It turned out that a script like this, processing dozens of places (i.e. dozens of calls to file_get_contents()) together with nine calls to file_get_html(), takes quite a long time: about 10 seconds. Running it on every page view would mean an unnecessary delay in real operation. Since in this particular case the data does not have to be fully up to date (places are not added or changed often), it is enough to run the script from cron, perhaps once a day, and save its output to a file.
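
A minimal sketch of this caching step, assuming the scraped data is already in a PHP array: the cron-run script writes the JSON to a file, and the page that draws the map only reads that file. The function names, the file name places.json, and the crontab line in the comment are hypothetical.

```php
<?php
// Hypothetical crontab entry running the scraper once a day at 03:00:
//   0 3 * * * php /path/to/scrape.php

/** Save the scraped data as JSON; returns true on success. */
function write_cache(array $places, string $cacheFile): bool
{
    $json = json_encode($places, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return false; // the data could not be encoded
    }
    // Write to a temporary file first, then rename it, so that a reader
    // does not pick up a half-written cache file.
    $tmpFile = $cacheFile . '.tmp';
    if (file_put_contents($tmpFile, $json, LOCK_EX) === false) {
        return false;
    }
    return rename($tmpFile, $cacheFile);
}

/** Load the cached data; returns an empty array if no usable cache exists. */
function read_cache(string $cacheFile): array
{
    if (!is_readable($cacheFile)) {
        return [];
    }
    $data = json_decode(file_get_contents($cacheFile), true);
    return is_array($data) ? $data : [];
}

// Usage with a placeholder record and a file in the system temp directory.
$cacheFile = sys_get_temp_dir() . '/places.json';
write_cache([['name' => 'Kynžvartský kámen']], $cacheFile);
print_r(read_cache($cacheFile));
```

With this split, the slow scraping happens only in the scheduled run, and each page view costs a single local file read.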
Conclusion
In this tutorial I showed a simple example of using the "PHP Simple HTML DOM Parser" library to retrieve data from an HTML page. You can quickly explore its capabilities in its manual at https://simplehtmldom.sourceforge.io/manual.htm.
I then displayed the extracted data using Seznam maps - see "How to display custom waypoints and markers on a tourist map from Seznam?".