Experiments with Getting Web Content Using ColdFusion and PHP
So you want to get content from another web site to use on your own site. You might be screen scraping a page out there… But a better use is to get data from some sort of web API.
Case in point: calling a URI from the U.S. National Weather Service to get current weather information.
… This URI will return weather data from Chicago’s Midway Airport
Here is a ColdFusion file for getting info on the current weather in Chicago:
It works fine. I make an Ajax call to “weather.cfm” from my home page, parse out the current temperature and weather description, and display them on my home page. Nice!
If I call “weather.cfm” directly, I get Chicago’s weather data back:
Below you can roughly see how it displays this page in Safari:
If you were to view the source, this is approximately what you would see, XML output:
This was running in ColdFusion Developer on my iMac, and I can access it over my personal Wi-Fi network.
I also wanted to do the same sort of thing using PHP. I decided I would use cURL. Here is the code I used in a file called “weather.php”:
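The original listing is not reproduced here, so what follows is only a minimal sketch of what such a file could look like, not the exact code. The `fetch_url` helper and the empty `WEATHER_URL` placeholder are hypothetical names; fill in the feed URI you are actually calling. Note that it sets no request headers of its own.

```php
<?php
// Hypothetical placeholder: set this to the weather feed URI you want to call.
const WEATHER_URL = '';

/**
 * Fetch a URL with cURL and return the response body as a string,
 * or false if the request fails.
 */
function fetch_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the body from curl_exec() instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // No custom request headers are set here.
    return curl_exec($ch);
}

// Only make the request once a URI has been filled in above.
if (WEATHER_URL !== '') {
    echo fetch_url(WEATHER_URL);
}
```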
Notice that the URL is the same one as I’m using in the ColdFusion example.
So what happens if I call this “weather.php” page directly in my Safari web browser?
Well, we get a page like the one displayed above. Bummer! This is running on a PHP MAMP server on the same iMac as the ColdFusion server is. And trust me, the ColdFusion server is not calling this URL with any special permissions!
This led me to suspect that the HTTP headers sent by the ColdFusion server were somehow different from those sent by the PHP server using cURL.
But how to figure out what ColdFusion is doing differently than PHP? Create a new PHP page to call instead of calling the xml file…
HTTP Header Test Page
I was going to create a PHP page that would look at its HTTP header values and output them to the page so that I could see them!
Here it is:
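The original listing is not reproduced here. As a rough sketch of one way to build such a page, the code below reads the request headers out of `$_SERVER`, where PHP stores a header like ‘User-Agent’ under the key `HTTP_USER_AGENT`. The `request_headers` helper is a hypothetical name used for illustration.

```php
<?php
/**
 * Pull the HTTP request headers out of a $_SERVER-style array.
 * PHP exposes a header such as "User-Agent" as the key "HTTP_USER_AGENT".
 */
function request_headers(array $server)
{
    $headers = array();
    foreach ($server as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            // HTTP_USER_AGENT -> "user agent" -> "User Agent" -> "User-Agent"
            $name = str_replace('_', ' ', strtolower(substr($key, 5)));
            $name = str_replace(' ', '-', ucwords($name));
            $headers[$name] = $value;
        }
    }
    return $headers;
}

// Serve plain text so the header lines are easy to read.
if (!headers_sent()) {
    header('Content-Type: text/plain');
}

// Print one "Name: value" line per header the client sent.
foreach (request_headers($_SERVER) as $name => $value) {
    echo $name . ': ' . $value . "\n";
}
```

Pointing either weather.php or weather.cfm at a page like this shows exactly which headers each one sends.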
Now, if we changed weather.php to point to this URL, what do we get?
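Note that the ‘1’ at the bottom is an artifact of cURL (unless you set the CURLOPT_RETURNTRANSFER option to true). Without that option, curl_exec() prints the response itself and returns true, and PHP displays true as ‘1’ if that return value is echoed.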
What about doing the same thing with weather.cfm ?
There is definitely a difference between the two. Both have the same value for “Host”. Not much else is the same! It could be that the PHP request has HTTP headers that the server does not like… But I’m going with the assumption that the PHP request is MISSING one or more HTTP headers that the web server (w1.weather.gov) is expecting. So let’s modify our weather.php file:
Notice above how I added a new block of code (lines 8 through 12). This adds three headers to our HTTP request: ‘User-Agent’, ‘Connection’, and ‘Accept-Encoding’. I saved my changes and refreshed the page.
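For reference, here is a hedged sketch of how those three headers can be set with cURL’s CURLOPT_HTTPHEADER option, which takes an array of “Name: value” strings. It is not the original listing: the helper names are hypothetical, ‘ColdFusion’ is the User-Agent value from the test above, and the ‘Connection’ and ‘Accept-Encoding’ values are assumptions chosen for illustration.

```php
<?php
/**
 * Build the "Name: value" strings that CURLOPT_HTTPHEADER expects.
 */
function build_header_lines(array $headers)
{
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

/**
 * Fetch a URL, sending the given request headers along with it.
 * Returns the response body, or false if the request fails.
 */
function fetch_url_with_headers($url, array $headers)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, build_header_lines($headers));
    return curl_exec($ch);
}

// 'ColdFusion' is the User-Agent seen in the header test; the other two
// values are assumptions for illustration only.
$headers = array(
    'User-Agent'      => 'ColdFusion',
    'Connection'      => 'close',
    'Accept-Encoding' => 'identity',
);
```

To use it, call `echo fetch_url_with_headers($url, $headers);` where `$url` holds the feed URI. Removing an entry from `$headers` is an easy way to test which header the server actually needs.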
BINGO! IT WORKED!
But is the server looking for all three of these headers?
I remove ‘Accept-Encoding’. I refresh the browser. It still works.
I remove ‘Connection’. I refresh the browser. It still works.
Finally, I remove ‘User-Agent’. I refresh the browser. And of course it fails.
So, ‘User-Agent’ is the key. Currently, in our example, it is set to the value of ‘ColdFusion’, because that is the value I got when running the ColdFusion page. But actually (of course) it is a PHP page that is requesting the info.
I change the value of ‘User-Agent’ from ‘ColdFusion’ to ‘PHP’. I refresh the browser and it works.
I wonder: is w1.weather.gov looking for specific values for this header, or just checking that the ‘User-Agent’ header is present in the request?
So, I change the value of ‘User-Agent’ to ‘SugarBoogers’. I refresh the page and it works! This means that the server (at least in this case) is just checking to make sure that the ‘User-Agent’ HTTP header is present and has a value… but doesn’t care WHAT the value is (I’m sure that ‘SugarBoogers’ is not a common user agent to check for!).
Wrapping It Up
You might be able to “screen scrape” a web page without setting any custom HTTP headers. But I suspect that if you’re calling some sort of XML feed, JSON resource, or web service URI, there’s a good chance that you will need to set the ‘User-Agent’ HTTP header in order to get it to work.
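If ‘User-Agent’ is the only header you need, cURL also has a dedicated CURLOPT_USERAGENT option, so you don’t have to build a header list by hand. A minimal sketch, where the `fetch_with_user_agent` name and the default value ‘PHP’ are just illustrative choices:

```php
<?php
/**
 * Fetch a URL with cURL, identifying the client with a User-Agent string.
 * Returns the response body, or false if the request fails.
 */
function fetch_with_user_agent($url, $userAgent = 'PHP')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Sets the User-Agent request header for this request.
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    return curl_exec($ch);
}
```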
Any comments? Thoughts? Let me know.
- Info on setting HTTP Headers with cURL:
- Info on getting a PHP page’s HTTP Header values: