In a previous post, I outlined the options you have to download files with different Internet protocols. You can use Invoke-WebRequest to download files from the web via HTTP and HTTPS. However, the cmdlet enables you to do much more than just download files; you can use it to analyze the contents of web pages and use the information in your scripts.
The HtmlWebResponseObject object
If you pass a URI to Invoke-WebRequest, it won’t just display the HTML code of the web page. Instead, it will show you formatted output of various properties of the corresponding web request. For example:
$WebResponse = Invoke-WebRequest "http://www.contoso.com"
$WebResponse
Storing HtmlWebResponseObject in a variable
Like most cmdlets, Invoke-WebRequest returns an object. If you execute the object’s GetType method, you will learn that the object is of the type HtmlWebResponseObject.
$WebResponse.GetType()
As usual, you can pipe the object to Get-Member to get an overview of the object’s properties:
PS C:\> $WebResponse | Get-Member

   TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition
----              ---------- ----------
Equals            Method     bool Equals(System.Object obj)
GetHashCode       Method     int GetHashCode()
GetType           Method     type GetType()
ToString          Method     string ToString()
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {get;}
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}
Content           Property   string Content {get;}
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {get;}
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}
RawContent        Property   string RawContent {get;}
RawContentLength  Property   long RawContentLength {get;}
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}
StatusCode        Property   int StatusCode {get;}
StatusDescription Property   string StatusDescription {get;}
Parse an HTML page
Properties such as Links or ParsedHtml indicate that the main purpose of the cmdlet is to parse web pages. If you just want to access the plain content of the downloaded page, you can do so through the Content property:
$WebResponse.Content
There is also a RawContent property, which includes the HTTP header fields that the web server returned. Of course, you can also read only the HTTP header fields:
$WebResponse.Headers
Headers of a web request
It may also be useful to have easy access to the HTTP response status codes and their descriptions:
$WebResponse.StatusCode
$WebResponse.StatusDescription
The Links property is an array of objects that contain all the hyperlinks in the web page. The most interesting properties of a link object are innerHTML, innerText, outerHTML, and href.
The URL that the hyperlink points to is stored in href. To get a list of all links in the web page, you could use this command:
$WebResponse.Links | Select href
Displaying a web page’s links
outerHTML refers to the entire link as it appears together with the <a> tag: <a href="http://contoso.com">Contoso</a>. Of course, other elements can appear here, such as additional attributes of the <a> element or additional HTML elements after the start tag (<a>), such as image tags. In contrast, the innerHTML property only stores the content between the start tag and the end tag (</a>), together with any enclosed HTML elements.
The innerText property strips all HTML code from the innerHTML property. You can use this property to read the anchor text of a hyperlink. However, if additional HTML elements exist inside the <a> element, you will get the text between those tags as well.
Note that the link object also has an outerText property, but its contents will always be identical to the innerText property when you read a web page. The difference between outerText and innerText only matters when you write HTML code, which we don't do here.
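To see the difference between these properties, you can display them side by side; the output, of course, depends on the page you have loaded:
$WebResponse.Links | Select-Object outerHTML, innerHTML, innerText, href | Format-List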
The Images property can be handled in a similar way as the Links property. It, of course, does not contain the images themselves. Instead, it stores objects with properties that contain the HTML code that refers to the images. The most interesting properties are width, height, alt, and src. If you know a little HTML, you will know how to deal with these attributes.
The following example downloads all images from a web page:
$WebResponse = Invoke-WebRequest https://mywebsite.com/page
ForEach ($Image in $WebResponse.Images) {
    $FileName = Split-Path $Image.src -Leaf
    Invoke-WebRequest $Image.src -OutFile $FileName
}
$WebResponse.Images stores an array of image objects from which we extract the src attribute of the <img> element, which refers to the location of the image. With the help of the Split-Path cmdlet, we get the file name from the URL, which we use to store the image in the current folder.
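Incidentally, Split-Path also works on URLs because it simply treats forward slashes as path separators (the URL here is made up):
Split-Path "https://mywebsite.com/images/logo.png" -Leaf   # Returns logo.png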
The properties that you see when you pipe an HtmlWebResponseObject object to Get-Member are those that you need most often when you have to parse an HTML page. If you are looking for other HTML elements, you can use the AllElements and ParsedHTML properties.
AllElements (you guessed it already) contains all the HTML elements that the page contains:
$WebResponse.AllElements
Of course, this also includes <a> and <img> elements, which means that you can access them through the AllElements property as well. For instance, the command below, which displays all the links in a web page, is a somewhat more long-winded alternative to $WebResponse.Links:
$WebResponse.AllElements | Where {$_.TagName -eq "a"}
ParsedHTML gives you access to the Document Object Model (DOM) of the web page. One difference from AllElements is that ParsedHTML also includes empty attributes of HTML elements. More interesting is that you can easily retrieve additional information about the web page. For example, the following command tells you when the page was last modified:
$WebResponse.ParsedHtml.IHTMLDocument2_lastModified
Determining when a web page was last modified
Submit an HTML form
Invoke-WebRequest also allows you to fill out form fields. Many websites use the HTTP method GET for forms, in which case you simply have to submit a URL that contains the form field entries. If you use a web browser to submit a form, you usually see how the URL is constructed. For instance, the next command searches for PowerShell on 4sysops:
Invoke-WebRequest https://4sysops.com/index.php?s=powershell
If the website uses the POST method, things get a bit more complicated. The first thing you have to do is find out which method is used by displaying the forms objects:
$WebResponse = Invoke-WebRequest https://twitter.com
$WebResponse.Forms
Displaying the forms in a web page
A web page sometimes has multiple forms using different methods. Usually you recognize the form you need by inspecting the Fields column. If the column is cut off, you can display all the form fields with this command:
$WebResponse.Forms.Fields
Let's have a look at a more concrete example. Our goal is to scrape the country code of a particular IP address from a Whois website. We first have to find out how the form is structured. Because we are working on the PowerShell console, it is okay to use the alias of Invoke-WebRequest:
(wget https://who.is).forms
Determining the form field of a Whois website
We see that the website uses the POST method, that the URL to be called to process the query is https://who.is/domains/search, and that two form fields are required. The default value of the search_type field is "Whois," and the query field is most likely the field for the IP address. We are now ready to scrape the country code of the IP address from the result page:
$Fields = @{"search_type" = "Whois"; "query" = "134.170.185.46"}
$WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
$Pre = $WebResponse.AllElements | Where {$_.TagName -eq "pre"}
If ($Pre -match "country:\s+(\w{2})") {
    Write-Host "Country code:" $Matches[1]
}
Update: The example no longer works because the web page now uses a different form field. Use this field definition instead:
$Fields = @{"searchString" = "134.170.185.46"}
In the first line, we define a hash table that contains the names of our two form fields and the values we want to submit. In line 2, we store the response to our query in a variable. The web page returns the result within a <pre> element, and we extract its content in the next line.
We then use the -match operator with a regular expression to search for the country code. "\s+" matches one or more whitespace characters, and "\w{2}" matches the country code, which consists of two characters. The parentheses capture the country code, which allows us to access the result through the automatic variable $Matches.
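If you want to see the mechanism in isolation, you can test the pattern against a simple made-up string:
"country:   US" -match "country:\s+(\w{2})"   # Returns True and fills $Matches
$Matches[1]                                   # Displays "US"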
Good stuff. Note that I got a message asking me to accept cookies every time I tried to do anything with the page contents (e.g., searching for a tag). I got around this by using -UseBasicParsing on Invoke-WebRequest, which uses PowerShell's built-in parser rather than Internet Explorer's.
I used this to build a proof of concept to download Dilbert strips from the archive: download a page, find the appropriate image tag, download that image, add 1 to the date, and repeat. Obviously, I'm not using it to download en masse (I'd probably get blocked for that), but I'm very pleased it worked 🙂
Kris, thanks. I think I saw the cookie request only once. Maybe this is an IE setting? As to downloading en masse, you have no idea how many crawlers are out there and it is really hard to block them. Every minute or so another crawler hits 4sysops.
If it is not a web page but a file for download, how would you get the file information without actually downloading the file contents? The web response method will actually pull the entire file, in effect downloading or reading the file in total, when all that is desired is just the file information, like the size of the file.
The file properties are stored in the filesystem on the host. Web servers usually don’t transmit this information. So if you want to read the file metadata without downloading the file, you need an API on the host that offers this data.
If the remote host is a Windows machine you can use PowerShell remoting to read the file size:
invoke-command -computername RemoteComputerName -scriptblock {(get-item c:\windows\notepad.exe).length}
I know this is a really old comment, but you could use Invoke-WebRequest with the Head method. The Headers property contains basic file info, like Content-Length, Content-Type, Date, Last-Modified, Server, etc. So, using this approach, you could get the file size like this:
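Something along these lines (a minimal sketch; the URL is just a placeholder):
# A HEAD request returns only the response headers; the file body is never transferred
$Response = Invoke-WebRequest -Uri "https://example.com/bigfile.zip" -Method Head
[long]$Response.Headers["Content-Length"]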
If the file size is in gigs, you could use PowerShell’s built-in math function like this to get the size:
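For example (assuming $Response from the previous snippet):
# Convert the byte count to gigabytes, rounded to two decimal places
[math]::Round([long]$Response.Headers["Content-Length"] / 1GB, 2)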
Hope it can help someone in the future.
Thanks for the informative article on Invoke-WebRequest. Just what I was looking for.
I need to download a file from https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html and modify it (for example, add some tags, such as bootstrapGridSystem.css). In PowerShell, it looks like this:
$results = irm -uri "https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html"
$html = $results.ParsedHtml
But how can I modify the object? Is it possible to modify it?
For example, add <link href="bootstrapGridSystem.css"> and <link href="foundationGridSystem">
Because this code:
$linkBoot = $html.createElement("link")
$linkBoot = "css/bootstrapGridSystem.css"
$headTag = $html.getElementsByTagName("head")[0]
$headTag.appendChild("link")
didn't modify the object $results.Content?
Hi,
I am using your script and leveraging it to download image files from a list of URLs. The script loops through each URL, invokes a web request, and downloads the images from it. The problem I am facing is that the images are downloaded at 320x240 by default, whereas on the actual site, opening the image in a new tab and right-clicking to download gives me a 960x720 pixel file, which is what I am after.
Here is the script:
$url = Get-Content "urls.txt"
$j = $url.Count
for ($i = 0; $i -lt $j; $i++)
{
    $WebResponse = Invoke-WebRequest -Uri $url[$i]
    ForEach ($Image in $WebResponse.Images)
    {
        $FileName = Split-Path $Image.src -Leaf
        $d = Invoke-WebRequest $Image.src
    }
}
The problem is that the src attribute of the image tag only points to the image that you see on the web page. The URL of the image that is displayed when you click an image is in an <a> tag before the <img> tag. Thus, you have to retrieve all links in the web page (as explained in the article) and then get all URLs that point to images. Those URLs all have image extensions, such as .jpg or .png. You could work with a regular expression to sort out these URLs.
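For example, a rough sketch ($WebResponse holds a page you already downloaded; the extension pattern is an assumption about the site):
# Collect link targets that look like full-size images and download each one
$ImageLinks = $WebResponse.Links.href | Where-Object { $_ -match "\.(jpe?g|png)$" }
ForEach ($Link in $ImageLinks) {
    Invoke-WebRequest $Link -OutFile (Split-Path $Link -Leaf)
}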
Thank you. After looping through all the URLs, I've got the final output.
Here is the working code, albeit it can be improved:
$source = Invoke-WebRequest -uri "<enter URL here>" `
| Select-Object -ExpandProperty links | Select-Object -ExpandProperty href | Select-String "part"
$j = $source.Count
for ($k = 0; $k -lt $j; $k++)
{
    #write-host $source[$k].Line
    $links = Invoke-WebRequest -uri $source[$k].Line `
    | Select-Object -ExpandProperty links | Select-Object -ExpandProperty href | Select-String ".PNG"
    foreach ($link in $links)
    {
        $filename = Split-Path $link.Line -Leaf
        Invoke-WebRequest -uri $link.Line -OutFile "C:\users\admin\Desktop\images\$k$filename"
    }
}
Thanks for sharing. If the web page not only contains links to PNGs but also to JPGs, you could use this:
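Something like this (a sketch; Select-String accepts a regular expression, so one pattern covers both extensions and is case-insensitive by default):
$links = Invoke-WebRequest -uri $source[$k].Line |
    Select-Object -ExpandProperty links |
    Select-Object -ExpandProperty href |
    Select-String "\.(png|jpe?g)$"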
Does this still work? Perhaps the website has changed its search/query methods. I am not seeing a <pre> tag.
The code no longer works because they changed the form field. This should work now:
$Fields = @{"searchString" = "134.170.185.46"}
$WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
$Pre = $WebResponse.AllElements | Where {$_.TagName -eq "pre"}
If ($Pre -match "Country:\s+(\w{2})")
{
    Write-Host "Country code:" $Matches[1]
}
I was trying to use this on a website I visit to see a list they post there weekly. There is no RSS feed or anything so you have to manually go to the site. I thought it would be fun to automate scraping the weekly options and emailing them to myself, which is where your article came in very handy, thank you!
The problem is that, while I can pull back the URL, it looks like they are embedding the stuff I want not in the actual page but pulling it in from a frame (I think).
I’m decent at basic PowerShell scripting but haven’t looked at HTML since the late 90s. Any ideas? I also tried RawContent and AllElements to no avail.
If it is an iframe, you can just load the iframe’s URL. In most browsers, you can right-click the element in the web page and then click “Inspect.” You should then be able to see the URL where the content that interests you is coming from.
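For example (with a placeholder URL that you would copy from the frame's src attribute):
$Frame = Invoke-WebRequest "https://example.com/embedded-content.html"
$Frame.Content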
I tried to Inspect Element and this is what I see:
<div style="left: 496px; width: 475px; position: absolute; top: 98px;" class="txtNew" id="WRchTxtd-17bd" data-reactid=".0.$SITE_ROOT.$desktop_siteRoot.$PAGES_CONTAINER.1.1.$SITE_PAGES.$c1a73_DESKTOP.1.$WRchTxtd-17bd"><p class="font_8" style="font-size:28px; text-align:center;">
Not much help there from what I can discern, so I guess my question is: is there a way to tell PowerShell to just download/render the page as a browser would, and then I can parse it from there?
I guess the div box is filled by JavaScript. Where should PowerShell render the page? In the console? And why would that help with parsing? PowerShell creates objects of the HTML elements in the web page. However, PowerShell doesn’t understand JavaScript. I suppose you are better off with a web scraping tool that has a GUI.
Michael thank you for your reply. I was able to get a bit further, but it still isn’t working correctly for some reason. Would you mind taking a look and letting me know if you see what I’m doing wrong?
$site = Invoke-WebRequest -Uri "https://www.localfarefarmbagsouth.com/about_us"
($site.ParsedHtml.getElementsByTagName('p') | Where {$_.className -eq 'font_8'}).innerText
Try this:
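A minimal sketch of what I mean (dumping the raw HTML the server returns):
(Invoke-WebRequest -Uri "https://www.localfarefarmbagsouth.com/about_us").Content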
You will see there is no HTML between the body tags. This is all JavaScript. You need a scraping tool with an engine that can execute JavaScript.
Understood, thanks!
Thank you so much for this page! It has helped me work on something I have been struggling with for days. Now,
$WebResponse = Invoke-WebRequest https://www.ebay.com/posters
$WebResponse.AllElements | ? { $_.Class -eq 'price' } | select innerText
This does work and gives me a list of, let's say, 10 items, but is there a way I can store one of the specific values, like always the third in the list that populates, into a variable/array?
THANK YOU 🙂
I am getting a “page not found” with your URL. Can you provide a public page for your example?
Excellent article! I’m trying to pull a specific piece of text from a REST query and use it as a qualifier in a where clause for another PowerShell query… but I’ve hit a wall. Any thoughts?
$site = Invoke-WebRequest "https://services1.arcgis.com/bqfNVPUK3HOnCFmA/arcgis/rest/services/Traffic_Accident_Locations/FeatureServer/0/query?where=1=1&outFields=AccidentNumber&returnGeometry=false&orderByFields=AccidentNumber+DESC&resultRecordCount=1"
$maxAccidentID = $site.AllElements | ? {$_.Class -eq 'ftrTable'} | select td
$url = "http://policeview.johnscreekga.gov/resource/fke7-a2vb.geojson?`$where=" + $maxAccidentID
$filePath = 'C:\temp\PoliceAccidents.geojson'
Invoke-RestMethod -Uri $url -Method GET -OutFile $filePath
Hi, I am new to PowerShell scripting, and I need a PowerShell script that will display the entire contents of a web page or the last 100 lines of that web page. Please help.
Try this: (wget google.com).content
Displaying the last 100 lines is trickier because there are many different ways to start a new line in HTML.
The other question is whether it makes sense to "display" a web page with PowerShell, because you usually want to parse a web page with a programming language.
Hi, I need to transfer pcloud files to uptobox, but pcloud download links change for each IP, which is why the links don't work in the remote URL upload of any file hosting site.
Can PowerShell help me in this matter? If yes, please tell me how to do it.
Please reply to this comment.
Thank you.
I am not familiar with pcloud and uptobox. Perhaps the download links change according to a certain pattern that you can use in your script? Depending on the number of users your organization has, you might consider instructing your users how they can download the files to a local drive and then upload them to the new provider.
I am a user of pcloud, and I want to transfer my files to uptobox.com.
Pcloud has hotlinking protection. Do you know how to bypass it?
My friend told me to use Fiddler and a web request, but I am new to this software. Can you help me, please?
Thank you.
I can't help you because I don't know Pcloud. I recommend that you contact their support and ask how you can download your data to your PC.
Hello, I can download the data to my PC, but it would take a huge amount of time to upload from the PC rather than using a remote URL upload.
1) Please create an account (a free one) at pcloud and uptobox. (You can use a fake email ID to register.)
2) Upload a file to pcloud and click download. It will give you a download link; I am asking you to parse the download link.
3) This will give you the actual link to use in the remote URL upload of Uptobox.
Do you understand my problem?
Thank you.
I can't help you with that, and I doubt that what you are planning is possible.
Hi, I am new to this, but I have a task to fetch each property from a web service/WSDL (which is in XML format). Please help me.
I am trying to create something similar to check URL reputation using sitereview.bluecoat.com.
How can you create a PowerShell script to pull a list of URLs, check their ratings, and return the results?
Hi,
Thanks for the great information. I am monitoring a website, and one of the URLs opens a popup window. Is there a way for me to confirm that the popup has the required data? I can see the popup, but I can't find a way to get any of the elements that are in the popup.
Thanks
Wayne, I guess the popup is JavaScript. You can’t parse this easily with PowerShell. You have to look at the source code of the web page and analyze the JavaScript code. The popup’s content might come from another URL. Maybe you can load this URL and then parse the content with PowerShell.
Back to the cookie popup problem: if I use the -UseBasicParsing switch on Invoke-WebRequest, the returned object does not even have the AllElements or ParsedHtml properties, which means I really only have Content to work with, and being a string, that isn't easy. If I omit the -UseBasicParsing switch, I get those additional properties on the object, but absolutely every attempt to use them causes a cookie popup to appear, and no matter how I respond, the command never seems to complete; both the ISE and the PowerShell console hang, and I have to break out of it.
Does anyone have any idea why this happens and how to stop it or work around it?
FYI, I'm trying to obtain the full name of Exchange update rollups from https://docs.microsoft.com/en-us/exchange/new-features/build-numbers-and-release-dates?view=exchserver-2019#exchange-server-2010, where I have the long-format build number and want to get the full product name…
Cheers!
Paul G.
Thanks for the useful article.
How would you go about getting the content you receive into a variable, and then working with it from there?
Specifically, I'm trying to get our Azure Billing Report, which is a CSV file, and then write it to Azure blob storage. As I am trying to do this from an Azure function, I don't really have the luxury of writing to a file first.
When I do
everything is fine, and I get a nice CSV.
When I do
all I get is one big string of text.
Any ideas?
I suppose you also added a URL to your command. What do you see if you just execute $BlobCSV? And what do you get with this:
Thanks for taking the time to reply 🙂
Yes, I do have a url, it was late and I forgot to include it here…
I guess these are all the types you'd expect.
I guess this isn't helped by the billing report having two lines with text before the CSV columns start on line 3…
As you can see, the variable doesn't contain just a string, but a WebResponseObject. This means you can use this object to extract all the data you need.
Well, yes, but the content is just one long string, hence my question…
$BlobCSV.Content might technically be an array, but [0] has everything…
No, it is not just a long string. You can use object properties to parse the result. I gave a few examples in the article.
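For example, a sketch that assumes, as you described, two text lines before the CSV header:
# Skip the two leading text lines, then parse the remainder as CSV
$Csv = $BlobCSV.Content -split "`r?`n" | Select-Object -Skip 2 | ConvertFrom-Csv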
Hi, I'm trying to make a web request in PowerShell that contains base64 data, and it should return base64 in the response, but instead it returns XOP. In SoapUI, I solved this by disabling MTOM. Is there a parameter to force a base64 response?
Thanks