PowerShell’s Invoke-WebRequest is a powerful cmdlet that allows you to download, parse, and scrape web pages.

In a previous post, I outlined the options you have to download files with different Internet protocols. You use Invoke-WebRequest to download files from the web via HTTP and HTTPS. However, the cmdlet enables you to do much more than just download files; you can use it to analyze the contents of web pages and use the information in your scripts.

The HtmlWebResponseObject object

If you pass a URI to Invoke-WebRequest, it won’t just display the HTML code of the web page. Instead, it will show you formatted output of various properties of the corresponding web request. For example:

$WebResponse = Invoke-WebRequest "http://www.contoso.com"
$WebResponse

Storing HtmlWebResponseObject in a variable

Like most cmdlets, Invoke-WebRequest returns an object. If you execute the object’s GetType method, you will learn that the object is of the type HtmlWebResponseObject.

$WebResponse.GetType()

As usual, you can pipe the object to Get-Member to get an overview of the object’s properties:

PS C:\> $WebResponse| Get-Member


   TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition                                                                 
----              ---------- ----------                                                                 
Equals            Method     bool Equals(System.Object obj)                                             
GetHashCode       Method     int GetHashCode()                                                          
GetType           Method     type GetType()                                                             
ToString          Method     string ToString()                                                          
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {get;}
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}                             
Content           Property   string Content {get;}                                                      
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}            
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}        
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}     
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {get;}
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}      
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}                                    
RawContent        Property   string RawContent {get;}                                                   
RawContentLength  Property   long RawContentLength {get;}                                               
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}                             
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}    
StatusCode        Property   int StatusCode {get;}                                                      
StatusDescription Property   string StatusDescription {get;}

Parse an HTML page

Properties such as Links or ParsedHtml indicate that the main purpose of the cmdlet is to parse web pages. If you just want to access the plain content of the downloaded page, you can do so through the Content property:

$WebResponse.Content

There is also a RawContent property, which includes the HTTP header fields that the web server returned. Of course, you can also read only the HTTP header fields:

$WebResponse.Headers

Headers of a web request

It may also be useful to have easy access to the HTTP response status codes and their descriptions:

$WebResponse.StatusCode
$WebResponse.StatusDescription

The Links property is an array of objects that contain all the hyperlinks in the web page. The most interesting properties of a link object are innerHTML, innerText, outerHTML, and href.

The URL that the hyperlink points to is stored in href. To get a list of all links in the web page, you could use this command:

$WebResponse.Links | Select href

Displaying a web page’s links

outerHTML refers to the entire link as it appears together with the <a> tag: <a href="http://contoso.com">Contoso</a>. Of course, other elements can appear here, such as additional attributes of the <a> element or additional HTML elements after the start tag (<a>), such as image tags. In contrast, the innerHTML property only stores the content between the start tag and the end tag (</a>) together with enclosed additional HTML elements.

The innerText property strips all HTML code from the innerHTML property. You can use this property to read the anchor text of a hyperlink. However, if additional HTML elements exist inside the <a> element, you will get the text between those tags as well.

Note that the Link object also has an outerText property, but its contents will always be identical to the innerText property if you read a web page. The difference between outerText and innerText only matters if you write HTML code, which we don’t do here.
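Because href is a plain string, the usual PowerShell filtering tools apply to it. The following is a minimal sketch that keeps only links pointing to image files. The URLs are made up for illustration and stand in for the href values of a real response.

```powershell
# Sample href values standing in for the href property of $WebResponse.Links
# (made up for illustration)
$Hrefs = @(
    'https://contoso.com/about'
    'https://contoso.com/img/photo.JPG'
    'https://contoso.com/files/diagram.png'
    'https://contoso.com/download.zip'
)

# Keep only URLs that end in a common image extension.
# The -match operator ignores case by default, so .JPG is matched as well.
$ImageLinks = $Hrefs | Where-Object { $_ -match '\.(png|jpe?g|gif)$' }
$ImageLinks
```

Against a live page, the same Where-Object filter could be applied to the output of $WebResponse.Links | Select-Object -ExpandProperty href.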

The Images property can be handled in a similar way to the Links property. It does not, of course, contain the images themselves. Instead, it stores objects with properties that reflect the HTML code that refers to the images. The most interesting properties are width, height, alt, and src. If you know a little HTML, you will know how to deal with these attributes.

The following example downloads all images from a web page:

$WebResponse= Invoke-WebRequest https://mywebsite.com/page
ForEach ($Image in $WebResponse.Images)
{
    $FileName = Split-Path $Image.src -Leaf
    Invoke-WebRequest $Image.src -OutFile $FileName
}

$WebResponse.Images stores an array of image objects from where we extract the src attribute of the <img> element, which refers to the location of the image. With the help of the Split-Path cmdlet, we get the file name from the URL, which we use to store the image in the current folder.
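The loop above assumes that every src attribute holds a complete URL. Many pages use relative paths instead, and those would have to be resolved against the page address first. Here is a minimal sketch of that step, using the .NET System.Uri class. The src values are invented for illustration.

```powershell
# Page address from the example above and some made-up src values
$PageUri = [System.Uri]'https://mywebsite.com/page'
$Sources = '/images/logo.png', 'photo.jpg', 'https://cdn.example.com/banner.gif'

$Resolved = foreach ($Src in $Sources) {
    # The (Uri, string) constructor resolves a relative reference against
    # the base URI and leaves an absolute URL unchanged
    $ImageUri = New-Object System.Uri -ArgumentList $PageUri, $Src
    [pscustomobject]@{
        Url      = $ImageUri.AbsoluteUri
        FileName = Split-Path $ImageUri.AbsolutePath -Leaf
    }
}
$Resolved
```

In the download loop, the Url value would then be passed to Invoke-WebRequest and the FileName value to -OutFile.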

The properties that you see when you pipe an HtmlWebResponseObject object to Get-Member are those that you need most often when you have to parse an HTML page. If you are looking for other HTML elements, you can use the AllElements and ParsedHtml properties.

AllElements (you guessed it already) contains all the HTML elements that the page contains:

$WebResponse.AllElements

Of course, this also includes <a> and <img> elements, which means that you can also access them through the AllElements property. For instance, the command below, which displays all the links in a web page, is a slightly more long-winded alternative to $WebResponse.Links:

$WebResponse.AllElements | Where {$_.TagName -eq "a"}

ParsedHtml gives you access to the Document Object Model (DOM) of the web page. One difference from AllElements is that ParsedHtml also includes empty attributes of HTML elements. More interesting is that you can easily retrieve additional information about the web page. For example, the following command tells you when the page was last modified:

$WebResponse.ParsedHtml.IHTMLDocument2_lastModified

Determining when a web page was last modified

Submit an HTML form

Invoke-WebRequest also allows you to fill out form fields. Many websites use the HTTP method GET for forms, in which case you simply have to submit a URL that contains the form field entries. If you use a web browser to submit a form, you usually see how the URL is constructed. For instance, the next command searches for PowerShell on 4sysops:

Invoke-WebRequest https://4sysops.com/index.php?s=powershell
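If a search term contains spaces or other special characters, it has to be URL-encoded before it goes into the query string. The sketch below shows one way to build such a URL from a hash table of form fields. New-QueryUri is a hypothetical helper written for this example, not a built-in cmdlet, and the search term is made up.

```powershell
# Hypothetical helper: builds a GET URL from a base address and form fields
function New-QueryUri {
    param(
        [Parameter(Mandatory)] [string] $BaseUri,
        [Parameter(Mandatory)] [System.Collections.IDictionary] $Fields
    )
    $Pairs = foreach ($Key in $Fields.Keys) {
        # EscapeDataString percent-encodes spaces and reserved characters
        $Name  = [System.Uri]::EscapeDataString([string]$Key)
        $Value = [System.Uri]::EscapeDataString([string]$Fields[$Key])
        "$Name=$Value"
    }
    '{0}?{1}' -f $BaseUri, ($Pairs -join '&')
}

$SearchUri = New-QueryUri -BaseUri 'https://4sysops.com/index.php' -Fields @{ s = 'powershell remoting' }
$SearchUri
```

The resulting string can then be passed to Invoke-WebRequest with the -Uri parameter.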

If the website uses the POST method, things get a bit more complicated. The first thing you have to do is find out which method is used by displaying the form objects:

$WebResponse = Invoke-WebRequest https://twitter.com
$WebResponse.Forms

Displaying the forms in a web page

A web page sometimes has multiple forms using different methods. Usually you recognize the form you need by inspecting the Fields column. If the column is cut off, you can display all the form fields with this command:

$WebResponse.Forms.Fields

Let’s have a look at a more concrete example. Our goal is to scrape the country code of a particular IP address from a Whois website. We first have to find out how the form field is structured. Because we are working on the PowerShell console, it is okay to use the alias of Invoke-WebRequest:

(wget https://who.is).forms

Determining the form field of a Whois website

We see that the website uses the POST method, that the URL to be called to process the query is https://who.is/domains/search, and that two form fields are required. The default value of the search_type field is “Whois,” and the query field is most likely the field for the IP address. We are now ready to scrape the country code of the IP address from the result page:

$Fields = @{"search_type" = "Whois";"query" = "134.170.185.46"}
$WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
$Pre = $WebResponse.AllElements | Where {$_.TagName -eq "pre"}
If ($Pre -match "country:\s+(\w{2})")
{
    Write-Host "Country code:" $Matches[1]
}

Update: The example no longer works because the web page now uses a different form field. Define the $Fields variable like this instead:

$Fields = @{"searchString" = "134.170.185.46"}

In the first line of the original example, we define a hash table that contains the names of our two form fields and the values we want to submit. In line 2, we send the query and store the resulting page in a variable. The web page returns the result within a <pre> element, and we extract its content in the next line.

We then use the -match operator with a regular expression to search for the country code. “\s+” matches one or more whitespace characters, and “\w{2}” is supposed to match the country code, which consists of two characters. The parentheses group the country code, which allows us to access the result through the automatic variable $Matches.
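To try the regular expression without depending on the Whois website, you can run it against a sample string. The text below is made up and only imitates what the content of the <pre> element might look like.

```powershell
# Made-up sample of a Whois result (an assumption, not real output)
$SampleText = @"
NetRange:       134.170.0.0 - 134.170.255.255
OrgName:        Example Org
Country:        US
"@

# -match is case-insensitive by default, so "country:" also finds "Country:"
if ($SampleText -match 'country:\s+(\w{2})') {
    # $Matches[0] holds the whole match, $Matches[1] the first group
    $CountryCode = $Matches[1]
    Write-Output "Country code: $CountryCode"
}
```

Note that -match only fills $Matches when its left operand is a single value. If a page contains several <pre> elements, the operator returns the matching elements instead, so each element's text would have to be tested on its own.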

44 Comments
  1. Kris 8 years ago

    Good stuff. Note that I got a message asking me to accept cookies every time I tried to do anything with the page contents (e.g. searching for a tag). Got round this by using -UseBasicParsing on Invoke-WebRequest which uses Powershell’s built in parser rather than Internet Explorer’s.

    I used this to build a proof of concept to download Dilbert strips from the archive – download a page, find the appropriate image tag, download that image, add 1 to the date and do the same. Obviously not using it to download en masse, probably get blocked for that but very pleased it worked 🙂

  2. Kris, thanks. I think I saw the cookie request only once. Maybe this is an IE setting? As to downloading en masse, you have no idea how many crawlers are out there and it is really hard to block them. Every minute or so another crawler hits 4sysops.

  3. Schorschi 7 years ago

    If not a webpage, but a file for download, how would you get the file information without actually downloading the file contents?  Web response method will actually pull the entire file, in effect downloading or reading the file in total when all that is desired is just the file information, like size of the file.

    • Author

      The file properties are stored in the filesystem on the host. Web servers usually don’t transmit this information. So if you want to read the file metadata without downloading the file, you need an API on the host that offers this data.

      If the remote host is a Windows machine you can use PowerShell remoting to read the file size:

      invoke-command -computername RemoteComputerName -scriptblock {(get-item c:\windows\notepad.exe).length}

      • I know this is a really old comment, but you could use Invoke-WebRequest with Head method. The Headers property contains basic file info like Content-Length, Content-Type, Date, Last-Modified, Server etc. So, using this approach, you could get the file size like this:

        (Invoke-WebRequest $downloadURI -Method Head).Headers."Content-Length"

        If the file size is in gigs, you could use PowerShell’s built-in math function like this to get the size:

        [Math]::Round((Invoke-WebRequest $downloadURI -Method Head).Headers."Content-Length"/1gb,2)

        Hope it can help someone in future.

  4. Caroline 7 years ago

    Thanks for the informative article on Invoke-WebRequest. Just what I was looking for.

  5. Oleg 7 years ago

    I need to download a file from https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html and modify it (for example, add some tags… bootstrapGridSystem.css). In PowerShell it looks like this:
    $results = irm -uri “https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html”
    $html = $results.ParsedHtml
    But how can I modify the object? Is it possible to modify it?
    For example, add <link href=”bootstrapGridSystem.css”>  and <link href=”foundationGridSystem”>
    Because this code:
    $linkBoot=$html.createElement(“link”)
    $linkBoot = “css/bootstrapGridSystem.css”
    $headTag=$html.getElementsByTagName(“head”)[0]
    $headTag.appendChild(“link”) didn’t modify the object $results.content?

  6. tejanagios 7 years ago

    HI,

    I am using your script and leveraging it to download image files from a list of URLs. The script loops through each URL, invokes a web request, and downloads the images from it. The problem that I am facing is that the images are downloaded in 320x240 by default, whereas on the actual site the image, when opened in a new tab and downloaded via right-click, gives me a 960x720 pixel file, which is what I am after.

    here is the script.

     

    $url = get-content “urls.txt”
    $j = $url.count
    for ($i= 0 ; $i -le $j ; $i++)
    {
        $WebResponse = Invoke-WebRequest -uri $url[$i]
        ForEach ($Image in $WebResponse.Images)
        {
            $FileName = Split-Path $Image.src -Leaf
            $d =  Invoke-WebRequest $Image.src
        }
    }

     

    • Author

      The problem is that the src attribute of the image tag only points to the image that you see on the web page. The URL of the image that is displayed when you click an image is in an a tag before the image tag. Thus, you have to retrieve all links in the web page (as explained in the article) and then get all URLs that point to images. Those URLs all have image extensions such as .jpg or .png. You could work with a regular expression to sort out these URLs.

  7. teja 7 years ago

    Thank you. After looping through all the URLs, I’ve got the final output.

    Here is the working code, albeit it can be improved:

    # Collect all links on the start page whose URL contains "part"
    $source = Invoke-WebRequest -Uri "<enter URL here>" |
        Select-Object -ExpandProperty Links | Select-Object -ExpandProperty href | Select-String "part"
    $j = $source.Count
    for ($k = 0; $k -lt $j; $k++)
    {
        #Write-Host $source[$k].Line
        # On each linked page, collect the links that point to PNG files
        $links = Invoke-WebRequest -Uri $source[$k].Line |
            Select-Object -ExpandProperty Links | Select-Object -ExpandProperty href | Select-String ".PNG"
        foreach ($link in $links)
        {
            $filename = Split-Path $link.Line -Leaf
            Invoke-WebRequest -Uri $link.Line -OutFile "C:\users\admin\Desktop\images\$k$filename"
        }
    }

     

    • Author

      Thanks for sharing. If the web page not only contains links to PNGs but also to JPGs, you could use this: 

      select-string -pattern ".+png|.+jpg"

  8. ron 6 years ago

    Does this still work?  Perhaps the website has changed its search/query methods. I am not seeing a <pre> tag.

    • Author

      The code no longer works because they changed the form field. This should work now:

      $Fields = @{"searchString" = "134.170.185.46"}
      $WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
      $Pre = $WebResponse.AllElements | Where {$_.TagName -eq "pre"}
      If ($Pre -match "Country:\s+(\w{2})")
      {
          Write-Host "Country code:" $Matches[1]
      }

  9. Steve Giovanni 6 years ago

    I was trying to use this on a website I visit to see a list they post there weekly.  There is no RSS feed or anything so you have to manually go to the site. I thought it would be fun to automate scraping the weekly options and emailing them to myself, which is where your article came in very handy, thank you!

    The problem is that while I can pull back the URL, it looks like they are not embedding the stuff I want in the actual page; they are pulling it in from a frame (I think).

    I’m decent at basic PowerShell scripting but haven’t looked at HTML since the late 90s. Any ideas?  I also tried RawContent and AllElements to no avail.

    • Author

      If it is an iframe, you can just load the iframe’s URL. In most browsers, you can right-click the element in the web page and then click “Inspect.” You should then be able to see the URL where the content that interests you is coming from.

  10. Steve Giovanni 6 years ago

    I tried to Inspect Element and this is what I see:

    <div style=”left: 496px; width: 475px; position: absolute; top: 98px;” class=”txtNew” id=”WRchTxtd-17bd” data-reactid=”.0.$SITE_ROOT.$desktop_siteRoot.$PAGES_CONTAINER.1.1.$SITE_PAGES.$c1a73_DESKTOP.1.$WRchTxtd-17bd”><p class=”font_8″ style=”font-size:28px; text-align:center;”>

    Not much help there from what I can discern so I guess my question is:  is there a way to tell PowerShell to just download/render the page as a browser would then I can parse it from there?

    • Author

      I guess the div box is filled by JavaScript. Where should PowerShell render the page? In the console? And why would that help with parsing? PowerShell creates objects of the HTML elements in the web page. However, PowerShell doesn’t understand JavaScript. I suppose you are better off with a web scraping tool that has a GUI.

  11. Steve Giovanni 6 years ago

    Michael thank you for your reply.  I was able to get a bit further, but it still isn’t working correctly for some reason.  Would you mind taking a look and letting me know if you see what I’m doing wrong?

    $site = Invoke-WebRequest -Uri “https://www.localfarefarmbagsouth.com/about_us”
    ($site.ParsedHtml.getElementsByTagName(‘p’) | Where {$_.className -eq ‘font_8’}).innerText

    • Author

      Try this:

      $site = Invoke-WebRequest -Uri "https://www.localfarefarmbagsouth.com/about_us"
      $site.content

      You will see there is no HTML between the body tags. This is all JavaScript. You need a scraping tool with an engine that can execute JavaScript.

  12. Steve Giovanni 6 years ago

    Understood, thanks!

  13. coltae 6 years ago

    Thank you so much for this page! It has helped me work on something I have been struggling with for days. Now,

    $WebResponse = Invoke-WebRequest https://www.ebay.com/posters

    $WebResponse.AllElements | ? { $_.Class -eq ‘price’ } | select innerText

    This does work and give me a list of, 10 items (lets say) but is there a way I can store one of the specific values, like always the 3rd in the list that populates, into a variable/array?

    THANK YOU 🙂

  14. Nick 6 years ago

    Excellent article!  I’m trying to pull a specific piece of text from a REST query and use it as a qualifier in a where clause for another PowerShell query… but I’ve hit a wall.  Any thoughts?

    $site = Invoke-WebRequest “https://services1.arcgis.com/bqfNVPUK3HOnCFmA/arcgis/rest/services/Traffic_Accident_Locations/FeatureServer/0/query?where=1=1&outFields=AccidentNumber&returnGeometry=false&orderByFields=AccidentNumber+DESC&resultRecordCount=1”

    $maxAccidentID = $site.AllElements | ? {$_.Class -eq ‘ftrTable’} | select td

    $url = “http://policeview.johnscreekga.gov/resource/fke7-a2vb.geojson?$where=”+ $maxAccidentID +”

    $filePath = ‘C:\temp\PoliceAccidents.geojson’

    Invoke-RestMethod -Uri $url -Method GET -OutFile $filePath

  15. Lakshmi Prabha 6 years ago

    Hi, I am new to PowerShell scripting, and I need a PowerShell script that will display the entire contents of a web page or the last 100 lines of that web page. Please help.

    • Author

      Try this: (wget google.com).content

      Displaying the last 100 lines is trickier because there are many different ways to start a new line in HTML.

      The other question is if it makes sense to “display” a web page with PowerShell because you usually want to parse a web page with a programming language.

  16. Premji Nitwal 6 years ago

    Hi, I need to transfer pcloud files to uptobox. But the pcloud download links change for each IP address, which is why the links don’t work with the remote URL upload of any file hosting site.

    Can PowerShell help me in this matter? If yes, please tell me how to do it.

    • Premji Nitwal 6 years ago

      Please reply to this comment.

      Thank You.

      • Author

        I am not familiar with pcloud and uptobox. Perhaps the download links change according to a certain pattern that you can use in your script? Depending on the number of users your organization has, you might consider instructing your users how they can download the files to a local drive and then upload them to the new provider.

  17. Premji Nitwal 6 years ago

    I am user at pcloud, so I want to transfer my files to uptobox.com.

    Pcloud has hotlinking protection, Do you know how to bypass it?

    My friend told me use fiddler & use webrequest. But I am new to this software. So can you help me please.

    Thank you.

    • Author

      I can’t help you because I don’t know Pcloud. I recommend that you contact their support and ask how you can download your data to your PC.

  18. Premji Nitwal 6 years ago

    HELLO, I can download the data from PC, it will take huge time to upload from PC rather than REMOTE URL UPLOAD.

    1)please create a account (free one) at pcloud & uptobox. (you can use fake email id for register)

    2) upload a file in pcloud & click download , So it will give you download link, I am asking you to parse the download link.

    3)So it will give you get actual link to use in remote url upload of Uptobox.

    Did you understand my problem?

    Thank You.

  19. Author

    I can’t help you with that, and I doubt that what you are planning is possible.

  20. niranjan 5 years ago

    hi

    I am new to this, but I have a task to fetch each property from a web service/WSDL (which is in XML format).

    so please help me

  21. GT 5 years ago

    I am trying to create something similar to check for URL reputation using sitereview.bluecoat.com.

    How can you create a powershell to pull list of URL and check rating and return results?

  22. Wayne 4 years ago

    Hi,
    Thanks for the great information. I am monitoring a WEB site and one of the URLs opens a popup window. Is there a way for me to confirm that the popup has the required data. I can see the popup but I can’t find a way to get any of the elements that are in the popup.
    Thanks

    • Author

      Wayne, I guess the popup is JavaScript. You can’t parse this easily with PowerShell. You have to look at the source code of the web page and analyze the JavaScript code. The popup’s content might come from another URL. Maybe you can load this URL and then parse the content with PowerShell.

  23. Paul Gordon 4 years ago

    Back to the cookie popup problem… – if I use the -usebasicparsing switch on invoke-webrequest, the returned object does not even have the allelements or parsedhtml properties, which means I’ve really only got content to work with, and being a string isn’t making that easy…  if I omit the -usebasicparsing switch, I get those additional properties on the object, but absolutely every attempt to use them causes a cookie popup to appear, and no matter what I respond the command never seems to complete… – both ISE and PS console hang & I have to break out of it…

    Anyone have any idea why this happens, and how to stop it or work around it?

    FYI, I’m trying to obtain the full name of Exchange update rollups from https://docs.microsoft.com/en-us/exchange/new-features/build-numbers-and-release-dates?view=exchserver-2019#exchange-server-2010   where I have the long format build number, & I want to get the full product name…

    Cheers!

    Paul G.

     

  24. Panzerbjrn (Rank 2) 3 years ago

    Thanks for the useful article.

    How would you go about getting the content you receive into a variable, and then working with it from there?

    Specifically, I'm trying to get our Azure Billing Report, which is a CSV file, and then write it to Azure blob storage. As I am trying to do this from an Azure function, I don't really have the luxury of writing to a file first.

    When I do

    Invoke-WebRequest -Outfile Billing.csv

     everything is fine, and I get a nice CSV.

    When I do

    $BlobCSV = Invoke-WebRequest

    all I get is one big string of text.

    Any ideas?

    • Author

      I suppose you also added a URL to your command. What do you see if you just execute $BlobCSV? And what do you get with this:

      $BlobCSV.GetType()

      • Panzerbjrn (Rank 2) 3 years ago

        Thanks for taking the time to reply 🙂
        Yes, I do have a url, it was late and I forgot to include it here…

        $BlobCSV = Invoke-WebRequest -Uri $url -Headers $AuthHeaders
        $BlobCSV.GetType()
        
        IsPublic IsSerial Name                                     BaseType
        -------- -------- ----                                     --------
        True     False    WebResponseObject                        System.Object
        
        $BlobCSV.RawContent.GetType()
        
        IsPublic IsSerial Name                                     BaseType
        -------- -------- ----                                     --------
        True     True     String                                   System.Object
        
        $BlobCSV.Content.GetType()
        
        IsPublic IsSerial Name                                     BaseType
        -------- -------- ----                                     --------
        True     True     Byte[]                                   System.Array

        I guess these are all the types you'd expect.

        I guess this isn't helped by the billing report having two lines with text before the CSV columns start on line 3…

        • Author

          As you can see, the variable doesn't contain just a string, but a WebResponseObject. This means you can use this object to extract all the data you need.

          • Panzerbjrn (Rank 2) 3 years ago

            Well, yes, but the content is just one long string, hence my question…

            $BlobCSV.Content might technically be an array, but [0] has everything…

            • Author

              No, it is not just a long string. You can use object properties to parse the result. I gave a few examples in the article.

  25. M 3 years ago

    Hi, I'm trying to make a webrequest in powershell that contains a base64 and it should return a base64 in the response, but instead returns an xop. Through SOAPUI I solved by disabling the MTOM. Is there a parameter to force the base64 response?

    Thanks


© 4sysops 2006 - 2023
