PowerShell’s Invoke-WebRequest is a powerful cmdlet that allows you to download, parse, and scrape web pages.

Michael Pietroforte

Michael Pietroforte is the founder and editor of 4sysops. He is a Microsoft Most Valuable Professional (MVP) with more than 30 years of experience in IT management and system administration.

In a previous post, I outlined the options you have to download files with different Internet protocols. You can use Invoke-WebRequest to download files from the web via HTTP and HTTPS. However, the cmdlet enables you to do much more than just download files; you can use it to analyze the contents of web pages and use the information in your scripts.

The HtmlWebResponseObject object

If you pass a URI to Invoke-WebRequest, it won’t just display the HTML code of the web page. Instead, it will show you formatted output of various properties of the corresponding web request. For example:
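
Storing the response in a variable makes it easier to work with later (the URL here is only a placeholder; any page works):

$WebResponse = Invoke-WebRequest -Uri "https://www.example.com"
$WebResponse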

Storing HtmlWebResponseObject in a variable

Like most cmdlets, Invoke-WebRequest returns an object. If you execute the object’s GetType method, you will learn that the object is of the type HtmlWebResponseObject.
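
Assuming the response is stored in $WebResponse as above, a quick check could look like this:

$WebResponse.GetType().FullName

In Windows PowerShell, this returns Microsoft.PowerShell.Commands.HtmlWebResponseObject.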

As usual, you can pipe the object to Get-Member to get an overview of the object’s properties:
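
For instance:

$WebResponse | Get-Member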

Parse an HTML page

Properties such as Links or ParsedHtml indicate that the main purpose of the cmdlet is to parse web pages. If you just want to access the plain content of the downloaded page, you can do so through the Content property:
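
For example:

$WebResponse.Content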

There also is a RawContent property, which includes the HTTP header fields that the web server returned. Of course, you can also only read the HTTP header fields:
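
For example:

$WebResponse.Headers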

Headers of a web request

It may also be useful to have easy access to the HTTP response status codes and their descriptions:
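
For example:

$WebResponse.StatusCode
$WebResponse.StatusDescription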

The Links property is an array of objects that contain all the hyperlinks in the web page. The most interesting properties of a link object are innerHTML, innerText, outerHTML, and href.

The URL that the hyperlink points to is stored in href. To get a list of all links in the web page, you could use this command:
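
For instance:

$WebResponse.Links | Select-Object href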

Displaying a web page’s links

outerHTML refers to the entire link as it appears together with the <a> tag: <a href="http://contoso.com">Contoso</a>. Of course, other elements can appear here, such as additional attributes of the <a> element or additional HTML elements after the start tag (<a>), such as image tags. In contrast, the innerHTML property only stores the content between the start tag and the end tag (</a>) together with enclosed additional HTML elements.

The innerText property strips all HTML code from the innerHTML property. You can use this property to read the anchor text of a hyperlink. However, if additional HTML elements exist inside the <a> element, you will get the text between those tags as well.

Note that the link object also has an outerText property, but its contents will always be identical to those of the innerText property if you read a web page. The difference between outerText and innerText only matters if you write HTML code, which we don’t do here.
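
To see these properties side by side for a single link, you could run something like this:

$WebResponse.Links[0] | Select-Object innerHTML, innerText, outerHTML, href | Format-List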

The Images property can be handled in a similar way as the Links property. It, of course, does not contain the images themselves. Instead, it stores objects with properties that contain the HTML code that refers to the images. The most interesting properties are width, height, alt, and src. If you know a little HTML, you will know how to deal with these attributes.

The following example downloads all images from a web page:
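
A sketch of such a download loop (the URL is a placeholder, and the src attributes are assumed to contain absolute URLs):

$WebResponse = Invoke-WebRequest -Uri "https://www.example.com"
ForEach ($Image in $WebResponse.Images) {
    # Derive the file name from the image URL and save the file to the current folder
    $FileName = Split-Path $Image.src -Leaf
    Invoke-WebRequest -Uri $Image.src -OutFile $FileName
}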

$WebResponse.Images stores an array of image objects from which we extract the src attribute of the <img> element, which refers to the location of the image. With the help of the Split-Path cmdlet, we get the file name from the URL, which we use to store the image in the current folder.

The properties that you see when you pipe an HtmlWebResponseObject object to Get-Member are those that you need most often when you have to parse an HTML page. If you are looking for other HTML elements, you can use the AllElements and ParsedHTML properties.

AllElements (you guessed it already) contains all the HTML elements that the page contains:
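
For example:

$WebResponse.AllElements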

Of course, this also includes <a> and <img> elements, which means that you can also access them through the AllElements property. For instance, the command below, which displays all the links in a web page, is a somewhat more long-winded alternative to $WebResponse.Links:
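
Something along these lines:

$WebResponse.AllElements | Where-Object {$_.TagName -eq "a"} | Select-Object href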

ParsedHTML gives you access to the Document Object Model (DOM) of the web page. One difference from AllElements is that ParsedHTML also includes empty attributes of HTML elements. More interesting is that you can easily retrieve additional information about the web page. For example, the following command tells you when the page was last modified:
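
For instance:

$WebResponse.ParsedHtml.lastModified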

Determining when a web page was last modified

Submit an HTML form

Invoke-WebRequest also allows you to fill out form fields. Many websites use the HTTP method GET for forms, in which case you simply have to submit a URL that contains the form field entries. If you use a web browser to submit a form, you usually see how the URL is constructed. For instance, the next command searches for PowerShell on 4sysops:
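
The search term appears as a URL parameter, presumably the WordPress-style s parameter, so the request could look like this:

Invoke-WebRequest -Uri "https://4sysops.com/?s=PowerShell"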

If the website uses the POST method, things get a bit more complicated. The first thing you have to do is find out which method is used by displaying the forms objects:
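
Assuming the page is already stored in $WebResponse:

$WebResponse.Forms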

Displaying the forms in a web page

A web page sometimes has multiple forms using different methods. Usually you recognize the form you need by inspecting the Fields column. If the column is cut off, you can display all the form fields with this command:
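
For example:

$WebResponse.Forms.Fields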

Let’s have a look at a more concrete example. Our goal is to scrape the country code of a particular IP address from a Whois website. We first have to find out how the form is structured. Because we are working on the PowerShell console, it is okay to use the alias of Invoke-WebRequest:
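
With iwr as the alias, the check could look like this (querying the site’s start page):

(iwr "https://who.is").Forms | Format-List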

Determining the form field of a Whois website

We see that the website uses the POST method, that the URL to be called to process the query is https://who.is/domains/search, and that two form fields are required. The default value of the Search_type field is “Whois” and the query field is most likely the field for the IP address. We are now ready to scrape the country code of the IP address from the result page:
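
The query could look like this (the field names are taken from the form output above; the IP address is just an example):

$Fields = @{"Search_type" = "Whois"; "query" = "134.170.185.46"}
$WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
$Pre = ($WebResponse.AllElements | Where-Object {$_.TagName -eq "pre"}).innerText
If ($Pre -match "Country:\s+(\w{2})") {Write-Host "Country code:" $Matches[1]}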

Update: The example above no longer works because the web page now uses a different form field. You can use this field definition instead:
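
$Fields = @{"searchString" = "134.170.185.46"}
$WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
$Pre = ($WebResponse.AllElements | Where-Object {$_.TagName -eq "pre"}).innerText
If ($Pre -match "Country:\s+(\w{2})") {Write-Host "Country code:" $Matches[1]}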

In the first line, we define a hash table that contains the names of our two form fields and the values we want to submit. In line 2, we store the result page of the query in a variable. The web page returns the result within a <pre> element, and we extract its content in the next line.

We then use the -match operator with a regular expression to search for the country code. "\s+" matches any white space character, and "\w{2}" is supposed to match the country code, which consists of two characters. The parentheses group the country code, which allows us to access the result through the automatic variable $Matches.

33 Comments
  1. Kris 2 years ago

    Good stuff. Note that I got a message asking me to accept cookies every time I tried to do anything with the page contents (e.g., searching for a tag). Got round this by using -UseBasicParsing on Invoke-WebRequest, which uses PowerShell's built-in parser rather than Internet Explorer's.

    I used this to build a proof of concept to download Dilbert strips from the archive - download a page, find the appropriate image tag, download that image, add 1 to the date and do the same. Obviously not using it to download en masse, probably get blocked for that but very pleased it worked 🙂


  2. Michael Pietroforte 2 years ago

    Kris, thanks. I think I saw the cookie request only once. Maybe this is an IE setting? As to downloading en masse, you have no idea how many crawlers are out there and it is really hard to block them. Every minute or so another crawler hits 4sysops.


  3. Schorschi 2 years ago

    If it is not a web page but a file for download, how would you get the file information without actually downloading the file contents? The web response method will pull the entire file, in effect downloading or reading the file in total, when all that is desired is just the file information, like the size of the file.


    • Author
      Michael Pietroforte 2 years ago

      The file properties are stored in the filesystem on the host. Web servers usually don't transmit this information. So if you want to read the file metadata without downloading the file, you need an API on the host that offers this data.

      If the remote host is a Windows machine you can use PowerShell remoting to read the file size:

      invoke-command -computername RemoteComputerName -scriptblock {(get-item c:\windows\notepad.exe).length}


  4. Caroline 1 year ago

    Thanks for the informative article on Invoke-WebRequest. Just what I was looking for.


  5. Oleg 1 year ago

    I need to download a file from https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html and modify it (for example, add some tags... bootstrapGridSystem.css). In PowerShell it looks like:
    $results = irm -uri "https://raw.githubusercontent.com/h5bp/html5-boilerplate/master/src/index.html"
    $html = $results.ParsedHtml
    But how can I modify the object? Is it possible to modify it?
    For example, add <link href="bootstrapGridSystem.css"> and <link href="foundationGridSystem">
    Because this code:
    $linkBoot = $html.createElement("link")
    $linkBoot = "css/bootstrapGridSystem.css"
    $headTag = $html.getElementsByTagName("head")[0]
    $headTag.appendChild("link") didn't modify the object $results.content?


  6. tejanagios 1 year ago

    Hi,

    I am using your script and leveraging it to download image files from a list of URLs. The script loops through each URL, invokes a web request, and downloads the images from it. The problem I am facing is that the images are by default downloaded at 320x240, whereas on the actual site, when the image is opened in a new tab and downloaded via right-click, I get a 960x720 px file, which is what I am after.

    here is the script.

    $url = Get-Content "urls.txt"
    $j = $url.Count
    for ($i = 0; $i -lt $j; $i++)
    {
        $WebResponse = Invoke-WebRequest -Uri $url[$i]
        ForEach ($Image in $WebResponse.Images)
        {
            $FileName = Split-Path $Image.src -Leaf
            $d = Invoke-WebRequest $Image.src
        }
    }


    • Author

      The problem is that the src attribute of the image tag only points to the image that you see on the web page. The URL of the image that is displayed when you click an image is in an a tag before the image tag. Thus, you have to retrieve all links in the web page (as explained in the article) and then get all URLs that point to images. Those URLs all have image extensions such as .jpg or .png. You could work with a regular expression to sort out these URLs.


  7. teja 1 year ago

    Thank you. After looping through all the URLs, I've got the final output.

    Here is the working code, although it can be improved:

    $source = Invoke-WebRequest -Uri "<enter URL here>" |
        Select-Object -ExpandProperty links | Select-Object -ExpandProperty href | Select-String "part"
    $j = $source.Count
    for ($k = 0; $k -lt $j; $k++)
    {
        #write-host $source[$k].Line
        $links = Invoke-WebRequest -Uri $source[$k].Line |
            Select-Object -ExpandProperty links | Select-Object -ExpandProperty href | Select-String ".PNG"
        foreach ($link in $links)
        {
            $filename = Split-Path $link.line -Leaf
            Invoke-WebRequest -Uri $link.Line -OutFile "C:\users\admin\Desktop\images\$k$filename"
        }
    }


    • Author

      Thanks for sharing. If the web page not only contains links to PNGs but also to JPGs, you could use this: 
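
      Presumably the suggestion was a pattern filter along these lines (the original snippet is not shown; this is a guess):

      # guess at the missing filter: match both .png and .jpg links
      $links = Invoke-WebRequest -Uri $source[$k].Line | Select-Object -ExpandProperty links | Select-Object -ExpandProperty href | Select-String -Pattern "\.png|\.jpg"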


  8. ron 12 months ago

    Does this still work?  Perhaps the website has changed its search/query methods. I am not seeing a <pre> tag.


    • Author
      Michael Pietroforte 12 months ago

      The code no longer works because they changed the form field. This should work now:

      $Fields = @{"searchString" = "134.170.185.46"}
      $WebResponse = Invoke-WebRequest -Uri "https://who.is/domains/search" -Method Post -Body $Fields
      $Pre = $WebResponse.AllElements | Where {$_.TagName -eq "pre"}
      If ($Pre -match "Country:\s+(\w{2})")
      {
          Write-Host "Country code:" $Matches[1]
      }


  9. Steve Giovanni 6 months ago

    I was trying to use this on a website I visit to see a list they post there weekly.  There is no RSS feed or anything so you have to manually go to the site. I thought it would be fun to automate scraping the weekly options and emailing them to myself, which is where your article came in very handy, thank you!

    The problem is that while I can pull back the URL, it looks like the stuff I want is not embedded in the actual page; they are pulling it in from a frame (I think).

    I'm decent at basic PowerShell scripting but haven't looked at HTML since the late 90s. Any ideas?  I also tried RawContent and AllElements to no avail.


    • Author
      Michael Pietroforte 6 months ago

      If it is an iframe, you can just load the iframe's URL. In most browsers, you can right-click the element in the web page and then click "Inspect." You should then be able to see the URL where the content that interests you is coming from.


  10. Steve Giovanni 6 months ago

    I tried to Inspect Element and this is what I see:

    <div style="left: 496px; width: 475px; position: absolute; top: 98px;" class="txtNew" id="WRchTxtd-17bd" data-reactid=".0.$SITE_ROOT.$desktop_siteRoot.$PAGES_CONTAINER.1.1.$SITE_PAGES.$c1a73_DESKTOP.1.$WRchTxtd-17bd"><p class="font_8" style="font-size:28px; text-align:center;">

    Not much help there from what I can discern, so I guess my question is: is there a way to tell PowerShell to just download/render the page as a browser would, so that I can then parse it from there?


    • Author
      Michael Pietroforte 6 months ago

      I guess the div box is filled by JavaScript. Where should PowerShell render the page? In the console? And why would that help with parsing? PowerShell creates objects of the HTML elements in the web page. However, PowerShell doesn't understand JavaScript. I suppose you are better off with a web scraping tool that has a GUI.


  11. Steve Giovanni 6 months ago

    Michael thank you for your reply.  I was able to get a bit further, but it still isn't working correctly for some reason.  Would you mind taking a look and letting me know if you see what I'm doing wrong?

    $site = Invoke-WebRequest -Uri "https://www.localfarefarmbagsouth.com/about_us"
    ($site.ParsedHtml.getElementsByTagName('p') | Where {$_.className -eq 'font_8'}).innerText


    • Author
      Michael Pietroforte 6 months ago

      Try this:
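
      Presumably the command was something like this (the original snippet is not shown; this is a guess):

      (Invoke-WebRequest -Uri "https://www.localfarefarmbagsouth.com/about_us").Content   # guess at the missing command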

      You will see there is no HTML between the body tags. This is all JavaScript. You need a scraping tool with an engine that can execute JavaScript.


  12. Steve Giovanni 6 months ago

    Understood, thanks!


  13. coltae 6 months ago

    Thank you so much for this page! It has helped me work on something I have been struggling with for days. Now,

    $WebResponse = Invoke-WebRequest https://www.ebay.com/posters

    $WebResponse.AllElements | ? { $_.Class -eq 'price' } | select innerText

    This does work and gives me a list of, let's say, 10 items, but is there a way I can store one of the specific values, like always the 3rd in the list that populates, in a variable/array?

    THANK YOU 🙂


    • Author
      Michael Pietroforte 6 months ago

      I am getting a "page not found" with your URL. Can you provide a public page for your example?


  14. Nick 6 months ago

    Excellent article!  I'm trying to pull a specific piece of text from a REST query and use it as a qualifier in a where clause for another PowerShell query... but I've hit a wall.  Any thoughts?

    $site = Invoke-WebRequest "https://services1.arcgis.com/bqfNVPUK3HOnCFmA/arcgis/rest/services/Traffic_Accident_Locations/FeatureServer/0/query?where=1=1&outFields=AccidentNumber&returnGeometry=false&orderByFields=AccidentNumber+DESC&resultRecordCount=1"
    $maxAccidentID = $site.AllElements | ? {$_.Class -eq 'ftrTable'} | select td
    $url = "http://policeview.johnscreekga.gov/resource/fke7-a2vb.geojson?$where=" + $maxAccidentID
    $filePath = 'C:\temp\PoliceAccidents.geojson'
    Invoke-RestMethod -Uri $url -Method GET -OutFile $filePath


  15. Lakshmi Prabha 4 months ago

    Hi, I am new to PowerShell scripting, and I need a PowerShell script that will display the entire contents of a web page or the last 100 lines of that web page. Please help.


    • Author
      Michael Pietroforte 4 months ago

      Try this: (wget google.com).content

      Displaying the last 100 lines is trickier because there are many different ways to start a new line in HTML.

      The other question is whether it makes sense to "display" a web page with PowerShell, because you usually want to parse a web page with a programming language.


  16. Premji Nitwal 4 months ago

    Hi, I need to transfer pcloud files to uptobox. But pcloud download links change for each IP; that is why the link does not work in the remote URL upload of any file hosting site.

    Can PowerShell help me in this matter? If yes, please tell me how to do it.


    • Premji Nitwal 4 months ago

      Please reply to this comment.

      Thank You.


      • Author
        Michael Pietroforte 4 months ago

        I am not familiar with pcloud and uptobox. Perhaps the download links change according to a certain pattern that you could use in your script? Depending on the number of users your organization has, you might consider instructing your users how they can download the files to a local drive and then upload them to the new provider.


  17. Premji Nitwal 4 months ago

    I am a user at pcloud, and I want to transfer my files to uptobox.com.

    Pcloud has hotlinking protection. Do you know how to bypass it?

    My friend told me to use Fiddler and web requests, but I am new to this software. So can you help me please?

    Thank you.


    • Author
      Michael Pietroforte 4 months ago

      I can't help you because I don't know Pcloud. I recommend that you contact their support and ask how you can download your data to your PC.


  18. Premji Nitwal 4 months ago

    Hello, I can download the data to my PC, but it would take a huge amount of time to upload from the PC rather than using a remote URL upload.

    1) Please create an account (a free one) at pcloud & uptobox. (You can use a fake email ID to register.)

    2) Upload a file in pcloud and click download. It will give you a download link; I am asking you to parse the download link.

    3) This will give you the actual link to use in the remote URL upload of Uptobox.

    Did you understand my problem?

    Thank You.


  19. Author
    Michael Pietroforte 4 months ago

    I can't help you with that, and I doubt that what you are planning is possible.


  20. niranjan 1 month ago

    Hi,

    I'm new to this, but I have a task to fetch properties from web services/WSDL (which is in XML format).

    Please help me.


  21. GT 1 month ago

    I am trying to create something similar to check for URL reputation using sitereview.bluecoat.com.

    How can you create a PowerShell script to pull a list of URLs, check their ratings, and return the results?

