Users rarely verify screenshots of social media posts before sharing them, which contributes to the spread of misinformation and disinformation. We are developing an automated tool to estimate the probability that a screenshot of a social media post is fake. In many cases, web archives can be used to validate the attribution of such screenshots.
Web Archives for Verifying Attribution in Twitter Screenshots
1. Web Archives for Verifying Attribution in Twitter Screenshots
Presented By:
Tarannum Zaki, PhD Student
Advisors: Dr. Michael L. Nelson & Dr. Michele C. Weigle
Department of Computer Science
Old Dominion University, Norfolk, Virginia
April 26, 2024
@tarannum_zaki @WebSciDL
2024 Web Science and Digital Libraries Research Group Expo
2. Screenshots are commonly used to annotate the social media of others
Tarannum Zaki WSDL Research Group Expo 2024 Web Archives for Verifying Attribution in Twitter Screenshots
https://twitter.com/BetteMidler/status/1541472225341198338
https://twitter.com/MahyarTousi/status/1534307163073658881
https://twitter.com/urbanachievr/status/1505944201208516612
3. Why screenshots?
To use as evidence of deleted posts
https://web.archive.org/web/20220525125749/https://twitter.com/DanielDefense/status/1526237750277681154
Controversial posts may be deleted.
https://twitter.com/ashtonpittman/status/1530243294868930560
https://twitter.com/DanielDefense/status/1526237750277681154
Other reasons: to deny cross-platform engagement, to aggregate, to mark up, etc.
4. Did they really post that?
Screenshots can also be used for humor, satire, and disinformation
https://twitter.com/Shayan86/status/1515753937139388418
https://twitter.com/paulthacker11/status/1495436489492090881
https://twitter.com/elonmusk/status/1544051155562598401
5. Creating fake tweets using Tweetgen
https://www.tweetgen.com/
https://www.tweetgen.com/create/tweet.html
6. Using the live web and web archives to validate attribution of screenshots
https://www.google.com/search
https://archive.org/web/
https://www.reuters.com/
https://www.snopes.com/
7. Motivation
➢ Fake tweets can contribute to the spread of misinformation and disinformation.
➢ Fake tweets are easy to create using online tools.
➢ There are no tools currently available to evaluate the authenticity of the attribution of screenshots.
8. Aim
To develop a tool that automatically estimates the probability that a screenshot of a social media post was actually posted by the alleged author, using the services of the live web and web archives.
9. To search for a tweet in the Wayback Machine, you must first know its URL
https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
URL of the tweet:
https://twitter.com/annaturley/status/1506706947239817224
https://web.archive.org/
10. But the URL of a tweet is not present in most screenshots
https://twitter.com/AaronBastani/status/1507391218854117377
@annaturley
March 23, 2022
March 25, 2022
https://twitter.com/TWITTER_HANDLE/status/TWEET_ID
https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
The tweet ID encodes the timestamp of when the tweet was created.
Construction of a tweet URL
- Use the Twitter handle and approximate a time window based on the timestamp.
- Construct the URL for the tweet.
- Search for the tweet in the Wayback Machine using the URL.
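The two URL patterns shown on this slide can be sketched in Python. The helper names `tweet_url` and `wayback_url` are hypothetical and exist only to illustrate the templates; the handle, tweet ID, and capture timestamp are the ones from the slide.

```python
def tweet_url(handle: str, tweet_id: int) -> str:
    """Build a live-web tweet URL from a Twitter handle and a tweet ID."""
    return f"https://twitter.com/{handle}/status/{tweet_id}"


def wayback_url(capture_ts: str, original_url: str) -> str:
    """Build a Wayback Machine URL from a 14-digit capture timestamp."""
    return f"https://web.archive.org/web/{capture_ts}/{original_url}"


live = tweet_url("annaturley", 1506706947239817224)
archived = wayback_url("20220323185843", live)
print(live)
print(archived)
```

The difficulty the slide points out is that the tweet ID is not visible in a screenshot, so only the handle part of the URL can be filled in directly.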
11. Verifying if screenshot exists in the Wayback Machine
12. Creating a dataset of screenshots collected from Twitter
Fields: Shared post's URL, Original post's URL, Category, Reason, Content category, Structural features, Post type, Social media, Search strategy, Annotated images, Screenshot, Remarks
- Screenshot images shared on Twitter.
- 200 examples
- Examples include both real and fake screenshots
https://ws-dl.blogspot.com/2022/12/2022-12-12-disinformation-spread-on.html
https://twitter.com/rvawonk/status/1503227687917305863
https://twitter.com/RealCandaceO/status/1501576352587292673
Category: Real
Reason: Found in the live web
Content category: Politics
Post type: Tweet
Structural features: Single author, single post
Search strategy: Searched on Twitter interface
Social media: Twitter
Original post’s URL
Shared post’s URL
Screenshot
13. OCRing screenshots: Single tweet images
Optical Character Recognition (OCR) extracts text from a digital image.
Figure: example screenshot image and its OCR-extracted output (Twitter handle, timestamp, tweet text).
Zaki, T., Nelson, M.L., and Weigle, M.C. (2023, Jun 14). Extracting Information from Twitter Screenshots. Tech Report arXiv:2306.08236. https://doi.org/10.48550/arXiv.2306.08236
14. Computing a time window based on the screenshot timestamp
The maximum difference between two time zones on Earth is 26 hours (UTC-12 to UTC+14).
Figure: example screenshot image, OCR-extracted output, and the Twitter handle with computed timestamps.
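A minimal sketch of the window computation, assuming the window extends 26 hours on each side of the timestamp read from the screenshot (52 hours in total). The function name `compute_time_window` and the example timestamp are hypothetical; the slides do not state the screenshot's actual timestamp.

```python
from datetime import datetime, timedelta

# Largest possible offset between two time zones (UTC-12 to UTC+14).
MAX_TZ_SPREAD = timedelta(hours=26)
# 14-digit timestamp format used by the Wayback Machine (YYYYMMDDhhmmss).
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def compute_time_window(screenshot_time: datetime) -> tuple[str, str]:
    """Return (left, right) boundaries 26 hours before and after the timestamp."""
    left = screenshot_time - MAX_TZ_SPREAD
    right = screenshot_time + MAX_TZ_SPREAD
    return left.strftime(WAYBACK_FORMAT), right.strftime(WAYBACK_FORMAT)


# Hypothetical screenshot timestamp (time zone unknown, as in a real screenshot).
left, right = compute_time_window(datetime(2022, 2, 19, 17, 41))
print(left, right)
```

Because a screenshot shows the time in the viewer's local time zone without naming it, widening the window this way is a simple means of covering every zone the viewer could have been in.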
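15. Using the CDX API to retrieve archived tweets with a left-hand boundary
# URI-R prefix for all of the account's tweets (the handle is a string)
urir = "https://twitter.com/" + "randyhillier" + "/status"
# Prefix match, starting from the left-hand boundary of the time window
params = "&matchType=prefix&from=" + "20220218154100"
request = "http://web.archive.org/cdx/search/cdx?url=" + urir + params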
CDX API prefix search process
Twitter handle and computed timestamps
Output: retrieved archived tweets with the left-hand boundary (cropped).
https://archive.org/help/wayback_api.php
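The request above can also be assembled with the standard library so that the query string is encoded safely. This is a sketch under assumptions: the function name `build_cdx_query` is hypothetical, and the optional `to` and `output=json` parameters are CDX API options that do not appear on the slide. The block only builds the URL; fetching it is left out.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(handle, from_ts, to_ts=None, as_json=False):
    """Build a CDX prefix query for every archived tweet URL of one account."""
    params = {
        "url": f"https://twitter.com/{handle}/status",
        "matchType": "prefix",   # match all URLs that start with the prefix
        "from": from_ts,         # left-hand boundary (14-digit timestamp)
    }
    if to_ts is not None:
        params["to"] = to_ts     # optional right-hand boundary (assumption)
    if as_json:
        params["output"] = "json"
    return CDX_ENDPOINT + "?" + urlencode(params)


print(build_cdx_query("randyhillier", "20220218154100"))
```

Note that `from` and `to` limit the capture (memento) datetime, not the tweet creation time, which is why the tweet IDs still need to be decoded and filtered afterwards.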
16. Extracting tweet IDs and determining tweet creation timestamp using TweetedAt
https://web.archive.org/web/20220222163926/https://twitter.com/randyhillier/status/1006984708109099008
https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html
Each tweet ID encodes its creation timestamp
An archived tweet’s URL
https://oduwsdl.github.io/tweetedat/#1006984708109099008
Tweet ID Tweet Creation Date
1006984708109099008 20180613194037
………… …………..
Mapping between all the tweet IDs and tweet creation timestamps
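The decoding step can be sketched directly, assuming Snowflake-style tweet IDs (in use since November 2010), where the bits above the lowest 22 hold milliseconds since the Twitter epoch. This is not the TweetedAt implementation, and the function names are hypothetical; TweetedAt also covers older, pre-Snowflake IDs, which this sketch does not.

```python
import re
from datetime import datetime, timezone

# Offset of the Snowflake epoch from the Unix epoch, in milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def extract_tweet_id(url: str) -> int:
    """Pull the numeric tweet ID out of a (possibly archived) tweet URL."""
    match = re.search(r"/status/(\d+)", url)
    if match is None:
        raise ValueError(f"no tweet ID found in {url!r}")
    return int(match.group(1))


def tweet_id_to_timestamp(tweet_id: int) -> str:
    """Decode a Snowflake tweet ID into a 14-digit UTC creation timestamp."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    created = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return created.strftime("%Y%m%d%H%M%S")


memento = ("https://web.archive.org/web/20220222163926/"
           "https://twitter.com/randyhillier/status/1006984708109099008")
tweet_id = extract_tweet_id(memento)
print(tweet_id, tweet_id_to_timestamp(tweet_id))
```

For the tweet ID on this slide, the sketch reproduces the creation date listed in the mapping (20180613194037).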
17. Determining the final set of archived tweets by filtering the tweet creation timestamps within the time window
Output: 917 archived tweets with the left-hand boundary (cropped)
Mapping between tweet ID and tweet creation timestamp
Output: 29 archived tweets within the 52-hour time window (cropped)
Tweets whose creation timestamps do not fall within the 52-hour time window are filtered out.
449 archived tweets
Multiple mementos are filtered out.
29 archived tweets
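The two filtering steps on this slide can be sketched as follows. The function name `filter_by_window` and the sample data are hypothetical; the order in which the two filters are applied is an assumption, although the final set is the same either way.

```python
def filter_by_window(captures, start, end):
    """Keep one entry per tweet ID whose creation timestamp lies in [start, end].

    captures: iterable of (tweet_id, creation_ts) pairs, one per memento, where
    creation_ts is a 14-digit string. Fixed-width digit strings compare
    correctly as plain strings, so no date parsing is needed.
    """
    unique = {}
    for tweet_id, created in captures:
        unique.setdefault(tweet_id, created)   # collapse multiple mementos
    return {tid: ts for tid, ts in unique.items() if start <= ts <= end}


# Hypothetical mapping: the second tweet was captured twice.
captures = [
    (1006984708109099008, "20180613194037"),
    (1495226962058649603, "20220220024102"),
    (1495226962058649603, "20220220024102"),
]
print(filter_by_window(captures, "20220218154100", "20220220194100"))
```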
18. Extracting tweet text from archived tweets using BeautifulSoup and Selenium
https://web.archive.org/web/20220220024223/https://twitter.com/randyhillier/status/1495226962058649603
HTML tag containing the tweet text (class attribute): TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text
Figure: an archived tweet's URL and the text extracted from the archived tweet.
https://www.selenium.dev/
https://pypi.org/project/beautifulsoup4/
Selenium automates a browser to load the archived pages, and BeautifulSoup parses the tweet text from the HTML.
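The extraction step can be illustrated without third-party packages. The sketch below swaps in Python's built-in `html.parser` in place of BeautifulSoup and takes the page HTML as a string, so loading the page (the Selenium part) is not shown. The class name `tweet-text` comes from the slide; the parser class, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TweetTextParser(HTMLParser):
    """Collect the text of the first element whose class list has the target."""

    def __init__(self, target_class="tweet-text"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0        # > 0 while inside the target element
        self.done = False     # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_tweet_text(html: str) -> str:
    """Return the whitespace-normalized text of the first tweet-text element."""
    parser = TweetTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<div><p class="TweetTextSize TweetTextSize--jumbo js-tweet-text '
          'tweet-text">Hello <img src="e.png"><a href="#">@world</a> again</p>'
          '<p class="tweet-text">a reply</p></div>')
print(extract_tweet_text(sample))
```

Only the first matching element is kept because a replayed tweet page may also contain replies that carry the same class.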
19. Computing text similarity score between tweet text from screenshot and archived tweets using Python's difflib library
https://docs.python.org/3/library/difflib.html
Figure: example screenshot image, text extracted from the archived tweet, and tweet text extracted from the screenshot.
match_score(Archived_Tweet_Text, Screenshot_Tweet_Text) = 81.40%
The text similarity score is computed based on the longest common subsequence.
match_score(Archived_Tweet_Text1, Screenshot_Tweet_Text) = 81.40%
match_score(Archived_Tweet_Text2, Screenshot_Tweet_Text) = 30.78%
match_score(Archived_Tweet_Text3, Screenshot_Tweet_Text) = 5.67%
...
A match score of 81.40% supports the claim that the tweet in the screenshot was posted by the alleged author.
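One way to obtain such a score with difflib is `SequenceMatcher.ratio()`, sketched below. This is an assumption about how `match_score` could be computed, not a confirmed reproduction of it. `SequenceMatcher` repeatedly finds the longest contiguous matching block, and `ratio()` returns 2M/T, where M is the number of matched characters and T is the combined length of both strings. The whitespace normalization and `autojunk=False` are illustrative choices.

```python
from difflib import SequenceMatcher


def match_score(archived_text: str, screenshot_text: str) -> float:
    """Return a similarity percentage between two tweet texts."""
    # Collapse whitespace so OCR line breaks do not lower the score.
    a = " ".join(archived_text.split())
    b = " ".join(screenshot_text.split())
    # autojunk=False disables difflib's popularity heuristic, which can
    # otherwise ignore frequent characters once a string reaches 200 items.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


print(f"{match_score('abcd', 'bcde'):.2f}%")
```

A candidate list can then be ranked by score, with the highest-scoring archived tweet compared against a chosen threshold.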
20. A threshold of 60% produced the highest F1 (0.69)
Threshold  Precision  Recall  F1 score
90%        1.00       0.42    0.59
80%        1.00       0.49    0.66
70%        1.00       0.51    0.67
60%        1.00       0.53    0.69
Experimented on 108 single tweet images from the collected dataset.
Performance of the overlap between the tweet text from the screenshot and the archived tweets.
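For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the rounded values in the table; the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) rows as listed in the table.
rows = [("90%", 1.00, 0.42), ("80%", 1.00, 0.49),
        ("70%", 1.00, 0.51), ("60%", 1.00, 0.53)]
for threshold, p, r in rows:
    print(threshold, round(f1_score(p, r), 2))
```

Since precision stays at 1.00 for every threshold, the F1 differences come entirely from recall. Recomputing from the rounded recall values can differ from the table in the last digit (the 70% row is one such case), presumably because the reported scores were computed from unrounded values.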
21. Limitations & Future Work
OCR: for complex screenshot images, the extracted output is mostly garbage.
22. Summary
➢ Screenshots are an easy way to share content on social media.
➢ Since screenshots can be easily faked, detecting fabricated posts is a critical task.
➢ Services of web archives could be useful to verify the attribution of a screenshot by finding an archived version of the screenshot content.
➢ Our research aims to mitigate the spread of misinformation and disinformation on social media.