Overview: What's Artificial Intelligence?
Artificial Intelligence (AI) is about making computers do tasks that normally need human thinking. AI has many different parts, each focusing on different things. One important part is called machine learning, which helps computers learn from data and get better at their jobs over time. Another part is natural language processing, which helps computers understand and talk to people using human language. There is also computer vision, where computers learn to see and understand pictures or videos. Robotics is another area where AI is used to make smart machines that can move and work in the real world.

What's Computer Vision?
Computer vision is a part of both AI and computer science that helps computers understand and make sense of what they see, much like humans do with their eyes. It involves creating special programs and systems that can look at pictures and videos and figure out important details. This allows machines to "see" and get useful information from images, just like we do when we look at the world around us.

How Does Computer Vision Work?
Computer vision helps computers understand and interpret images or videos, similar to how humans see and process visual information. Here's how it works, in simple steps:

Image Acquisition: First, the computer gets images or videos from sources like cameras, drones, or even satellites.

Preprocessing: After getting the images, they are cleaned and improved (see the small code sketch after this list). This step can include:
Noise Reduction: Removing unwanted parts or errors in the image.
Image Enhancement: Adjusting brightness, contrast, and sharpness to make the image clearer.
Normalization: Making sure the lighting is even across the image.
Resizing and Cropping: Changing the size of the image and focusing on important areas.

Feature Extraction: The computer looks for important details in the image, like edges, shapes, colors, or patterns. This helps the computer understand what is in the picture.

Feature Representation: These details are then turned into a format the computer can use to make decisions, like a set of numbers or patterns.

Machine Learning and Deep Learning: To make the computer smarter, models like Convolutional Neural Networks (CNNs) are trained to recognize objects, classify images, or detect patterns. These models learn from examples.

Task-Specific Processing:
Object Detection: Finding and pointing out objects in an image or video.
Image Classification: Giving a label to an image based on what's in it (like "cat" or "car").
Image Segmentation: Breaking an image into meaningful parts, like separating the sky from buildings.
Face Recognition: Identifying people by analyzing their facial features.
Motion Tracking: Watching how objects move in videos.

Postprocessing: After making predictions, the results may be polished by removing mistakes or improving boundaries between objects.

Visualization and Interpretation: The computer shows the results in an understandable way, like highlighting detected objects or providing descriptions.

Feedback Loop: In some cases, like self-driving cars, the computer vision system gives feedback to control the vehicle's movements, like steering or braking.

These steps vary depending on the specific task and the complexity of the images or videos.
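To make the preprocessing step more concrete, here is a minimal sketch of what that cleanup could look like in Python with OpenCV. The file name example.jpg, the crop region, and the parameter values are just placeholders for illustration, not part of any specific project:

import cv2

# Load an example image from disk (hypothetical file name)
image = cv2.imread("example.jpg")

# Noise reduction: smooth out small sensor errors with a Gaussian blur
denoised = cv2.GaussianBlur(image, (5, 5), 0)

# Image enhancement: a simple contrast/brightness adjustment
# (alpha scales contrast, beta shifts brightness)
enhanced = cv2.convertScaleAbs(denoised, alpha=1.2, beta=10)

# Normalization: stretch pixel values to use the full 0-255 range
normalized = cv2.normalize(enhanced, None, 0, 255, cv2.NORM_MINMAX)

# Resizing and cropping: shrink the image and keep only a region of interest
resized = cv2.resize(normalized, (640, 480))
cropped = resized[100:380, 160:480]  # slice rows first, then columns

In a real project, which of these steps you apply, and in what order, depends on your images and your task.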
Computer Vision: Applications
Computer vision is used in many industries because it helps computers understand and analyze images and videos. Here are some simple examples of how it's used:

Healthcare: Doctors use computer vision to look at medical images (like X-rays) to find diseases early. It also helps robots assist in surgeries.

Cars: Self-driving cars use computer vision to see the road, avoid obstacles, and read traffic signs. It also helps with parking and safety features like automatic braking.

Stores: Cameras track what customers do in stores, helping to improve the shopping experience. Some stores use it for self-checkout, where customers don't have to go to a cashier.

Factories: Computer vision checks products for mistakes and helps robots build things. It also monitors machines to prevent them from breaking down.

Farming: Farmers use drones to check the health of their crops. Robots with computer vision can also pick fruits and vegetables when they are ripe.

Security: Cameras with computer vision recognize people's faces and detect suspicious activities in public places to keep things safe.

Entertainment: Apps like Snapchat use computer vision to add fun filters to your face. Video games and sports also use it to track players' movements.

In our lesson, we will look at a simple hand-tracking application. This means we will learn how to make the computer recognize and follow the movements of your hand in real time using computer vision.

Hand Tracking Using MediaPipe
The MediaPipe Hand Landmarker is a tool that helps detect important points on the hands in a picture or video. You can use it to find where the hand is and even add effects, like drawing on the hand in real time. This tool works with images or videos using a machine learning model. It can tell:
The key points of the hand (like the fingers and palm) in the image.
The hand's position in 3D space (world coordinates).
Whether the hand is a left or right hand.
This is useful for applications like hand tracking, games, and adding effects to hand movements.

What About MediaPipe's Machine Learning Models?
The Hand Landmarker uses two models together to work properly:
Palm Detection Model: This model helps find the palm in the image.
Hand Landmarks Detection Model: Once the palm is found, this model detects the key points on the hand, like the fingers and the palm's shape.
Both models are packaged in a model bundle, and you need this bundle to make the hand detection task work. The hand landmark model bundle can find 21 important points (keypoints) on the hand, like knuckles and finger joints, inside the detected hand area. This model was trained on about 30,000 real-world hand images, along with computer-generated (synthetic) hand models placed on different backgrounds. This helps the model accurately detect hands in various situations.

Finding hands with the palm detection model can take a bit of time, especially with videos or live streams. To make this faster, the Hand Landmarker looks at the area where it found the hand in the previous frame to find hands in the following frames. It only runs the palm detection model again if it stops seeing the hands or can no longer follow them. This helps the system work faster and spend less time finding hands.
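If you are curious which 21 keypoints the model reports, the Python package exposes them as a named enum. This little loop is only a quick way to peek at the landmark names and their index numbers; it is not required for the project below:

import mediapipe as mp

# Print the index and name of each of the 21 hand landmarks
# (WRIST is 0, THUMB_TIP is 4, INDEX_FINGER_TIP is 8, and so on)
for landmark in mp.solutions.hands.HandLandmark:
    print(landmark.value, landmark.name)

These index numbers will matter later, when we convert landmarks into pixel positions and check individual fingers.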
"Tell me and I forget. Teach me and I remember. Involve me and I learn." — Benjamin Franklin

Let's Practice With a Practical Example
In this project, we will use the Python programming language to do our hand tracking. To get started, we need a camera to capture video of our hands (you can use your laptop camera). We will also need some basics of procedural programming in Python; we won't be using object-oriented programming (OOP) and classes, to keep things simple for beginners. We will set up our work environment using PyCharm Community Edition, which is free to use. This will help us write and run our Python code easily as we work on hand tracking.

Setting Up Our Work Environment
First, go to Google, search for "PyCharm IDE", and open the PyCharm Community Edition result from the software releases page. Scroll down and click the download button in the second column; it will take you to the download page. Scroll down until you find PyCharm Community Edition and download the installer for your operating system (for me, Windows). The download will begin shortly. Wait for it to finish, then install it like you would any other software.

Creating the Hand-Tracking Project
After you install the PyCharm IDE, open it from the icon on your desktop or from the Start menu and create a project. First, click "New Project" to create a new one. Second, make sure to rename your project and save it in your preferred path. For the Python version, please choose 3.8.10 (if your version is higher, you can downgrade by downloading 3.8.10 from the Python website; it's an installer you can run easily). Wait until the project opens.

Install MediaPipe
This will give us the ability to use the two models we talked about earlier. Wait until the green message appears, telling you the package has been installed.

Writing and Understanding the Code

Step 1: The first thing we need to do is import the necessary packages into our Python code. Earlier, we installed the MediaPipe package, which includes all the machine learning models we need for our hand tracking. Now, let's import this package so we can start using it in our Python code:

import cv2  # For handling the camera and images
import mediapipe as mp  # For hand tracking using MediaPipe

The mediapipe package will allow us to use the hand-tracking models directly in our program. We'll use cv2 (OpenCV) to access our camera and display the video where the hand tracking will take place.

Step 2: After importing the packages, we can now use cv2 to capture an image, a video, or open your webcam. Here's the code to open the webcam, set the video resolution, and display the video feed:

import cv2

# Open the webcam (0 is the default camera)
cap = cv2.VideoCapture(0)

# Set the width and height of the video feed
cap.set(3, 640)  # Set width
cap.set(4, 480)  # Set height

# Loop to continuously capture frames
while True:
    success, image = cap.read()  # Capture a frame from the webcam
    if not success:
        print("Error: Failed to capture image")
        break

    # Convert the image from BGR (default format in OpenCV) to RGB
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Display the video feed with the title 'hand tracking'
    cv2.imshow('hand tracking', image_rgb)

    # Exit the loop when 'q' is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Explanation:

cap = cv2.VideoCapture(0): Opens your webcam for video capture. 0 refers to the default webcam.

cap.set(3, 640) and cap.set(4, 480): These lines set the width to 640 and the height to 480 for the video frame (3 and 4 are the numeric codes for cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT).

success, image = cap.read(): Reads each frame from the webcam. If the capture succeeds, success will be True, and the captured frame will be stored in image.

image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB): Converts the frame from BGR (OpenCV's default format) to RGB. But what's the purpose of this line of code?
The answer: OpenCV uses BGR (Blue-Green-Red) as its default format for images and video frames, while most other libraries, like MediaPipe, and many machine learning models expect images in RGB (Red-Green-Blue) format. We convert the frame from BGR to RGB for two reasons:

Compatibility: If we are using a machine learning model, like MediaPipe for hand tracking, these models typically expect the input image to be in RGB format.

Accuracy: The color channels in BGR are in a different order than in RGB. If we don't convert the image, the model or other libraries may interpret the colors incorrectly, which could lead to inaccurate results.

In short, the conversion is necessary to ensure that the image is processed correctly by models that expect RGB format.

cv2.imshow('hand tracking', image_rgb): Displays the RGB frame in a window titled "hand tracking". (Note that cv2.imshow itself assumes BGR, so the colors in this preview may look swapped; that's fine for our purposes, but you can pass the original image instead if you prefer natural colors.)

cv2.waitKey(1): Waits 1 millisecond between frames so the window keeps refreshing and showing the next frame; when the 'q' key is pressed, we break out of the loop.

At the end you should have this output: [All source code will be uploaded at the end of this article]

Step 3: A landmark in hand tracking is a specific point on the hand that helps to identify its shape and position. For example, when we talk about the hand, landmarks include:
The tips of the fingers (like the end of the thumb, index finger, etc.).
The joints between the fingers.
The center of the palm.
In total, the hand landmark model detects 21 important points on the hand. These points help the computer understand where the hand is and how it is moving.

Why are landmarks important?
They help to recognize different hand shapes and positions.
They allow us to track the hand's movement in real time.
We can use these points to add effects or to control games and applications using hand gestures.
So, landmarks are like markers that tell us exactly where the important parts of the hand are.

In this step, we start using the hand landmark detection models from the MediaPipe package. Here's a breakdown of how this code works:

import cv2
import mediapipe as mp

# Open the webcam (0 is the default camera)
cap = cv2.VideoCapture(0)

# Set the width and height of the video feed
cap.set(3, 640)  # Width
cap.set(4, 480)  # Height

# Initialize the MediaPipe Hands model
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()  # This sets up the hand tracking model

# Start the loop to capture video and process hand landmarks
while True:
    success, image = cap.read()  # Capture a frame from the webcam
    if not success:
        print("Failed to capture image")
        break

    # Convert the image from BGR to RGB (required for the model)
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Process the image to detect hand landmarks
    results = hands.process(image_rgb)

    # Print the detected hand landmarks, if any
    print(results.multi_hand_landmarks)

    # Display the image
    cv2.imshow("hand tracking", image_rgb)

    # Exit the loop when 'q' is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Explanation:

mp.solutions.hands and Hands(): We are using MediaPipe's Hands solution to detect hand landmarks. This model identifies key points (like joints) on the hand.

results = hands.process(image_rgb): We pass the RGB frame to the model for hand landmark detection. The process() method returns results containing information about the detected hand landmarks, if any.

print(results.multi_hand_landmarks): This prints the detected hand landmarks (21 points per hand) if hands are found. The landmarks include the tips of the fingers, the joints, and the base of the palm.

Displaying the video feed: The video feed is still displayed, but without drawing the detected landmarks yet. We'll add that soon to visualize the results.
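The printed landmark output can look cryptic at first, so here is a small, optional sketch of how you could read one value out of results. It would go inside the while loop, right after the print line, and only does something when a hand is actually detected. The coordinates it prints are normalized values, which we come back to in Step 4:

# Optional: inspect the first detected hand (only if a hand was found)
if results.multi_hand_landmarks:
    first_hand = results.multi_hand_landmarks[0]
    wrist = first_hand.landmark[0]  # landmark 0 is the wrist
    # x and y are normalized to the range 0..1; z is a relative depth value
    print("Wrist position:", wrist.x, wrist.y, wrist.z)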
Output:
This code sets up the basic structure to detect hands. The next step is to draw these landmarks:

import cv2
import mediapipe as mp

cap = cv2.VideoCapture(0)
cap.set(3, 640)
cap.set(4, 480)

mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
mp_draw = mp.solutions.drawing_utils

while True:
    success, image = cap.read()
    if not success:
        print("Failed to capture image")
        break

    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image_rgb)
    # print(results.multi_hand_landmarks)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(image_rgb, hand_landmarks)

    cv2.imshow("hand tracking", image_rgb)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Output:
To connect those points together, just add mp_hands.HAND_CONNECTIONS as a third argument to mp_draw.draw_landmarks(image_rgb, hand_landmarks), so the code will look like:

import cv2
import mediapipe as mp

cap = cv2.VideoCapture(0)
cap.set(3, 640)
cap.set(4, 480)

mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
mp_draw = mp.solutions.drawing_utils

while True:
    success, image = cap.read()
    if not success:
        print("Failed to capture image")
        break

    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image_rgb)
    # print(results.multi_hand_landmarks)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(image_rgb, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    cv2.imshow("hand tracking", image_rgb)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

At the end it will look like:
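Before we move on, note that mp_hands.Hands() also accepts a few optional parameters that control detection. The values below are simply the library defaults written out explicitly, so this is only an illustration; you can keep calling Hands() with no arguments, as we do in this tutorial:

# Optional: configure the Hands model explicitly (these are the defaults)
hands = mp_hands.Hands(
    static_image_mode=False,       # False = treat input as a video stream and track between frames
    max_num_hands=2,               # Detect at most two hands
    min_detection_confidence=0.5,  # Minimum confidence for the palm detection model
    min_tracking_confidence=0.5    # Minimum confidence to keep tracking instead of re-detecting
)

Lower confidence values make detection more permissive but noisier; higher values are stricter.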
Step 4: Image Processing
This step is very important in our project. First, let's talk a little about our hands. As you saw before, our hand has many points called landmarks. These landmarks tell us where different parts of the hand are, like the tips of the fingers or the joints. However, these landmarks are given as decimal values, which don't mean much to us on their own. So, what we need to do now is find the exact position of each point (landmark) in pixel coordinates. Once we have these pixel values, we can use them to:
Know if the fingers are open or closed.
Know if the hand is a right hand or a left hand.
This will help us make our hand tracking more useful for different applications.

In our current for loop (inside if results.multi_hand_landmarks:), we will add this nested for loop. First, create an empty list called lmList before the while loop, so we have somewhere to store the values:

for id, lm in enumerate(hand_landmarks.landmark):
    h, w, c = image_rgb.shape
    cx, cy = int(lm.x * w), int(lm.y * h)
    lmList.append([id, cx, cy])
mp_draw.draw_landmarks(image_rgb, hand_landmarks, mp_hands.HAND_CONNECTIONS)

for id, lm in enumerate(hand_landmarks.landmark):
This loop goes through each landmark (point) on the hand detected by the model. enumerate() gives two things: id (the index number of the landmark) and lm (the actual position of the landmark). There are 21 landmarks on the hand, and this loop goes through each of them, one by one.

h, w, c = image_rgb.shape
This line gets the height (h), width (w), and number of color channels (c) from the image. These values are used to calculate the correct position of the landmarks in pixels. The image shape tells us the size of the image, and we need this information to convert the hand landmarks (which are in decimal form) into actual pixel coordinates on the image.

cx, cy = int(lm.x * w), int(lm.y * h)
Each landmark (lm) has x and y values between 0 and 1. These are normalized values (they tell you the location as a fraction of the image size). To convert them into actual pixel positions (cx, cy), we multiply the x value by the width of the image (w) and the y value by the height of the image (h). int() is used to round the values down to whole numbers, because pixel coordinates must be integers.

lmList.append([id, cx, cy])
This line adds the landmark's id, cx (x coordinate in pixels), and cy (y coordinate in pixels) to a list called lmList. So lmList will hold the information about all the landmarks (their index number and position in pixels).

mp_draw.draw_landmarks(image_rgb, hand_landmarks, mp_hands.HAND_CONNECTIONS)
This command uses MediaPipe's drawing utility to draw the landmarks and the connections between them on the image. mp_hands.HAND_CONNECTIONS defines the lines that connect the landmarks (like the connections between fingers and joints). This step is what visually shows the hand landmarks and connections on the screen.

Step 5: Write a Function for the Landmarks
Purpose of the function: This function, called findPosition(), makes things easier by finding the positions of the hand landmarks (the important points on the hand) and returning them as a list of pixel coordinates. If you want, it can also draw circles on these points to show them visually.

How the function works:

def findPosition(img, draw=False):
This is the function definition. It takes two things as input:
img: the image (or video frame) where we want to find the landmarks.
draw: an optional argument. If set to True, it will draw circles on the detected landmarks.

lmList = []
This creates an empty list called lmList, which will store the coordinates of the hand landmarks.

if results.multi_hand_landmarks:
This checks whether hand landmarks were detected in the image. If hands are detected, the code inside this block runs.

Loop through each detected hand and its landmarks: The function loops through all detected hands in the image (for hand_landmarks in results.multi_hand_landmarks:). Then, it loops through the 21 landmarks on each hand (for id, lm in enumerate(hand_landmarks.landmark):).

Calculate the pixel coordinates of each landmark:
h, w, c = image_rgb.shape: This gets the height, width, and number of color channels from the image.
cx, cy = int(lm.x * w), int(lm.y * h): This converts the landmark's x and y values (which are in decimal form) into pixel coordinates by multiplying them by the width and height of the image.

Store the landmark information:
lmList.append([id, cx, cy]): This adds the landmark's id (its index number) and its pixel coordinates cx and cy to the list lmList.

Draw the landmarks and connections:
mp_draw.draw_landmarks(image_rgb, hand_landmarks, mp_hands.HAND_CONNECTIONS): This draws the hand landmarks and connects them with lines, showing how they are connected (for example, between fingers and joints).

Optionally draw circles:
if draw:: If the draw argument is set to True, it draws a circle on each landmark with a radius of 15 and a purple color ((255, 0, 255)).

Return the landmark list:
return lmList: After processing, the function returns the list lmList, which contains all the detected hand landmarks and their pixel coordinates.

Why this function is useful:
It simplifies the process of finding hand landmarks.
It lets us work with pixel positions instead of raw normalized data.
It can also visually show the landmarks with circles, making it easier to understand where the key points are on the hand.
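As a quick illustration of how the returned list could be used (a hypothetical usage sketch, assuming the function is called once per frame inside the while loop), the id numbers in lmList match MediaPipe's landmark indices, so lmList[8] is the tip of the index finger:

# Hypothetical usage inside the while loop, after results = hands.process(image_rgb)
lmList = findPosition(image_rgb, draw=True)
if len(lmList) != 0:
    # Each entry is [id, cx, cy]; entry 8 is the tip of the index finger
    index_tip_id, index_tip_x, index_tip_y = lmList[8]
    print("Index finger tip is at pixel:", index_tip_x, index_tip_y)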
The script after using the function:

import cv2
import mediapipe as mp


def findPosition(img, draw=False):
    lmList = []
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            for id, lm in enumerate(hand_landmarks.landmark):
                # print(id, lm)
                h, w, c = image_rgb.shape
                cx, cy = int(lm.x * w), int(lm.y * h)
                lmList.append([id, cx, cy])
                # print(lmList)
                if draw:
                    cv2.circle(img, (cx, cy), 15, (255, 0, 255), cv2.FILLED)
            mp_draw.draw_landmarks(image_rgb, hand_landmarks, mp_hands.HAND_CONNECTIONS)
    return lmList


cap = cv2.VideoCapture(0)
cap.set(3, 640)
cap.set(4, 480)

mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
mp_draw = mp.solutions.drawing_utils

while True:
    success, image = cap.read()
    if not success:
        print("Failed to capture image")
        break

    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image_rgb)
    # print(results.multi_hand_landmarks)

    # Call our function to get the landmark pixel positions (and draw them)
    lmList = findPosition(image_rgb)

    cv2.imshow("hand tracking", image_rgb)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

If we take a look again here: before the while True loop, let's declare a list containing the landmark id numbers of our five fingertips: 4, 8, 12, 16, 20. All we have to do now is compare the pixel position of a fingertip id (for example, 8) with the landmark two points below it, id - 2 (for example, 6). If the tip is higher up in the image than that lower joint, the finger is open; otherwise, it is closed. A small sketch of this idea follows.
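Here is a minimal sketch of that comparison, building on the findPosition() function above. The variable names tipIds and fingers are just illustrative choices, and note that the thumb is a special case: it bends sideways, so comparing y values works less well for landmark 4 than for the other fingertips.

tipIds = [4, 8, 12, 16, 20]  # landmark ids of the five fingertips (declare before the while loop)

# Inside the while loop, after lmList = findPosition(image_rgb):
if len(lmList) != 0:
    fingers = []
    for tip in tipIds:
        tip_y = lmList[tip][2]        # y pixel of the fingertip
        lower_y = lmList[tip - 2][2]  # y pixel of the joint two landmarks below
        # A smaller y value means higher up in the image, so the finger is open
        fingers.append(1 if tip_y < lower_y else 0)
    print("Open fingers:", fingers.count(1))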